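I have a list: list = ['199.72.81.55', 'burger.letters.com']. I want to get the matching rows from my dataframe. For example, when I search for burger.letters.com, the dataframe should return the host and timestamp for burger.letters.com. I tried df.ix[host] for host in list, but because that loops df.ix[host] over a dataframe with 0.4 billion rows, it takes more than 30 minutes.
When I run the code below, it takes forever.
Here is what my dataframe looks like: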
   host                timestamp
0  199.72.81.55        01/Jul/1995:00:00:01
2  199.72.81.55        01/Jul/1995:00:00:09
3  burger.letters.com  01/Jul/1995:00:00:11
4  199.72.81.55        01/Jul/1995:00:00:12
5  199.72.81.55        01/Jul/1995:00:00:13
6  199.72.81.55        01/Jul/1995:00:00:14
8  burger.letters.com  01/Jul/1995:00:00:15
9  199.72.81.55        01/Jul/1995:00:00:15

The output I want is as follows:
for host in hostlist:
    df.ix[host]

This returns the output below, but it is too slow because I have 0.4 billion rows, and I want to optimize it.
df.ix['burger.letters.com']
   host                timestamp
3  burger.letters.com  01/Jul/1995:00:00:11
8  burger.letters.com  01/Jul/1995:00:00:15

df.ix['199.72.81.55']
   host          timestamp
0  199.72.81.55  01/Jul/1995:00:00:01
2  199.72.81.55  01/Jul/1995:00:00:09
4  199.72.81.55  01/Jul/1995:00:00:12
5  199.72.81.55  01/Jul/1995:00:00:13
6  199.72.81.55  01/Jul/1995:00:00:14
9  199.72.81.55  01/Jul/1995:00:00:15

Here is my code:
# takes more than 30 minutes
def block(host):
    temp_df = failedIP_df.ix[host]
    if len(temp_df) > 3:
        time_values = temp_df.set_index(keys='index')['timestamp']
        # if the third request came within 20 seconds of the first,
        # record the index of every request from the fourth onward
        if (return_seconds(time_values[2:3].values[0]) - return_seconds(time_values[0:1].values[0])) <= 20:
            blocked_host.append(time_values[3:].index.tolist())

list(map(block, failedIP_list))

I would really appreciate it if anyone could help.
Posted on 2017-04-04 01:36:55
Your question is vague. Here is what I think you want:
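def my_function(df):
    # this function should operate on a dataframe
    # that is a subset of your original
    dfcopy = df.copy()  # placeholder: change dfcopy as needed
    return dfcopy

new_df = (
    df.groupby(by=['host'])
      .filter(lambda g: g.shape[0] > 3)
      .groupby(by=['host'])
      .apply(my_function)
)

groupby/filter drops the groups that have 3 or fewer rows, since the lambda keeps only groups with more than 3. We then use groupby/apply to operate on each remaining group of rows that share the same host value.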
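A minimal sketch of how this could look on the sample rows from the question, assuming host is an ordinary column (as in the printed frame) and that the timestamps parse with the format string shown. The names wanted, ts and blocked are illustrative. pd.to_datetime stands in for the question's return_seconds helper, which was not shown. The 20-second rule is my reading of the block function in the question, not something the asker confirmed.

```python
import pandas as pd

# Sample rows copied from the question; `host` is assumed to be a regular column.
df = pd.DataFrame(
    {
        "host": [
            "199.72.81.55", "199.72.81.55", "burger.letters.com",
            "199.72.81.55", "199.72.81.55", "199.72.81.55",
            "burger.letters.com", "199.72.81.55",
        ],
        "timestamp": [
            "01/Jul/1995:00:00:01", "01/Jul/1995:00:00:09",
            "01/Jul/1995:00:00:11", "01/Jul/1995:00:00:12",
            "01/Jul/1995:00:00:13", "01/Jul/1995:00:00:14",
            "01/Jul/1995:00:00:15", "01/Jul/1995:00:00:15",
        ],
    },
    index=[0, 2, 3, 4, 5, 6, 8, 9],
)

hostlist = ["199.72.81.55", "burger.letters.com"]

# One vectorized pass keeps only the rows whose host is in the list,
# instead of one lookup per host.
wanted = df[df["host"].isin(hostlist)].copy()

# Parse the log timestamps once (assumed format: day/month-abbrev/year:H:M:S).
wanted["ts"] = pd.to_datetime(wanted["timestamp"], format="%d/%b/%Y:%H:%M:%S")

# Walk the groups once. For hosts with more than 3 rows, if the third request
# came within 20 seconds of the first, collect the index labels from the
# fourth row onward (kept as a flat list here, unlike the nested list in the
# question's code).
blocked = []
for host, group in wanted.groupby("host"):
    if len(group) > 3:
        ts = group["ts"]
        if (ts.iloc[2] - ts.iloc[0]).total_seconds() <= 20:
            blocked.extend(group.index[3:].tolist())
```

If the per-host frames themselves are what you need, wanted[wanted["host"] == "burger.letters.com"] gives the same rows that df.ix['burger.letters.com'] printed in the question.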
https://stackoverflow.com/questions/43190584