I have the following pandas DataFrame:
Date Variable
2018-04-10 21:05:00 a
2018-04-10 21:05:00 a
2018-04-10 21:10:00 b
2018-04-10 21:15:00 a
2018-04-10 21:35:00 b
2018-04-10 21:45:00 a
2018-04-10 21:45:00 a
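My goal is to count, for each row, how many rows containing 'a' fall within the 30 minutes before it and within the 30 minutes after it (including rows with the same timestamp, but excluding the row being analyzed itself). I then want to do the same for every value of Variable. For Variable a, my final result would look like this: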
Date nr_30_min_bef_a nr_30_min_after_a
2018-04-10 21:05:00 1 2
2018-04-10 21:05:00 1 2
2018-04-10 21:10:00 2 1
2018-04-10 21:15:00 2 2
2018-04-10 21:35:00 3 2
2018-04-10 21:45:00 2 1
2018-04-10 21:45:00 2 1
I tried a for loop that iterates over all rows, but the full series has more than one million rows, so I am looking for a more efficient solution.
import pandas as pd
df = pd.DataFrame({'Date': ['2018-04-10 21:05:00',
'2018-04-10 21:05:00',
'2018-04-10 21:10:00',
'2018-04-10 21:15:00',
'2018-04-10 21:35:00',
'2018-04-10 21:45:00',
'2018-04-10 21:45:00'],
'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a']})
Thanks in advance.
Posted on 2018-06-11 03:48:57
Building on this previous answer, you could use
import pandas as pd
df = pd.DataFrame({'Date': ['2018-04-10 21:05:00',
'2018-04-10 21:05:00',
'2018-04-10 21:10:00',
'2018-04-10 21:15:00',
'2018-04-10 21:35:00',
'2018-04-10 21:45:00',
'2018-04-10 21:45:00'],
'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a']})
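# Parse the strings into timestamps so that time-based rolling windows work.
df['Date'] = pd.to_datetime(df['Date'])
# One row per distinct timestamp, one column per Variable value; cells hold counts.
freq_table = pd.crosstab(index=df['Date'], columns=df['Variable'])
# Sum the counts over [t - 30 min, t]; closed='both' includes both endpoints.
df_bef = freq_table.rolling('30T', closed='both').sum().astype(int)
# Subtract 1 wherever the variable occurs at t, so the current row is not counted.
is_current = (freq_table != 0).astype(int)
df_bef -= is_current
df_bef.columns = ['nr_30_min_bef_{}'.format(col) for col in df_bef.columns]
result = pd.merge(df, df_bef, left_on='Date', right_index=True)
# For the "after" counts, mirror the time axis so that a backward-looking
# window on the mirrored dates covers [t, t + 30 min] on the real dates.
max_date = df['Date'].max()
min_date = df['Date'].min()
pseudo_dates = (max_date - df['Date'])[::-1] + min_date
freq_table_reversed = pd.crosstab(index=pseudo_dates, columns=df['Variable'])
df_after = freq_table_reversed.rolling('30T', closed='both').sum().astype(int)
# Flip the rows back so they line up with the original timestamps.
df_after = pd.DataFrame(df_after.values[::-1], index=freq_table.index,
columns=df_after.columns)
df_after -= is_current
df_after.columns = ['nr_30_min_after_{}'.format(col) for col in df_after.columns]
result = pd.merge(result, df_after, left_on='Date', right_index=True)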
print(result)
which prints
Date Variable nr_30_min_bef_a nr_30_min_bef_b nr_30_min_after_a nr_30_min_after_b
0 2018-04-10 21:05:00 a 1 0 2 2
1 2018-04-10 21:05:00 a 1 0 2 2
2 2018-04-10 21:10:00 b 2 0 1 1
3 2018-04-10 21:15:00 a 2 1 2 1
4 2018-04-10 21:35:00 b 3 1 2 0
5 2018-04-10 21:45:00 a 2 1 1 0
6 2018-04-10 21:45:00 a 2 1 1 0
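The columns for a in the table above can be cross-checked against a brute-force loop on the sample data. This is only a sketch for validating small inputs, not part of the original answer: the helper name count_neighbours is hypothetical, and the loop is quadratic, so it would not scale to a million rows.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2018-04-10 21:05:00',
                                           '2018-04-10 21:05:00',
                                           '2018-04-10 21:10:00',
                                           '2018-04-10 21:15:00',
                                           '2018-04-10 21:35:00',
                                           '2018-04-10 21:45:00',
                                           '2018-04-10 21:45:00']),
                   'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a']})
window = pd.Timedelta(minutes=30)


def count_neighbours(frame, variable, window):
    """Hypothetical helper: for each row, count the *other* rows whose
    Variable equals `variable` within `window` before and after it."""
    before, after = [], []
    for i, t in enumerate(frame['Date']):
        # Exclude only the row under analysis; same-timestamp rows stay in.
        others = frame.drop(frame.index[i])
        dates = others.loc[others['Variable'] == variable, 'Date']
        before.append(int(((dates >= t - window) & (dates <= t)).sum()))
        after.append(int(((dates >= t) & (dates <= t + window)).sum()))
    return before, after


bef_a, aft_a = count_neighbours(df, 'a', window)
print(bef_a)  # [1, 1, 2, 2, 3, 2, 2]
print(aft_a)  # [2, 2, 1, 2, 2, 1, 1]
```

Note that the loop excludes only the row itself, whereas subtracting is_current lowers the count of a variable for every row at a timestamp where that variable occurs. The two agree here because no timestamp in the sample mixes a and b.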
The main new idea is to use pd.crosstab to build a frequency table:
freq_table = pd.crosstab(index=df['Date'], columns=df['Variable'])
# Variable a b
# Date
# 2018-04-10 21:05:00 2 0
# 2018-04-10 21:10:00 0 1
# 2018-04-10 21:15:00 1 0
# 2018-04-10 21:35:00 0 1
# 2018-04-10 21:45:00 2 0
Then sum the counts within each rolling window:
df_bef = freq_table.rolling('30T', closed='both').sum().astype(int)
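To see why closed='both' matters, here is a small sketch on a made-up series (the timestamps and the names events, right_closed and both_closed are illustrative assumptions). For offset-based windows the default is closed='right', which would drop a row lying exactly 30 minutes earlier. The alias '30min' is used as an equivalent spelling of '30T'; recent pandas releases appear to prefer it.

```python
import pandas as pd

# Hypothetical toy series: one event per timestamp, exactly 30 minutes apart.
idx = pd.to_datetime(['2018-04-10 21:00:00',
                      '2018-04-10 21:30:00',
                      '2018-04-10 22:00:00'])
events = pd.Series([1, 1, 1], index=idx)

# Default for offset windows is closed='right': the window is (t - 30min, t].
right_closed = events.rolling('30min').sum().astype(int)

# closed='both' makes the window [t - 30min, t], so the left edge counts too.
both_closed = events.rolling('30min', closed='both').sum().astype(int)

print(right_closed.tolist())  # [1, 1, 1]
print(both_closed.tolist())   # [1, 2, 2]
```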
Since you want to exclude the current row from the count, is_current is subtracted from df_bef:
is_current = (freq_table != 0).astype(int)
df_bef -= is_current
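The "after" counts rely on the same rolling sum applied to a mirrored time axis, a step the full code performs via pseudo_dates. The following is a minimal sketch of that idea on a single column, assuming a made-up series of per-timestamp counts; the names counts, mirrored and after_incl_current are illustrative and not from the answer.

```python
import pandas as pd

dates = pd.to_datetime(['2018-04-10 21:05:00',
                        '2018-04-10 21:15:00',
                        '2018-04-10 21:45:00'])
# Assumed per-timestamp counts of one variable (e.g. column 'a' of a crosstab).
counts = pd.Series([2, 1, 2], index=dates)

# Mirror the time axis: the latest timestamp becomes the earliest and vice versa.
mirrored_index = dates.max() - dates + dates.min()
mirrored = pd.Series(counts.values, index=mirrored_index).sort_index()

# A backward-looking window on the mirrored axis corresponds to a
# forward-looking window [t, t + 30min] on the original axis.
rolled = mirrored.rolling('30min', closed='both').sum().astype(int)

# Flip the values back into the original chronological order.
after_incl_current = pd.Series(rolled.values[::-1], index=dates)
print(after_incl_current.tolist())  # [3, 3, 2]
```

These sums still include the rows at the current timestamp, so the same is_current subtraction would follow before the columns are renamed and merged.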
https://stackoverflow.com/questions/50786484