Here is the data I have:
timediff text
2018-06-19 01:00:00 A
2018-06-19 01:00:01
2018-06-19 01:00:02
2018-06-19 01:00:03
2018-06-19 02:00:00
2018-06-19 02:00:01 B
2018-06-19 02:00:02
2018-06-19 02:00:03
2018-06-19 02:15:00
2018-06-19 02:15:01
2018-06-19 02:15:02
2018-06-19 02:15:03 C
2018-06-19 02:30:00
2018-06-19 02:30:01
2018-06-19 02:30:02
2018-06-19 02:30:03 D
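I want to fill in the text column based on the timediff column. If the timestamps are within 1-5 seconds of each other, I want them all to get the same text. For example, the first four entries fall within 4-5 seconds of each other, so I need to fill text with "A" for all of them. I want to fill all the other rows in the same way. The output should look like this: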
timediff text
2018-06-19 01:00:00 A
2018-06-19 01:00:01 A
2018-06-19 01:00:02 A
2018-06-19 01:00:03 A
2018-06-19 02:00:00 B
2018-06-19 02:00:01 B
2018-06-19 02:00:02 B
2018-06-19 02:00:03 B
2018-06-19 02:15:00 C
2018-06-19 02:15:01 C
2018-06-19 02:15:02 C
2018-06-19 02:15:03 C
2018-06-19 02:30:00 D
2018-06-19 02:30:01 D
2018-06-19 02:30:02 D
2018-06-19 02:30:03 D
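Since the rows in each group all occur within a few seconds of each other, I want to fill the column on that basis.
I tried taking the time difference in both directions, and I tried the backfill and forward-fill options, but neither worked for me.
Can someone help me with this?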
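Posted on 2018-06-20 06:53:02
I'm not sure what your data looks like beyond what you posted, so I had to make some assumptions (empty values are set to '' and the datetime strings are converted to datetimes). This should get you on the right track: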
import pandas as pd

dat = [
['2018-06-19 01:00:00', 'A'],
['2018-06-19 01:00:01', ''],
['2018-06-19 01:00:02', ''],
['2018-06-19 01:00:03', ''],
['2018-06-19 02:00:00', ''],
['2018-06-19 02:00:01', 'B'],
['2018-06-19 02:00:02', ''],
['2018-06-19 02:00:03', ''],
['2018-06-19 02:15:00', ''],
['2018-06-19 02:15:01', ''],
['2018-06-19 02:15:02', ''],
['2018-06-19 02:15:03', 'C'],
['2018-06-19 02:30:00', ''],
['2018-06-19 02:30:01', ''],
['2018-06-19 02:30:02', ''],
['2018-06-19 02:30:03', 'D']]
df = pd.DataFrame(dat, columns=['timediff', 'text'])
df['timediff'] = pd.to_datetime(df['timediff'])
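# Keep only the rows that already carry a label.
dff = df[df['text'] != '']
# For each row, take the text of the labelled row closest in time.
# idxmin() returns an index label, so it is used with .loc (argmin() returns a position in current pandas).
df['text_new'] = df.apply(lambda dr: dff.loc[(dff['timediff'] - dr['timediff']).abs().idxmin(), 'text'], axis=1)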
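The same nearest-label lookup can also be sketched with pd.merge_asof and direction='nearest', which avoids the row-by-row apply. This is only a sketch: the sample data is rebuilt in a compact way, the names starts, times, labels and merged are made up for illustration, and the 5-second tolerance is an assumption taken from the "within 1-5 seconds" wording in the question.

```python
import pandas as pd

# Rebuild the sample data: four bursts of four timestamps, one label each.
starts = ['2018-06-19 01:00:00', '2018-06-19 02:00:00',
          '2018-06-19 02:15:00', '2018-06-19 02:30:00']
times = [pd.Timestamp(s) + pd.Timedelta(seconds=k) for s in starts for k in range(4)]
text = ['A', '', '', '',  '', 'B', '', '',  '', '', '', 'C',  '', '', '', 'D']
df = pd.DataFrame({'timediff': times, 'text': text})

# Labelled rows only; rename so the merged column does not clash with 'text'.
labels = df.loc[df['text'] != '', ['timediff', 'text']].rename(columns={'text': 'text_new'})

# Both inputs must be sorted on the key. tolerance is an assumption (see above);
# rows with no labelled row within 5 seconds would get NaN in text_new.
merged = pd.merge_asof(
    df.sort_values('timediff'),
    labels.sort_values('timediff'),
    on='timediff',
    direction='nearest',
    tolerance=pd.Timedelta(seconds=5),
)
print(merged)
```

If rows outside the tolerance should still get the closest label, the tolerance argument can simply be dropped.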
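Alternatively, for large amounts of data, the following loop should be faster than the apply-based version (though it is not very elegant).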
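# Use the same code as above up to the point where dff is defined.
dff = df[df['text'] != ''].copy()  # Making a copy is safer here
df['text_new'] = ''
j = 0  # position in dff of the current candidate labelled row
for i in range(len(df)):
    # Advance j while the next labelled row is still earlier than row i.
    # The bounds check avoids an IndexError once the last labelled row is reached.
    while j + 1 < len(dff) and dff.iloc[j+1]['timediff'] < df.loc[i, 'timediff']:
        j += 1
    # Of the (at most two) candidate labelled rows, pick the one closest in time.
    df.loc[i, 'text_new'] = dff.loc[(dff.iloc[j:j+2]['timediff'] - df.loc[i, 'timediff']).abs().idxmin()]['text']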
Resulting df:
               timediff text text_new
0   2018-06-19 01:00:00    A        A
1   2018-06-19 01:00:01             A
2   2018-06-19 01:00:02             A
3   2018-06-19 01:00:03             A
4   2018-06-19 02:00:00             B
5   2018-06-19 02:00:01    B        B
6   2018-06-19 02:00:02             B
7   2018-06-19 02:00:03             B
8   2018-06-19 02:15:00             C
9   2018-06-19 02:15:01             C
10  2018-06-19 02:15:02             C
11  2018-06-19 02:15:03    C        C
12  2018-06-19 02:30:00             D
13  2018-06-19 02:30:01             D
14  2018-06-19 02:30:02             D
15  2018-06-19 02:30:03    D        D
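Since the question describes the rows as bursts that occur within a few seconds of each other, another way to look at it is to group rows by time gap and spread each burst's label over the whole burst. The following is a minimal sketch under two assumptions that are not stated in the original: a gap of more than 5 seconds starts a new burst, and each burst contains exactly one labelled row. The names is_new_burst, burst and labels are made up for illustration.

```python
import pandas as pd

# Rebuild the sample data: four bursts of four timestamps, one label each.
starts = ['2018-06-19 01:00:00', '2018-06-19 02:00:00',
          '2018-06-19 02:15:00', '2018-06-19 02:30:00']
times = [pd.Timestamp(s) + pd.Timedelta(seconds=k) for s in starts for k in range(4)]
text = ['A', '', '', '',  '', 'B', '', '',  '', '', '', 'C',  '', '', '', 'D']
df = pd.DataFrame({'timediff': times, 'text': text})

# A new burst starts whenever the gap to the previous row exceeds 5 seconds
# (assumed threshold). The first row has no previous row, so it compares as False.
is_new_burst = df['timediff'].diff() > pd.Timedelta(seconds=5)
burst = is_new_burst.cumsum()  # 0, 0, 0, 0, 1, 1, ...

# Turn '' into NaN, then broadcast the first non-missing label of each burst.
labels = df['text'].where(df['text'] != '')
df['text_new'] = labels.groupby(burst).transform('first')
print(df)
```

Unlike the nearest-label lookup, this never borrows a label from a neighbouring burst; a burst without any label would simply stay empty (NaN).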
https://stackoverflow.com/questions/50937822