I have a large CSV file in the following format:
date event-type case event
2020-07-23 00:00:00.000257032 wake-up 0 patient wakes
2020-07-23 10:30:00.005042270 meal 1 patient has breakfast
2020-07-23 10:32:30.088683558 lavatory 2 1st - sample collected
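I have about 600,000 entries like these.
The values in the case column do not exist beforehand.
The question is: how can I insert a unique number into the case column each time the minute in the date column changes, like this: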
date case
2020-07-23 10:30:00.005042270 1
2020-07-23 10:31:00.005042270 2
2020-07-23 10:32:00.005042270 3
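Apart from changes in the minute, all other changes should be ignored. For example, as long as the time in the date column is 10:30, the number entered in the case column stays at 1 until 10:31 appears in the date column.
As a Python beginner, I am not sure how to do this.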
Answered 2020-07-25 15:59:26
Try this:
from datetime import datetime

import pandas as pd

# df is assumed to be the DataFrame already loaded from the CSV
df = df.sort_values('date', ascending=True).reset_index(drop=True)
# Reduce each timestamp string to minute precision; %f accepts at most
# 6 fractional digits, so the last 3 (nanosecond) digits are dropped first
date_to_minute = df['date'].map(lambda d: datetime.strptime(d[:-3], '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d %H:%M'))
previous_date_time = date_to_minute[0]
# if you want the case column to start from 1, change this variable to 1
current_case = 0
cases = []
for current_date_time in date_to_minute:
    # a later minute than the previous row starts a new case
    if current_date_time > previous_date_time:
        current_case += 1
    cases.append(current_case)
    previous_date_time = current_date_time
df['cases'] = pd.Series(cases, name='cases')
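As a quick check of the loop above, here is a minimal self-contained sketch that runs the same logic on a few made-up rows modelled on the question's sample. The `sample` frame and the `assign_cases_loop` helper are illustrative names, not part of the original answer, and the input is assumed to be sorted with exactly 9 fractional digits per timestamp.

```python
from datetime import datetime

import pandas as pd


def assign_cases_loop(dates, start=0):
    """Return one case number per timestamp string.

    The number grows by 1 each time the minute (YYYY-MM-DD HH:MM) increases.
    Assumes sorted input with 9 fractional digits, as in the question.
    """
    cases = []
    current_case = start
    previous_minute = None
    for d in dates:
        # %f accepts at most 6 fractional digits, so drop the last 3
        parsed = datetime.strptime(d[:-3], '%Y-%m-%d %H:%M:%S.%f')
        minute = parsed.strftime('%Y-%m-%d %H:%M')
        if previous_minute is not None and minute > previous_minute:
            current_case += 1
        cases.append(current_case)
        previous_minute = minute
    return cases


# Hypothetical rows: two of them fall within the same minute (10:30)
sample = pd.DataFrame({
    'date': [
        '2020-07-23 00:00:00.000257032',
        '2020-07-23 10:30:00.005042270',
        '2020-07-23 10:30:45.123456789',
        '2020-07-23 10:32:30.088683558',
    ],
})
sample['case'] = assign_cases_loop(sample['date'])
print(sample)
```

The two 10:30 rows share a case number, and the jump from 10:30 to 10:32 advances the counter by one, not two, because the loop counts minute changes rather than elapsed minutes.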
Answered 2020-07-25 15:40:49
I assume your dataframe is sorted by date. Try this:
import pandas as pd

# pandas stores timestamps internally as nanoseconds since the epoch (Jan 1, 1970),
# so first convert them to whole minutes since the epoch
minutes = pd.to_datetime(df['date']).astype('int64') // (60 * 10**9)
# Every new minute starts a new case number (numbering begins at 0)
df['case'] = minutes.diff().gt(0).cumsum()
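If you would rather not depend on the integer representation of the timestamps, a variant of the same idea is to truncate each timestamp to its minute and count how often that value changes from one row to the next. This is only a sketch under the same sorted-input assumption; `assign_cases_by_minute`, its `start` parameter and the `sample` rows are made up for illustration.

```python
import pandas as pd


def assign_cases_by_minute(dates, start=1):
    """Number the rows of a sorted Series of timestamps by minute.

    Every row whose minute differs from the previous row's minute gets
    the next case number; numbering begins at `start`.
    """
    timestamps = pd.to_datetime(dates)
    # Truncate to the minute, e.g. 10:30:45.123 becomes 10:30:00
    floored = timestamps.dt.floor('min')
    # True on the first row and wherever the minute differs from the row before
    is_new_minute = floored.ne(floored.shift())
    return is_new_minute.cumsum() + (start - 1)


# Hypothetical rows: two of them fall within the same minute (10:30)
sample = pd.DataFrame({
    'date': [
        '2020-07-23 00:00:00.000257032',
        '2020-07-23 10:30:00.005042270',
        '2020-07-23 10:30:45.123456789',
        '2020-07-23 10:32:30.088683558',
    ],
})
sample['case'] = assign_cases_by_minute(sample['date'])
print(sample)
```

With `start=1` the first row gets case 1, which matches the numbering in the question. Note that this variant treats any change of minute as a new case, so on unsorted data a step backwards in time would also increment the counter.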
https://stackoverflow.com/questions/63089961