I have a dataset that shows who booked which room and when. It looks like this:
email room Start Date End Date
abc@corp.com L11M2 2021-02-01 08:00:00 2021-02-01 11:00:00
xyz@corp.com L12M4 2021-02-01 08:00:00 2021-02-01 10:00:00
I want to split it into hourly intervals so that each row covers exactly one hour. This is the DataFrame I want:
email room Start Date End Date
abc@corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
abc@corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
abc@corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
xyz@corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
xyz@corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
Is there a way to do this in Python?
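The answers that follow operate on a DataFrame named df, which the question shows but never constructs. A minimal sketch of building it from the sample rows above, assuming the two date columns should be parsed as datetimes rather than kept as strings (the column names come from the question; the construction itself is only illustrative):

```python
import pandas as pd

# Sample bookings copied from the question
df = pd.DataFrame(
    {
        "email": ["abc@corp.com", "xyz@corp.com"],
        "room": ["L11M2", "L12M4"],
        "Start Date": ["2021-02-01 08:00:00", "2021-02-01 08:00:00"],
        "End Date": ["2021-02-01 11:00:00", "2021-02-01 10:00:00"],
    }
)

# Assumption: the columns hold strings at first, so parse them into datetimes
for col in ["Start Date", "End Date"]:
    df[col] = pd.to_datetime(df[col])
```

With datetime columns in place, pd.date_range and Timedelta arithmetic can be applied to the rows directly.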
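Answered 2021-08-27 10:03:22
Here is a simple solution using pandas.date_range and explode:
import pandas as pd

# For each row, build the hourly start times; [:-1] drops the end boundary
df['Start Date'] = df.apply(lambda d: pd.date_range(d['Start Date'],
                                                    d['End Date'],
                                                    freq='h')[:-1],
                            axis=1)
# One row per hourly start time
df = df.explode('Start Date')
# Each interval ends one hour after it starts
df['End Date'] = df['Start Date'] + pd.Timedelta('1h')
Output: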
email room Start Date End Date
0 abc@corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
0 abc@corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
0 abc@corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
1 xyz@corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
1 xyz@corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
Answered 2021-08-27 09:37:27
Combining pandas melt with pyjanitor's complete can help reshape the data:
# pip install pyjanitor
import pandas as pd
import janitor
(df.melt(['email', 'room'], value_name='Start_Date')
   .reindex([3, 1, 2, 0])
   # complete is a wrapper around pandas functions
   # to expose missing values ... in this case it exposes the
   # missing dates for each group in by
   .complete([{'Start_Date': lambda df: pd.date_range(df.min(), df.max(), freq='H')}],
             by=['email', 'room'])
   .assign(End_Date=lambda df: df.Start_Date.add(pd.Timedelta('1 hour')))
   .query('variable != "End Date"').drop(columns='variable'))
email room Start_Date End_Date
0 abc@corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
1 abc@corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
2 abc@corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
4 xyz@corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
5 xyz@corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
Answered 2021-08-27 10:01:25
Let's create some sample data:
from datetime import datetime, timedelta
import pandas as pd

# Reference time: the current time truncated to the hour
ref = datetime.now().replace(minute=0, second=0, microsecond=0)
def shifted(i): return ref + timedelta(hours=i)
df = pd.DataFrame([
    ('A', 'B', shifted(1), shifted(10)),
    ('C', 'D', shifted(-5), shifted(-1))],
    columns=['name', 'email', 'start', 'end'])
The data looks like this:
name email start end
0 A B 2021-08-27 12:00:00 2021-08-27 21:00:00
1 C D 2021-08-27 06:00:00 2021-08-27 10:00:00
You can split each row using apply, making sure to return a pd.Series:
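new_start = df.apply(lambda row: pd.Series(pd.date_range(row.start, row.end, freq='h')), axis=1).stack()
After this, new_start holds the start of each hour and has a two-level index: the first level is the original row index and the second is the position of the block within that row, which may also be useful.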
0  0   2021-08-27 12:00:00
   1   2021-08-27 13:00:00
   2   2021-08-27 14:00:00
   3   2021-08-27 15:00:00
   4   2021-08-27 16:00:00
   5   2021-08-27 17:00:00
   6   2021-08-27 18:00:00
   7   2021-08-27 19:00:00
   8   2021-08-27 20:00:00
   9   2021-08-27 21:00:00
1  0   2021-08-27 06:00:00
   1   2021-08-27 07:00:00
   2   2021-08-27 08:00:00
   3   2021-08-27 09:00:00
   4   2021-08-27 10:00:00
dtype: datetime64[ns]
Now just join this back to the original frame:
res = df[["name", "email"]].join(
    new_start.reset_index(1, drop=True).rename("start"))
You can add the end column like this:
res["end"] = res.start + timedelta(hours=1)
https://stackoverflow.com/questions/68950904
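To try the date_range and explode idea end to end on the question's data, here is a self-contained sketch. The helper name split_hourly and its parameters are hypothetical, not part of pandas or of any answer above. It assumes every booking starts and ends on a whole hour, and it resets the index so the exploded rows do not share index labels.

```python
import pandas as pd


def split_hourly(frame, start_col="Start Date", end_col="End Date"):
    """Hypothetical helper: return one row per whole hour of each booking."""
    out = frame.copy()
    # Hourly start times per row; [:-1] drops the end boundary itself
    out[start_col] = [
        list(pd.date_range(start, end, freq="h")[:-1])
        for start, end in zip(out[start_col], out[end_col])
    ]
    # One row per hourly start; ignore_index avoids duplicated index labels
    out = out.explode(start_col, ignore_index=True)
    # explode yields an object column, so convert it back to datetimes
    out[start_col] = pd.to_datetime(out[start_col])
    # Each interval ends one hour after it starts
    out[end_col] = out[start_col] + pd.Timedelta(hours=1)
    return out


# Sample bookings from the question
df = pd.DataFrame(
    {
        "email": ["abc@corp.com", "xyz@corp.com"],
        "room": ["L11M2", "L12M4"],
        "Start Date": pd.to_datetime(["2021-02-01 08:00:00", "2021-02-01 08:00:00"]),
        "End Date": pd.to_datetime(["2021-02-01 11:00:00", "2021-02-01 10:00:00"]),
    }
)

hourly = split_hourly(df)
print(hourly)
```

Dropping the last element of each range means the final interval of a booking ends at the booking's own end time. Keeping the end boundary as a start time would add one extra hour per booking.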