I have a small dataset, like this:
import pandas as pd
import numpy as np
# data's stored in dictionary
details = {
'address_id': [1, 1, 1, 2, 2],
'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att']
}
df = pd.DataFrame(details)
print(df)

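I'm trying to find out whether and when a person switched to a different phone service.
I tried this logic, but it didn't work:
df['new'] = df.Column1.isin(df.Column1) & df[~df.Column2.isin(df.Column2)]
Basically, given index rows 0 and 1, where address_id is the same, the business is the same, but in index row 2 the business changes from verizon to comcast. Likewise, for index rows 3 and 4 the address_id is the same, but the business changes from sprint to att in index row 4. I'd like to add a new column to the dataframe to flag these changes. How can I do this?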
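Posted on 2022-08-29 22:29:04
UPDATE: A comment by @rickhg12hs points out that my earlier answers (see below), while they detect when a person switches to a new phone service, do not handle the case where a person switches back to a previous service.
To handle that possibility, we have to use logic similar to that in another answer (@Pranav Hosangadi), though I've done it slightly differently: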
df['new'] = ( df
.groupby('address_id', sort=False)
.apply(lambda x: x.business != x.business.shift().bfill())
.reset_index(0).business )
Input:
   address_id business
0           1  verizon
1           1  verizon
2           1  comcast
3           2   sprint
4           2      att
5           2   sprint
Output:
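   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True
5           2   sprint   True
Performance comparison: Below is test code with 600k rows and 5 columns. The results show that PREVIOUS UPDATE takes about 0.1 seconds to identify 333815 rows with new==True, while UPDATE takes about 35 seconds to find 335334 True rows, which reflects about 0.5% additional rows where a person switched phone service and then switched back.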
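from datetime import datetime

import numpy as np
import pandas as pd

rng = np.random.default_rng()
details = {
'address_id': rng.integers(1,100_000, size=600_000),
# note: the upper bound of rng.integers() is exclusive, so index 3 ('att') is never drawn here
'business': [['verizon','comcast','sprint','att'][i] for i in rng.integers(0,3, size=600_000)],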
'foo': 1,
'bar': 2
}
df = pd.DataFrame(details)
print('groupby() ...')
start = datetime.now()
x = ( df
.groupby('address_id', sort=False) )
print(f'... complete after {datetime.now() - start} time elapsed.')
print('apply() ...')
start = datetime.now()
x = ( x
.apply(lambda x: x.business != x.business.shift().bfill()) )
print(f'... complete after {datetime.now() - start} time elapsed.')
print('reset_index() ...')
start = datetime.now()
df['new'] = ( x
.reset_index(0).business )
print(f'... complete after {datetime.now() - start} time elapsed.')
print(df)
print('rows with "new" == True', df.new.sum())
df = pd.DataFrame(details)
print('PREVIOUS UPDATE() ...')
start = datetime.now()
df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business
print(f'... complete after {datetime.now() - start} time elapsed.')
print('rows with "new" == True', df.new.sum())
Results:
groupby() ...
... complete after 0:00:00 time elapsed.
apply() ...
... complete after 0:00:33.541322 time elapsed.
reset_index() ...
... complete after 0:00:00.040942 time elapsed.
        address_id business  foo  bar    new
0            20223   sprint    1    2  False
1            29297  comcast    1    2  False
2            92489  comcast    1    2  False
3            29297  verizon    1    2   True
4            98901  comcast    1    2  False
...            ...      ...  ...  ...    ...
599995       29823  comcast    1    2   True
599996       39328  comcast    1    2   True
599997       27594  comcast    1    2  False
599998       14903   sprint    1    2   True
599999       87375  verizon    1    2   True
[600000 rows x 5 columns]
rows with "new" == True 335334
PREVIOUS UPDATE() ...
... complete after 0:00:00.097930 time elapsed.
rows with "new" == True 333815
PREVIOUS UPDATE: Here is a simpler way than my original answer, which used join() (see below), to do what your question asks:
df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business
Explanation:
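Use groupby() and first() to create a Series containing the first business encountered for each address_id.
Use Series.map() to transform the original dataframe's address_id column into this first business value.
Add column new, which is True only if this first business differs from the row's own business column.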
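The map()-based one-liner above can be broken into its intermediate steps on the question's sample data. This is only a minimal sketch: the names first_business and mapped_first are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att'],
})

# Step 1: first business seen for each address_id (a Series indexed by address_id)
first_business = df.groupby('address_id')['business'].first()

# Step 2: look up that first business for every row of the original dataframe
mapped_first = df['address_id'].map(first_business)

# Step 3: flag rows whose business differs from the first one for that address
df['new'] = mapped_first != df['business']
print(df)
```

Because every row is compared with the first business of its address_id rather than with the previous row, a switch back to the first service would not be flagged by this version.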
ORIGINAL SOLUTION
Here is a simple way to do what you've asked using groupby() and join():
df = df.join(df.groupby('address_id').first(), on='address_id', rsuffix='_first')
df = df.assign(new=df.business != df.business_first).drop(columns='business_first')
Output:
   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True
Explanation:
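Use groupby() and first() to create a dataframe whose business column contains the first business encountered for each address_id.
Use join() to add a column business_first to df containing the corresponding first business for each address_id.
Use assign() to add a column new containing a boolean indicating whether the row contains a new business for an existing address_id.
Use drop() to eliminate the business_first column.

Posted on 2022-08-29 22:15:39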
First, groupby on address_id.
groups = df.groupby("address_id")
Then, iterate over the groups and find where the business value changes:
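for address_id, grp_data in groups:
    # compare each row's business with the previous row's business in the same group
    changed = grp_data['business'].ne(grp_data['business'].shift().bfill())
    df.loc[grp_data.index, "changed"] = changed
.shift().bfill() shifts all the data by one index position (0 -> 1, 1 -> 2, and so on) and then backfills the first value. For example: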
>>> df["business"]
0 verizon
1 verizon
2 comcast
3 sprint
4 att
Name: business, dtype: object
>>> df["business"].shift()
0 NaN
1 verizon
2 verizon
3 comcast
4 sprint
Name: business, dtype: object
>>> df["business"].shift().bfill()
0 verizon
1 verizon
2 verizon
3 comcast
4 sprint
Name: business, dtype: object
Running the loop produces the following dataframe:
   address_id business changed
0           1  verizon   False
1           1  verizon   False
2           1  comcast    True
3           2   sprint   False
4           2      att    True
https://stackoverflow.com/questions/73535107
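The previous-row comparison used in the per-group loop can also be sketched without explicit iteration. This is an alternative labeled as an assumption, not code from either answer: it relies on the grouped shift() leaving NaN in the first row of each group, and fills that NaN with the row's own value so the first row never counts as a change. The name prev_business is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att', 'sprint'],
})

# Previous business within the same address_id; NaN for each group's first row
prev_business = df.groupby('address_id')['business'].shift()

# Treat the first row of each group as unchanged by comparing it with itself
df['changed'] = df['business'] != prev_business.fillna(df['business'])
print(df)
```

On this six-row input, which includes a switch back to sprint, the last row is flagged as a change as well.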