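I am compiling a table of the top three crops by county. Some counties have the same crops in the same order. Other counties have the same crops in a different order.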
import pandas as pd
df1 = pd.DataFrame( {
"County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
"Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
"Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
"Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
"Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
I can group by Crop1, Crop2, and Crop3 and get the sum of Total_pop:
df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()
This gives the total for each specific ordered combination of crops:
df1_grouped
apples melons grain 1500
grain melons apples 2000
melons grain apples 4500
pears carrots raddish 6700
raddish pears carrots 2500
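However, what I want is the total for each distinct combination of crops, regardless of whether a crop is listed as Crop1, Crop2, or Crop3. The expected result is: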
apples melons grain 8000
pears carrots raddish 9200
Thanks for any guidance.
Posted on 2019-02-17 20:54:06
Method 1:
Combine the crop columns into a single list column
>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
... x['Crop2'],
... x['Crop3']]),axis=1)
>>> df1.head()
County Crop1 Crop2 Crop3 Total_pop combined_temp
0 Harney grain melons apples 2000 [grain, melons, apples]
1 Baker melons grain apples 1500 [melons, grain, apples]
2 Wheeler melons grain apples 3000 [melons, grain, apples]
3 Hood River apples melons grain 1500 [apples, melons, grain]
4 Wasco pears carrots raddish 2000 [pears, carrots, raddish]
Turn it into a sorted tuple
>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
>>> df1.head()
County Crop1 Crop2 ... Total_pop combined_temp sorted
0 Harney grain melons ... 2000 [grain, melons, apples] (apples, grain, melons)
1 Baker melons grain ... 1500 [melons, grain, apples] (apples, grain, melons)
2 Wheeler melons grain ... 3000 [melons, grain, apples] (apples, grain, melons)
3 Hood River apples melons ... 1500 [apples, melons, grain] (apples, grain, melons)
4 Wasco pears carrots ... 2000 [pears, carrots, raddish] (carrots, pears, raddish)
Then do a normal groupby on the sorted tuple.
>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
>>> df1_grouped
sorted Total_pop
0 (apples, grain, melons) 8000
1 (carrots, pears, raddish) 9200
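The two helper columns in Method 1 can be folded into one step. The following is a minimal sketch, assuming the question's data; the names `crop_cols`, `combo`, and `totals` are illustrative and not from the original answer. It builds one sorted tuple per row and groups on it without adding columns to df1.

```python
import pandas as pd

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
    "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
    "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
    "Crop3": ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
    "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000],
})
crop_cols = ["Crop1", "Crop2", "Crop3"]  # illustrative name

# One order-independent key per row: the row's crops as a sorted tuple.
combo = pd.Series(
    [tuple(sorted(row)) for row in df1[crop_cols].to_numpy()],
    index=df1.index,
    name="combo",
)

# Group on the key Series directly; df1 itself is left unchanged.
totals = df1.groupby(combo)["Total_pop"].sum()
print(totals)
# (apples, grain, melons) sums to 8000, (carrots, pears, raddish) to 9200
```

Because the key is a tuple, it stays hashable and can be used as a group label, which a plain list cannot.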
Method 2: a shorter version, based on aws-apprentice's answer
df = df1.copy()
grouping_cols = ['Crop1', 'Crop2', 'Crop3']
df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
.apply(set, axis=1) \
.apply(sorted) \
.values \
.tolist(), columns=grouping_cols)
>>> df.head()
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
Now do the groupby
>>> df.groupby(grouping_cols).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
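Personally, though, I prefer this answer using numpy
Posted on 2019-02-17 21:10:48
Since your data appears to guarantee 3 unique crops per county ("I am compiling a table of the top three crops by county."), sorting the values within each row and assigning them back is enough.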
import numpy as np
cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
5 Morrow carrots pears raddish 2500
6 Union carrots pears raddish 2700
7 Lake carrots pears raddish 2000
Then sum, selecting Total_pop explicitly so that the County strings are not aggregated:
df1.groupby(cols)[['Total_pop']].sum()
# Total_pop
#Crop1 Crop2 Crop3
#apples grain melons 8000
#carrots pears raddish 9200
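The sort-then-group steps above can be wrapped in a small helper that leaves the original frame untouched. This is a sketch under the same assumption as the answer (each row holds comparable string values in the crop columns); the function name `total_by_unordered_combo` and its parameters are hypothetical, not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

def total_by_unordered_combo(df, cols, value_col):
    """Sum value_col per combination of values in cols, ignoring column order."""
    out = df.copy()  # work on a copy so the caller's frame keeps its original order
    # Sort the values within each row so every ordering maps to the same key.
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out.groupby(cols, as_index=False)[value_col].sum()

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
    "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
    "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
    "Crop3": ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
    "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000],
})
cols = ["Crop1", "Crop2", "Crop3"]

totals = total_by_unordered_combo(df1, cols, "Total_pop")
print(totals)
# Total_pop is 8000 for apples/grain/melons and 9200 for carrots/pears/raddish
```

Selecting the value column before calling sum keeps non-numeric columns such as County out of the aggregation.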
The benefit is that it avoids Series.apply and .apply(axis=1). For larger DataFrames, the performance difference is noticeable:
df1 = pd.concat([df1]*10000, ignore_index=True)
cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
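The `%timeit` lines above only run inside IPython or Jupyter. A rough way to repeat the comparison in a plain script is sketched below using the standard `timeit` module. The function names, the reduced three-row frame, and the repeat counts are all assumptions chosen for illustration, and the timings will differ from the figures quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

cols = ["Crop1", "Crop2", "Crop3"]
small = pd.DataFrame({
    "Crop1": ["grain", "melons", "pears"],
    "Crop2": ["melons", "grain", "carrots"],
    "Crop3": ["apples", "apples", "raddish"],
})
# Repeat the rows to get a frame large enough to time (the size is arbitrary).
big = pd.concat([small] * 2000, ignore_index=True)

def sort_with_numpy(df):
    """Row-wise sort done in one vectorized NumPy call."""
    out = df.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out

def sort_with_apply(df):
    """Row-wise sort done with a Python-level function call per row."""
    out = df.copy()
    sorted_rows = out[cols].apply(lambda row: sorted(set(row)), axis=1)
    out[cols] = pd.DataFrame(sorted_rows.tolist(), columns=cols, index=out.index)
    return out

numpy_seconds = timeit.timeit(lambda: sort_with_numpy(big), number=3)
apply_seconds = timeit.timeit(lambda: sort_with_apply(big), number=3)
print(f"numpy: {numpy_seconds:.3f}s, apply: {apply_seconds:.3f}s")
```

Both functions copy the frame first, so the copy cost is included in each measurement and the comparison stays like for like.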
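Posted on 2019-02-17 20:54:04
Here is one approach.
First, let's get the unique values across the columns, sort them, and assign them back to the DataFrame. We'll do this on a copy of the original data, since you may need to keep the original.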
df = df1.copy()
to_sum = ['Crop1', 'Crop2', 'Crop3']
df[to_sum] = pd.DataFrame(df.loc[:, to_sum] \
.apply(set, axis=1) \
.apply(sorted) \
.values \
.tolist(), columns=to_sum)
print(df)
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
5 Morrow carrots pears raddish 2500
6 Union carrots pears raddish 2700
7 Lake carrots pears raddish 2000
Now we can perform our groupby to get the desired result.
df.groupby(to_sum).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
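One caveat with the set-based step: it relies on every row holding three distinct crops, which the question's data does. The sketch below uses a hypothetical row with a repeated crop (not taken from the question) to show how a set collapses duplicates, while a plain row-wise sort keeps all three values.

```python
import numpy as np

# Hypothetical row with a repeated crop; the question's data has no such row.
row_with_repeat = ["pears", "pears", "carrots"]

# A set drops the duplicate, so only two values remain for three columns.
via_set = sorted(set(row_with_repeat))

# A row-wise sort keeps every value, duplicates included.
via_sort = np.sort(np.array([row_with_repeat], dtype=object), axis=1)[0].tolist()

print(via_set)   # ['carrots', 'pears']
print(via_sort)  # ['carrots', 'pears', 'pears']
```

If repeated crops within a county are possible in your data, sorting the row values directly is the safer of the two normalizations.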
https://stackoverflow.com/questions/54737348