I want to achieve what is described here: stackoverflow question, but using only standard pandas.
I have two DataFrames. The first:
first_employee target_employee relationship
0 Andy Claude 0
1 Andy Frida 20
2 Andy Georgia -10
3 Andy Joan 30
4 Andy Lee -10
5 Andy Pablo -10
6 Andy Vincent 20
7 Claude Frida 0
8 Claude Georgia 90
9 Claude Joan 0
10 Claude Lee 0
11 Claude Pablo 10
12 Claude Vincent 0
13 Frida Georgia 0
14 Frida Joan 0
15 Frida Lee 0
16 Frida Pablo 50
17 Frida Vincent 60
18 Georgia Joan 0
19 Georgia Lee 10
20 Georgia Pablo 0
21 Georgia Vincent 0
22 Joan Lee 70
23 Joan Pablo 0
24 Joan Vincent 10
25 Lee Pablo 0
26 Lee Vincent 0
27 Pablo Vincent -20
The second:
first_employee target_employee book_count
0 Vincent Frida 2
1 Vincent Pablo 1
2 Andy Claude 1
3 Andy Joan 1
4 Andy Pablo 1
5 Andy Lee 1
6 Andy Frida 1
7 Andy Georgia 1
8 Claude Georgia 3
9 Joan Lee 3
10 Pablo Frida 2
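I want to join these two DataFrames so that the result is identical to the first DataFrame, but with an additional book_count column holding the corresponding values (or NaN where no value is available).
I have written something like this:
joined_df = first_df.merge(second_df, on = ['first_employee', 'target_employee'], how = 'outer')
and I get: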
first_employee target_employee relationship book_count
0 Andy Claude 0.0 1.0
1 Andy Frida 20.0 1.0
2 Andy Georgia -10.0 1.0
3 Andy Joan 30.0 1.0
4 Andy Lee -10.0 1.0
5 Andy Pablo -10.0 1.0
6 Andy Vincent 20.0 NaN
7 Claude Frida 0.0 NaN
8 Claude Georgia 90.0 3.0
9 Claude Joan 0.0 NaN
10 Claude Lee 0.0 NaN
11 Claude Pablo 10.0 NaN
12 Claude Vincent 0.0 NaN
13 Frida Georgia 0.0 NaN
14 Frida Joan 0.0 NaN
15 Frida Lee 0.0 NaN
16 Frida Pablo 50.0 NaN
17 Frida Vincent 60.0 NaN
18 Georgia Joan 0.0 NaN
19 Georgia Lee 10.0 NaN
20 Georgia Pablo 0.0 NaN
21 Georgia Vincent 0.0 NaN
22 Joan Lee 70.0 3.0
23 Joan Pablo 0.0 NaN
24 Joan Vincent 10.0 NaN
25 Lee Pablo 0.0 NaN
26 Lee Vincent 0.0 NaN
27 Pablo Vincent -20.0 NaN
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
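This is fairly close to what I want to achieve. However, the order of the values in first_employee and target_employee is irrelevant: if the first DataFrame has (Frida, Vincent) and the second has (Vincent, Frida), these two rows should be merged together (what matters is the pair of values, not which column each value sits in).
In my resulting DataFrame I get three extra rows: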
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
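They appear because my merge treats the values as ordered by column when joining. These 3 extra rows should instead be merged into the existing (Frida, Vincent), (Pablo, Vincent) and (Frida, Pablo) rows.
Is it possible to do this using only standard pandas functions? (The question I referenced at the beginning uses sqldf.)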
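To see which rows fail to match because the pair is reversed, one option is the indicator parameter of merge. The following is a minimal sketch on a small hypothetical subset of the data above (the three-row frames are assumptions chosen for illustration, not the full tables):

```python
import pandas as pd

# Hypothetical three-row subsets of the two tables in the question.
first_df = pd.DataFrame({
    "first_employee": ["Frida", "Frida", "Andy"],
    "target_employee": ["Vincent", "Pablo", "Claude"],
    "relationship": [60, 50, 0],
})
second_df = pd.DataFrame({
    "first_employee": ["Vincent", "Pablo", "Andy"],
    "target_employee": ["Frida", "Frida", "Claude"],
    "book_count": [2, 2, 1],
})

joined = first_df.merge(
    second_df,
    on=["first_employee", "target_employee"],
    how="outer",
    indicator=True,  # adds a "_merge" column: left_only / right_only / both
)

# Rows that exist only in second_df are the reversed pairs that did not match.
unmatched = joined[joined["_merge"] == "right_only"]
print(unmatched)
```

Here only (Andy, Claude) is marked as both; the reversed pairs show up as separate left_only and right_only rows, which is the same effect as the three extra rows above.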
Answered 2021-12-04 15:35:07
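I believe this is what you are looking for. np.sort reorders the first two columns within each row so that the two names are in alphabetical order, which allows the merge to match the pairs correctly.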
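import numpy as np
import pandas as pd
cols = ['first_employee','target_employee']
# Sort the two names within each row, so (Vincent, Frida) becomes (Frida, Vincent)
df[cols] = np.sort(df[cols].to_numpy(),axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(),axis=1)
# A left merge keeps every row of df and adds book_count where the pair matches
ndf = pd.merge(df,df2,on = cols,how='left')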
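As a self-contained illustration of the np.sort approach, here is a minimal sketch on a small hypothetical subset of the data. The helper name normalize_pairs is an assumption introduced for this example. It works on copies, so the input frames are left unmodified, whereas the snippet above overwrites the columns in place.

```python
import numpy as np
import pandas as pd

cols = ["first_employee", "target_employee"]

# Hypothetical subsets of the two tables in the question.
df = pd.DataFrame({
    "first_employee": ["Frida", "Frida", "Andy"],
    "target_employee": ["Vincent", "Pablo", "Vincent"],
    "relationship": [60, 50, 20],
})
df2 = pd.DataFrame({
    "first_employee": ["Vincent", "Pablo"],
    "target_employee": ["Frida", "Frida"],
    "book_count": [2, 2],
})

def normalize_pairs(frame):
    """Return a copy whose two name columns are sorted alphabetically within each row."""
    out = frame.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out

# The left merge keeps every row of df; unmatched pairs get NaN in book_count.
ndf = normalize_pairs(df).merge(normalize_pairs(df2), on=cols, how="left")
print(ndf)
```

Because the merge is a left join, the result has exactly as many rows as df, provided each unordered pair appears at most once in df2.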
Answered 2021-12-04 15:39:24
First create a key from a sorted tuple of the two names, then merge on it:
create_key = lambda x: tuple(sorted([x['first_employee'], x['target_employee']]))
out = pd.merge(df1.assign(_key=df1.apply(create_key, axis=1)),
               df2.assign(_key=df2.apply(create_key, axis=1)),
               on='_key', suffixes=('', '_key'), how='outer') \
        .loc[:, lambda x: ~x.columns.str.endswith('_key')]
print(out)
# Output:
first_employee target_employee relationship book_count
0 Andy Claude 0 1.0
1 Andy Frida 20 1.0
2 Andy Georgia -10 1.0
3 Andy Joan 30 1.0
4 Andy Lee -10 1.0
5 Andy Pablo -10 1.0
6 Andy Vincent 20 NaN
7 Claude Frida 0 NaN
8 Claude Georgia 90 3.0
9 Claude Joan 0 NaN
10 Claude Lee 0 NaN
11 Claude Pablo 10 NaN
12 Claude Vincent 0 NaN
13 Frida Georgia 0 NaN
14 Frida Joan 0 NaN
15 Frida Lee 0 NaN
16 Frida Pablo 50 2.0
17 Frida Vincent 60 2.0
18 Georgia Joan 0 NaN
19 Georgia Lee 10 NaN
20 Georgia Pablo 0 NaN
21 Georgia Vincent 0 NaN
22 Joan Lee 70 3.0
23 Joan Pablo 0 NaN
24 Joan Vincent 10 NaN
25 Lee Pablo 0 NaN
26 Lee Vincent 0 NaN
27 Pablo Vincent -20 1.0
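The same sorted-key idea can also be written without a row-wise apply. The sketch below is one possible variant, not the answer's own method: the helper with_pair_key and the temporary columns _lo and _hi are hypothetical names introduced here, and the frames are small assumed subsets of the data. It uses a left merge so the first frame's rows and original name order are preserved.

```python
import pandas as pd

def with_pair_key(frame, a="first_employee", b="target_employee"):
    """Add helper columns _lo/_hi holding each pair in alphabetical order."""
    first_is_lo = frame[a] <= frame[b]
    return frame.assign(
        _lo=frame[a].where(first_is_lo, frame[b]),  # alphabetically smaller name
        _hi=frame[b].where(first_is_lo, frame[a]),  # alphabetically larger name
    )

# Hypothetical subsets of the two tables in the question.
df1 = pd.DataFrame({
    "first_employee": ["Frida", "Pablo", "Lee"],
    "target_employee": ["Vincent", "Vincent", "Vincent"],
    "relationship": [60, -20, 0],
})
df2 = pd.DataFrame({
    "first_employee": ["Vincent", "Vincent"],
    "target_employee": ["Frida", "Pablo"],
    "book_count": [2, 1],
})

out = (
    with_pair_key(df1)
    .merge(
        with_pair_key(df2)[["_lo", "_hi", "book_count"]],
        on=["_lo", "_hi"],
        how="left",
    )
    .drop(columns=["_lo", "_hi"])
)
print(out)
```

Only the key columns and book_count are taken from the second frame, so no suffix handling is needed and the helper columns are dropped at the end.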
https://stackoverflow.com/questions/70226842