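I have a DataFrame named data:

    Subjects     Professor  StudentID
8   Chemistry    Jane       999
1   Chemistry    Jane       3455
0   Chemistry    Joseph     1234
2   History      Jane       3455
6   History      Smith      323
7   History      Smith      999
3   Mathematics  Doe        56767
10  Mathematics  Einstein   3455
5   Physics      Einstein   2834
9   Physics      Smith      323
4   Physics      Smith      999

I want to run this query: "professors who have at least 2 classes with 2 or more of the same students". Expected output:

Smith: Physics, History, 323, 999

I am familiar with SQL and could do this easily there, but I am still a beginner in Python. How can I produce this output in Python? Another idea is to convert this DataFrame into an SQL database and run the query through an SQL interface from Python. Is there a way to do that?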
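The SQL route raised in the question can be sketched with the standard-library sqlite3 module together with pandas' DataFrame.to_sql and pandas.read_sql_query. This is only one possible approach, not the method from the answer below: the in-memory database, the table name "data", and the column aliases subject_a, subject_b and shared_students are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Rebuild the DataFrame from the table in the question.
data = pd.DataFrame(
    {
        "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History", "History",
                     "History", "Mathematics", "Mathematics", "Physics", "Physics",
                     "Physics"],
        "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith", "Doe",
                      "Einstein", "Einstein", "Smith", "Smith"],
        "StudentID": [999, 3455, 1234, 3455, 323, 999, 56767, 3455, 2834, 323, 999],
    },
    index=[8, 1, 0, 2, 6, 7, 3, 10, 5, 9, 4],
)

# Copy the DataFrame into an in-memory SQLite table (table name is an assumption).
con = sqlite3.connect(":memory:")
data.to_sql("data", con, index=False)

# Self-join: pair up rows of the same professor and student in two different
# subjects, then keep subject pairs that share at least 2 students.
query = """
SELECT a.Professor,
       a.Subjects AS subject_a,
       b.Subjects AS subject_b,
       COUNT(*)   AS shared_students
FROM data AS a
JOIN data AS b
  ON a.Professor = b.Professor
 AND a.StudentID = b.StudentID
 AND a.Subjects < b.Subjects
GROUP BY a.Professor, a.Subjects, b.Subjects
HAVING COUNT(*) >= 2
"""
result = pd.read_sql_query(query, con)
con.close()

print(result)
```

With the sample data, the only row returned should be for Smith, pairing History with Physics and reporting 2 shared students.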
Posted on 2016-09-23 14:05:50
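A solution using filter and value_counts:

# df is the DataFrame from the question (called data there).
# Keep a professor's rows if the group has more than one row and
# more than one StudentID occurs more than once within the group.
df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) &
                                     ((x.StudentID.value_counts() > 1).sum() > 1))
print(df1)

   Subjects  Professor  StudentID
6  History   Smith      323
7  History   Smith      999
9  Physics   Smith      323
4  Physics   Smith      999

And a solution using duplicated: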
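# Keep a professor's rows if the group has more than one row and
# more than one row repeats an earlier StudentID.
df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) &
                                     (x.StudentID.duplicated().sum() > 1))
print(df1)

   Subjects  Professor  StudentID
6  History   Smith      323
7  History   Smith      999
9  Physics   Smith      323
4  Physics   Smith      999

Edit, following a comment: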
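You can return custom output from a custom function, then drop the NaN rows with Series.dropna:

# Convert the IDs to strings so they can be joined with the subject names.
df.StudentID = df.StudentID.astype(str)

def f(x):
    # Groups that do not match return None implicitly and are dropped below.
    if (len(x.Subjects) > 1) & (x.StudentID.duplicated().sum() > 1):
        return ', '.join(x.Subjects.unique().tolist() + x.StudentID.unique().tolist())

df1 = df.groupby('Professor').apply(f).dropna()
# Prefix each result with the professor's name.
df1 = df1.index.to_series() + ': ' + df1
print(df1)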

Professor
Smith    Smith: History, Physics, 323, 999
dtype: object

https://stackoverflow.com/questions/39646300
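
One caveat about the conditions used above: len(x.Subjects) counts rows rather than distinct subjects, and in the duplicated variant a single student who takes three classes with the same professor would also satisfy duplicated().sum() > 1. A stricter check that compares a professor's classes pair by pair might look like the sketch below. The helper name shared_students and its min_shared parameter are hypothetical, not part of the original answer, and the DataFrame is rebuilt from the table in the question.

```python
from itertools import combinations

import pandas as pd

# Rebuild the DataFrame from the table in the question.
df = pd.DataFrame(
    {
        "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History", "History",
                     "History", "Mathematics", "Mathematics", "Physics", "Physics",
                     "Physics"],
        "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith", "Doe",
                      "Einstein", "Einstein", "Smith", "Smith"],
        "StudentID": [999, 3455, 1234, 3455, 323, 999, 56767, 3455, 2834, 323, 999],
    },
    index=[8, 1, 0, 2, 6, 7, 3, 10, 5, 9, 4],
)


def shared_students(frame, min_shared=2):
    """Hypothetical helper: map each professor to the subject pairs that
    share at least `min_shared` students, with the shared IDs."""
    result = {}
    for prof, group in frame.groupby("Professor"):
        # One set of student IDs per subject taught by this professor.
        students = {subj: set(g["StudentID"]) for subj, g in group.groupby("Subjects")}
        # Compare every pair of this professor's subjects.
        for a, b in combinations(sorted(students), 2):
            common = students[a] & students[b]
            if len(common) >= min_shared:
                result.setdefault(prof, []).append((a, b, sorted(common)))
    return result


pairs = shared_students(df)
for prof, rows in pairs.items():
    for a, b, ids in rows:
        print(f"{prof}: {a}, {b}, {', '.join(map(str, ids))}")
# prints: Smith: History, Physics, 323, 999
```

Because the intersection is computed per pair of subjects, a professor only qualifies when two specific classes have at least min_shared students in common.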