I have two DataFrames like these.
df1
+----+-------------+
|colA|colB         |
+----+-------------+
|   1| "someval"   |
|   2| "someval2"  |
|   3| "someval3"  |
+----+-------------+
df2
+----+-------------+
|colA|colC         |
+----+-------------+
|   1| "someval"   |
|   1| "someval2"  |
|   2| "someval3"  |
+----+-------------+
If I inner join df1 and df2 on colA, I get this result:
+----+-------------+----------+
|colA|colB         |colC      |
+----+-------------+----------+
|   1| "someval"   |"someval" |
|   1| "someval"   |"someval2"|
|   2| "someval2"  |"someval3"|
+----+-------------+----------+
But I only need one row per distinct colA value (so taking the top row for each colA is enough):
+----+-------------+----------+
|colA|colB         |colC      |
+----+-------------+----------+
|   1| "someval"   |"someval" |
|   2| "someval2"  |"someval3"|
+----+-------------+----------+
Posted on 2019-07-02 14:17:00
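Try this. First drop the rows of df2 that repeat a colA value:
distinct_df = df2.dropDuplicates(['colA'])
Then join your DataFrames:
# joining on the column name keeps a single colA column in the result
inner_join_df = df1.join(distinct_df, on='colA', how='inner')
inner_join_df.show()
I have also done the same join with pandas: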
import pandas as pd

data1 = [[1, 'someval'], [2, 'someval2'], [3, 'someval3']]
data2 = [[1, 'someval'], [1, 'someval2'], [2, 'someval3']]
df1 = pd.DataFrame(data1, columns=['colA', 'colB'])
df2 = pd.DataFrame(data2, columns=['colA', 'colC'])

# keep one row per colA value in df2, then inner join on colA
unique_df = df2.drop_duplicates('colA')
joindf = pd.merge(df1, unique_df, on='colA', how='inner')
print(joindf)
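In the pandas version, which duplicate survives is decided by the `keep` argument of `drop_duplicates`. The following is a minimal sketch using the question's sample data; the variable names `first_rows`, `last_rows`, `joined_first` and `joined_last` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'someval'], [2, 'someval2'], [3, 'someval3']],
                   columns=['colA', 'colB'])
df2 = pd.DataFrame([[1, 'someval'], [1, 'someval2'], [2, 'someval3']],
                   columns=['colA', 'colC'])

# keep='first' (the default) retains the first row seen for each colA value;
# keep='last' retains the last one instead.
first_rows = df2.drop_duplicates('colA', keep='first')
last_rows = df2.drop_duplicates('colA', keep='last')

joined_first = pd.merge(df1, first_rows, on='colA', how='inner')
joined_last = pd.merge(df1, last_rows, on='colA', how='inner')

print(joined_first)  # colA=1 is paired with colC='someval'
print(joined_last)   # colA=1 is paired with colC='someval2'
```

If "top row" should mean something other than the original row order, sorting df2 before calling `drop_duplicates` is one way to make that choice explicit.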
Posted on 2019-07-02 12:17:04
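Use a window function to number the rows that share the same colA value, ordered by whichever other columns you prefer. In a second step, keep only the rows where the function's result is 1.
# df_p is assumed to be the joined DataFrame registered as a temporary table (that step is not shown in this answer)
sqlContext.sql("""
    SELECT colA, colB, colC
    FROM (
        SELECT *, row_number() OVER (PARTITION BY colA ORDER BY colB, colC) AS rn
        FROM df_p
    ) x
    WHERE rn = 1
""").show(60)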
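For readers following the pandas variant of this question, the same two-step idea (number the rows within each colA group, then keep number 1) can be sketched as below. This is a pandas analogue, not Spark code; the names `joined`, `ranked` and `result` are illustrative assumptions, and `rn` mirrors the alias used in the SQL.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'someval'], [2, 'someval2'], [3, 'someval3']],
                   columns=['colA', 'colB'])
df2 = pd.DataFrame([[1, 'someval'], [1, 'someval2'], [2, 'someval3']],
                   columns=['colA', 'colC'])

# Inner join first, so colA=1 appears twice, as in the question.
joined = pd.merge(df1, df2, on='colA', how='inner')

# Step 1: order the rows, then number them from 1 within each colA group
# (a counterpart of ROW_NUMBER() OVER (PARTITION BY colA ORDER BY colB, colC)).
ranked = joined.sort_values(['colA', 'colB', 'colC'])
ranked['rn'] = ranked.groupby('colA').cumcount() + 1

# Step 2: keep only row number 1 of each group and drop the helper column.
result = ranked[ranked['rn'] == 1].drop(columns='rn').reset_index(drop=True)
print(result)
```

Changing the column list passed to `sort_values` changes which row counts as the "top" one for each colA, in the same way the ORDER BY clause does in the SQL.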
https://stackoverflow.com/questions/56850042