I have the following two dataframes:
df1

+---+------+-----+
|id |amount|fee  |
+---+------+-----+
|1  |10.00 |5.0  |
|3  |90    |130.0|
+---+------+-----+

df2

+----+--------+-----+
|exId|exAmount|exFee|
+----+--------+-----+
|1   |10.00   |5.0  |
|1   |10.0    |5.0  |
|3   |90.0    |130.0|
+----+--------+-----+

I am joining them on all three columns and trying to identify the rows that are common to both dataframes and the rows that are not.
I am looking for this output:
+----+------+-----+----+--------+-----+
|id  |amount|fee  |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|1   |10.00 |5.0  |1   |10.0    |5.0  |
|null|null  |null |1   |10.0    |5.0  |
|3   |90    |130.0|3   |90.0    |130.0|
+----+------+-----+----+--------+-----+

Basically, I want the duplicate row in df2 with exId 1 to be listed separately. Any ideas?
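For reference, a minimal sketch of how these two dataframes could be built (my assumptions: a local SparkSession and double-typed numeric columns, so that 10.00 and 10.0 compare equal in the join; note a double displays as 10.0, so reproducing the exact 10.00 formatting above would require string or decimal columns):

import org.apache.spark.sql.SparkSession

// Hypothetical local session, only needed to run the example standalone
val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 10.00, 5.0), (3, 90.0, 130.0)).toDF("id", "amount", "fee")
val df2 = Seq((1, 10.00, 5.0), (1, 10.0, 5.0), (3, 90.0, 130.0))
  .toDF("exId", "exAmount", "exFee")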
Posted on 2018-05-11 02:59:02
One possible way is to group by all three columns and generate a row number for each dataframe, then use that additional column along with the other three columns in the join. You should get what you are looking for.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// row_number within each group of identical values distinguishes duplicates,
// so the two identical exId=1 rows in df2 get exSno 1 and 2
val windowSpec1 = Window.partitionBy("id", "amount", "fee").orderBy("fee")
val windowSpec2 = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")

df1.withColumn("sno", row_number().over(windowSpec1))
  .join(
    df2.withColumn("exSno", row_number().over(windowSpec2)),
    col("id") === col("exId") && col("amount") === col("exAmount") &&
      col("fee") === col("exFee") && col("sno") === col("exSno"),
    "outer")
  .drop("sno", "exSno")
  .show(false)

You should get
+----+------+-----+----+--------+-----+
|id |amount|fee |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|null|null |null |1 |10.0 |5.0 |
|3 |90 |130.0|3 |90 |130.0|
|1 |10.00 |5.0 |1 |10.00 |5.0 |
+----+------+-----+----+--------+-----+

I hope the answer is helpful.
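To then split the result into common and uncommon rows, which is what the question asks for, one can filter on the null side of the outer join. A small sketch, assuming the joined dataframe above (before the .show call) is kept in a val named result (a name I introduce here, not from the original answer):

// Rows present in both dataframes
result.filter(col("id").isNotNull && col("exId").isNotNull).show(false)

// Rows only in df2, e.g. the duplicate exId=1 row
result.filter(col("id").isNull).show(false)

// Rows only in df1
result.filter(col("exId").isNull).show(false)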
https://stackoverflow.com/questions/50283823