A =
+------------+------------+------+
|        Name| Nationality|Salary|
+------------+------------+------+
|    A. Abbas|        Iraq|   €2K|
| A. Abdallah|      France|   €1K|
|A. Abdennour|     Tunisia|  €31K|
+------------+------------+------+
B =
+------------+------+
|        Name|Salary|
+------------+------+
|    A. Abbas|   €4K|
| A. Abdallah|   €1K|
|A. Abdennour|  €33K|
+------------+------+
The expected updatedDF should look like this:
+------------+------------+------+
|        Name| Nationality|Salary|
+------------+------------+------+
|    A. Abbas|        Iraq|   €4K|
| A. Abdallah|      France|   €1K|
|A. Abdennour|     Tunisia|  €33K|
+------------+------------+------+
I tried the following Spark Scala code:
val updatedDF = a.join(b, Seq("Name"), "inner")
updatedDF.show()
But after the join, my output contains duplicates. How can I merge the two DataFrames without duplicates?
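For context: joining on Name alone keeps the Salary column from both frames, so with the sample data above updatedDF.show() would presumably print two Salary columns side by side. A sketch of that output (inferred from Spark's join semantics, not the asker's actual result):

+------------+------------+------+------+
|        Name| Nationality|Salary|Salary|
+------------+------------+------+------+
|    A. Abbas|        Iraq|   €2K|   €4K|
| A. Abdallah|      France|   €1K|   €1K|
|A. Abdennour|     Tunisia|  €31K|  €33K|
+------------+------------+------+------+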
Posted on 2019-04-23 07:36:40
If there are duplicates, it means the Name column is not unique. I suggest adding an index column to use in the join, then dropping it afterwards:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a consecutive, row-order-preserving "index" column via zipWithIndex
def addColumnIndex(df: DataFrame): DataFrame =
  spark.createDataFrame(
    df.rdd.zipWithIndex.map {
      case (row, index) => Row.fromSeq(row.toSeq :+ index)
    },
    StructType(df.schema.fields :+ StructField("index", LongType, nullable = false)))

// Add the index to both DataFrames...
val aIndexed = addColumnIndex(a)
println("a count: " + aIndexed.count())

val bIndexed = addColumnIndex(b)
println("b count: " + bIndexed.count())

// Join on index and Name, keep b's Salary, then drop the helper column
val ab = aIndexed.join(bIndexed, Seq("index", "Name"), "inner")
  .drop(aIndexed.col("Salary"))
  .drop("index")
println("ab count: " + ab.count())
Posted on 2019-04-23 04:51:21
import spark.implicits._

val a = sc.parallelize(List(("A. Abbas","Iraq","2K"), ("A. Abdallah","France","1K"), ("A. Abdennour","Tunisia","31K"))).toDF("Name", "Nationality", "Salary")
val b = sc.parallelize(List(("A. Abbas","4K"), ("A. Abdallah","1K"), ("A. Abdennour","33K"))).toDF("Name", "Salary")

// Drop a's stale Salary so only b's updated Salary survives the join
b.join(a, Seq("Name"), "inner").drop(a.col("Salary")).show
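One small caveat on this answer (my note, not the answerer's): with b on the left, the result's columns come out as Name, Salary, Nationality rather than the order shown in the question. A trailing select can restore the expected layout:

val updatedDF = b.join(a, Seq("Name"), "inner")
  .drop(a.col("Salary"))
  .select("Name", "Nationality", "Salary")
updatedDF.show()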
https://stackoverflow.com/questions/55800706