
How to use the except function on a Spark DataFrame
Stack Overflow user
Asked on 2018-12-04 14:51:10
3 answers · 16K views · 0 followers · Score: 3

I want to get the difference between two DataFrames, but return only the rows whose fields actually differ. For example, I have two DataFrames as follows:

val DF1 = Seq(
    (3,"Chennai",  "rahman",9846, 45000,"SanRamon"),
    (1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")

val DF2 = Seq(
    (3,"Chennai",  "rahman",9846, 45000,"SanRamon"),
    (1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")

The only difference between these two DataFrames is the emp_city and emp_sal of the second row. Now when I use the except function, it returns the entire row, as shown below:

DF1.except(DF2)

+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
|     1|Hyderabad|     ram|     9847|  50000|      SF|
+------+---------+--------+---------+-------+--------+

However, I need output like this:

+------+---------+-------+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
|     1|Hyderabad|  50000|
+------+---------+-------+

That is, it should show only the differing cells, along with the emp_id.

Edit: if a column has changed it should be shown; if it has not changed it should be hidden or null.


3 Answers

Stack Overflow user

Accepted answer

Posted on 2018-12-05 09:09:54

I found this solution, which seems to work well:

import org.apache.spark.sql.functions.{when, array}

val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
// null when the two sides match, otherwise a [left, right] pair
def mapDiffs(name: String) =
  when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)

It produces output as follows:

+------+-------------------+--------+---------+--------------+--------+
|emp_id|           emp_city|emp_name|emp_phone|       emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
|     1|[Sydney, Hyderabad]|    null|     null|[48000, 50000]|    null|
+------+-------------------+--------+---------+--------------+--------+
Score: 1
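The mapDiffs logic above (null when a column matches, a [left, right] pair when it differs) can be sketched without Spark in plain Python, using hypothetical row dicts in place of the joined l/r rows:

```python
def map_diffs(left, right, key="emp_id"):
    """For each non-key column, return None when the values match,
    otherwise a [left, right] pair -- mirroring when/otherwise."""
    return {
        col: (None if left[col] == right[col] else [left[col], right[col]])
        for col in left if col != key
    }

l = {"emp_id": 1, "emp_city": "Sydney",    "emp_sal": 48000}
r = {"emp_id": 1, "emp_city": "Hyderabad", "emp_sal": 50000}
print(map_diffs(l, r))
# {'emp_city': ['Sydney', 'Hyderabad'], 'emp_sal': [48000, 50000]}
```

This is only a sketch of the per-row comparison; the Scala answer does the same thing column-wise across the whole joined DataFrame.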

Stack Overflow user

Posted on 2021-05-07 02:59:56

The code below should give you the result you want (note that it assumes you already know which columns differ):

DF1.except(DF2).select("emp_id", "emp_city", "emp_sal")

Score: 3

Stack Overflow user

Posted on 2018-12-04 20:25:25

You should consider @user238607's comment, since we cannot predict in advance which columns will differ.

Still, you can try this workaround.

I am assuming that emp_id is unique:

scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )

scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]

scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]

scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn                 |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3     |Chennai  |rahman  |9846     |45000  |SanRamon|3     |Chennai |rahman  |9846     |45000  |SanRamon|[, , , , , ]               |
|1     |Hyderabad|ram     |9847     |50000  |SF      |1     |Sydney  |ram     |9847     |48000  |SF      |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+

scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))

scala> val exceptOpr = DF1.except(DF2)

scala> exceptOpr.select(diff_cols:_*).show

+-------+---------+
|emp_sal| emp_city|
+-------+---------+
|  50000|Hyderabad|
+-------+---------+
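The udf in this answer emits the column name when the two values differ and an empty string otherwise; the explode/filter/distinct pipeline then collects the non-empty names. A Spark-free sketch of that idea, with hypothetical row dicts standing in for the joined rows:

```python
def changed_columns(row1, row2):
    """Return the names of columns whose values differ between two rows,
    mimicking the diff udf + explode/filter/distinct pipeline."""
    return [c for c in row1 if row1[c] != row2[c]]

df1_row = {"emp_id": 1, "emp_city": "Hyderabad", "emp_name": "ram",
           "emp_phone": 9847, "emp_sal": 50000, "emp_site": "SF"}
df2_row = {"emp_id": 1, "emp_city": "Sydney", "emp_name": "ram",
           "emp_phone": 9847, "emp_sal": 48000, "emp_site": "SF"}
print(changed_columns(df1_row, df2_row))  # ['emp_city', 'emp_sal']
```

The Scala version then feeds those discovered column names back into a select on DF1.except(DF2), which is what produces the final two-column output.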
Score: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/53607190
