I have a df like this:
+---+-----+-----+----+
| M|M_Max|Sales|Rank|
+---+-----+-----+----+
| M1| 100| 200| 1|
| M1| 100| 175| 2|
| M1| 101| 150| 3|
| M1| 100| 125| 4|
| M1| 100| 90| 5|
| M1| 100| 85| 6|
| M2| 200| 1001| 1|
| M2| 200| 500| 2|
| M2| 201| 456| 3|
| M2| 200| 345| 4|
| M2| 200| 231| 5|
| M2| 200| 123| 6|
+---+-----+-----+----+

I am doing a pivot operation on this df, like this:
df.groupBy("M").pivot("Rank").agg(first("Sales")).show
+---+----+---+---+---+---+---+
| M| 1| 2| 3| 4| 5| 6|
+---+----+---+---+---+---+---+
| M1| 200|175|150|125| 90| 85|
| M2|1001|500|456|345|231|123|
+---+----+---+---+---+---+---+

But my expected output is shown below. That is, I also need a column Max(M_Max) in the output, where M_Max is the maximum of the M_Max column. Can the pivot function do this without using a DataFrame join?
+---+----+---+---+---+---+---+-----+
| M| 1| 2| 3| 4| 5| 6|M_Max|
+---+----+---+---+---+---+---+-----+
| M1| 200|175|150|125| 90| 85| 101|
| M2|1001|500|456|345|231|123| 201|
+---+----+---+---+---+---+---+-----+

Posted on 2020-03-03 15:19:12
The trick is to apply a window function: the windowed max of M_Max is constant within each M partition, so it can be added to the groupBy without changing the grouping. The solution is as follows:
scala> val df = Seq(
     |   ("M1",100,200,1),
     |   ("M1",100,175,2),
     |   ("M1",101,150,3),
     |   ("M1",100,125,4),
     |   ("M1",100,90,5),
     |   ("M1",100,85,6),
     |   ("M2",200,1001,1),
     |   ("M2",200,500,2),
     |   ("M2",200,456,3),
     |   ("M2",200,345,4),
     |   ("M2",200,231,5),
     |   ("M2",201,123,6)
     | ).toDF("M","M_Max","Sales","Rank")
df: org.apache.spark.sql.DataFrame = [M: string, M_Max: int ... 2 more fields]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val w = Window.partitionBy("M")
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@49b4e11c
scala> df.withColumn("new", max("M_Max") over (w)).groupBy("M", "new").pivot("Rank").agg(first("Sales")).withColumnRenamed("new", "M_Max").show
+---+-----+----+---+---+---+---+---+
| M|M_Max| 1| 2| 3| 4| 5| 6|
+---+-----+----+---+---+---+---+---+
| M1| 101| 200|175|150|125| 90| 85|
| M2| 201|1001|500|456|345|231|123|
+---+-----+----+---+---+---+---+---+
scala> df.show
+---+-----+-----+----+
| M|M_Max|Sales|Rank|
+---+-----+-----+----+
| M1| 100| 200| 1|
| M1| 100| 175| 2|
| M1| 101| 150| 3|
| M1| 100| 125| 4|
| M1| 100| 90| 5|
| M1| 100| 85| 6|
| M2| 200| 1001| 1|
| M2| 200| 500| 2|
| M2| 200| 456| 3|
| M2| 200| 345| 4|
| M2| 200| 231| 5|
| M2| 201| 123| 6|
+---+-----+-----+----+

Let me know if this helps!!
Posted on 2020-03-05 06:18:44
Basically, I see three possible approaches:

1. Computing the max of M_Max separately and joining it back (a sketch is given below).
2. The window function proposed in the other answer.
3. Taking the max of M_Max within the pivot and aggregating the generated columns with array_max.

Most likely, approach 1 is less efficient. Between 2 and 3, however, I am not sure. You could try with your data and tell us ;-)
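For reference, here is a minimal sketch of approach 1, assuming the df defined above (the join is exactly what the question hoped to avoid, so it is shown only for comparison):

import org.apache.spark.sql.functions.{first, max}

// compute the per-M max separately
val maxDf = df.groupBy("M").agg(max("M_Max") as "M_Max")

// pivot as in the question, then join the max back on M
df.groupBy("M").pivot("Rank").agg(first("Sales"))
  .join(maxDf, Seq("M"))
  .show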
Approach 3 goes as follows:
val df = Seq(
("M1", 100, 200, 1), ("M1", 100, 175, 2), ("M1", 101, 150, 3),
("M1", 100, 125, 4), ("M1", 100, 90, 5), ("M1", 100, 85, 6),
("M2", 200, 1001, 1), ("M2", 200, 500, 2), ("M2", 200, 456, 3),
("M2", 200, 345, 4), ("M2", 200, 231, 5), ("M2", 201, 123, 6)
).toDF("M","M_Max","Sales","Rank")
// we include the max in the pivot, so we have one max column per rank
val df_pivot = df
.groupBy("M").pivot("Rank")
.agg(first('Sales) as "first", max('M_Max) as "max")
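// (with multiple aggregations, pivot names the output columns "<rank>_<alias>", e.g. "1_first", "1_max")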
val max_cols = df_pivot.columns.filter(_ endsWith "max").map(col)
// then we aggregate these max columns into one
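// (array_max requires Spark 2.4+; on older versions, functions.greatest(max_cols: _*) could be used instead)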
val max_col = array_max(array(max_cols : _*)) as "M_Max"
// let's rename the first columns to match your expected output
val first_cols = df_pivot.columns.filter(_ endsWith "first")
.map(name => col(name) as name.split("_")(0))
// And finally, we wrap everything together
df_pivot
.select($"M" +: first_cols :+ max_col : _*)
.show(false)

which yields:
+---+----+---+---+---+---+---+-----+
|M |1 |2 |3 |4 |5 |6 |M_Max|
+---+----+---+---+---+---+---+-----+
|M1 |200 |175|150|125|90 |85 |101 |
|M2 |1001|500|456|345|231|123|201 |
+---+----+---+---+---+---+---+-----+

https://stackoverflow.com/questions/60509684