Q: How to pivot a table in PySpark without an aggregate function

Stack Overflow user

Asked 2020-03-21 07:45:59

1 answer, 74 views, 0 followers, 0 votes

I have a DataFrame like this in PySpark:

|--------------|----------------|---------------|
|   col_1      |     col_2      |   col_3       |
|-----------------------------------------------|
|       1      |       A        |     abd       |
|-----------------------------------------------|
|       1      |       B        |     acd       |
|-----------------------------------------------|
|       1      |       A        |     bcd       |
|-----------------------------------------------|
|       1      |       B        |     ceg       |
|-----------------------------------------------|
|       2      |       A        |     cgs       |
|-----------------------------------------------|
|       2      |       B        |     bsc       |
|-----------------------------------------------|
|       2      |       A        |     iow       |
|-----------------------------------------------|

I want to pivot the table into this:

|--------------|----------------|---------------|
|   col_1      |       A        |      B        |
|-----------------------------------------------|
|       1      |       abd      |     acd       |
|-----------------------------------------------|
|       1      |       bcd      |     ceg       |
|-----------------------------------------------|
|       2      |       cgs      |     bsc       |
|-----------------------------------------------|
|       2      |       iow      |     null      |
|-----------------------------------------------|

How can I do this? The pivot function on a PySpark DataFrame requires an aggregate function, and in my case col_1 is not unique either.

pivot-table

pyspark-dataframes

1 Answer

Stack Overflow user

Answered 2020-03-23 10:01:25

Here is one way to get the target result:

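    import pyspark.sql.functions as F

    # Collect all col_3 values for each (col_1, col_2) pair into one array per pivoted column.
    df = df.groupBy("col_1").pivot("col_2").agg(F.collect_list("col_3"))

    cols = df.columns[1:]

    def explode_with_pos(name):
        # posexplode emits each array element together with its index, which gives
        # a row number per col_1 that is consistent across the pivoted columns.
        # (monotonically_increasing_id is not guaranteed to line up between them.)
        return df.select("col_1", F.posexplode(name).alias("pos", name))

    res = explode_with_pos(cols[0])
    for name in cols[1:]:
        # The outer join keeps positions that exist in only one column (filled with null).
        res = res.join(explode_with_pos(name), on=["col_1", "pos"], how="outer")

    res = res.orderBy("col_1", "pos").drop("pos")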
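The pairing rule the question asks for, matching the n-th A value with the n-th B value inside each col_1 group and padding with null when one side runs out, can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the function name `pivot_without_aggregation` and the `rows` list are hypothetical, and the sketch assumes the input order of the rows is the order you want to preserve (Spark itself does not guarantee row order after a shuffle).

```python
from collections import defaultdict
from itertools import zip_longest

# Sample data from the question as (col_1, col_2, col_3) tuples.
rows = [
    (1, "A", "abd"), (1, "B", "acd"), (1, "A", "bcd"), (1, "B", "ceg"),
    (2, "A", "cgs"), (2, "B", "bsc"), (2, "A", "iow"),
]

def pivot_without_aggregation(rows):
    """Pivot (key, column, value) rows, pairing values by their position
    within each (key, column) group instead of aggregating them."""
    columns = sorted({col for _, col, _ in rows})
    grouped = defaultdict(lambda: defaultdict(list))
    for key, col, value in rows:
        grouped[key][col].append(value)

    result = []
    for key in sorted(grouped):
        per_column = [grouped[key][col] for col in columns]
        # zip_longest pads the shorter lists with None, like the outer join.
        for values in zip_longest(*per_column):
            result.append((key, *values))
    return columns, result

columns, pivoted = pivot_without_aggregation(rows)
print(["col_1", *columns])
for row in pivoted:
    print(row)
```

For the sample data this yields the four rows of the target table, with `None` standing in for the missing B value in the last row.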

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60783019
