Q: How do I merge duplicate columns in PySpark?

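Stack Overflow user

Asked 2021-06-18 10:48:23

2 answers, 312 views, 0 followers, 2 votes

I have a PySpark DataFrame in which some columns share the same name. I want to merge all columns with the same name into a single column. For example, the input dataframe:

How can I do this in PySpark? Any help would be appreciated.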

apache-spark-sql

apache-spark

pyspark

2 Answers

Stack Overflow user

Accepted answer

Answered 2021-06-18 15:43:15

Edited in response to the OP's request to coalesce from a list.

Here is a reproducible example:

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    # Reuse the active session (e.g. in a shell or notebook) or create one
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([
        ("z", "a", None, None),
        ("b", None, "c", None),
        ("c", "b", None, None),
        ("d", None, None, "z"),
    ], ["a", "c", "c", "c"])

    df.show()

    # Fix duplicated column names: keep the first occurrence as-is and
    # append a running suffix (_0, _1, ...) to every later duplicate
    old_col = df.schema.names
    seen = set()
    new_col = []
    i = 0
    for column in old_col:
        if column in seen:
            new_col.append(f"{column}_{i}")
            i += 1
        else:
            new_col.append(column)
            seen.add(column)
    print(new_col)

    df1 = df.toDF(*new_col)

    # Coalesce a list of columns into one column, then drop the leftovers
    a = ['c', 'c_0', 'c_1']
    to_drop = ['c_0', 'c_1']
    b = [df1[col] for col in a]

    df_merged = df1.withColumn('c', F.coalesce(*b)).drop(*to_drop)

    df_merged.show()

Output:

+---+----+----+----+
|  a|   c|   c|   c|
+---+----+----+----+
|  z|   a|null|null|
|  b|null|   c|null|
|  c|   b|null|null|
|  d|null|null|   z|
+---+----+----+----+

['a', 'c', 'c_0', 'c_1']

+---+---+
|  a|  c|
+---+---+
|  z|  a|
|  b|  c|
|  c|  b|
|  d|  z|
+---+---+
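The answer above hard-codes the list `['c', 'c_0', 'c_1']`. As a rough sketch of how that list could be derived automatically, the helper below groups temporary names by their original column name. The names `plan_merge` and `merge_duplicate_columns` and the `__<position>` suffix scheme are hypothetical, not part of the original answer, and the sketch assumes column names contain no dots or backticks. Only `plan_merge` runs here; the PySpark part is wrapped in a function so the block can be read and run without a Spark session.

```python
from collections import defaultdict


def plan_merge(names):
    """Return (unique_names, groups) for possibly duplicated column names.

    unique_names: one distinct temporary name per input column, by position.
    groups: maps each original name to the temporary names that carry it,
            in first-seen order.
    """
    unique_names = [f"{name}__{pos}" for pos, name in enumerate(names)]
    groups = defaultdict(list)
    for name, tmp in zip(names, unique_names):
        groups[name].append(tmp)
    return unique_names, dict(groups)


def merge_duplicate_columns(df):
    """Coalesce same-named columns of a PySpark DataFrame, left to right."""
    import pyspark.sql.functions as F  # imported lazily; only needed here

    unique_names, groups = plan_merge(df.columns)
    renamed = df.toDF(*unique_names)  # make every column addressable
    return renamed.select(
        *[F.coalesce(*[F.col(t) for t in tmps]).alias(name)
          for name, tmps in groups.items()]
    )


unique_names, groups = plan_merge(["a", "c", "c", "c"])
print(unique_names)  # ['a__0', 'c__1', 'c__2', 'c__3']
print(groups)        # {'a': ['a__0'], 'c': ['c__1', 'c__2', 'c__3']}
```

With a session available, `merge_duplicate_columns(df).show()` on the example `df` should give the same two-column result as `df_merged` above, since both take the first non-null value from left to right.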

Votes: 1

Stack Overflow user

Answered 2021-06-18 18:50:50

See the Scala code below; it may help.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.DataFrame

implicit class DFHelpers(df: DataFrame) {
   def mergeColumns(): DataFrame = {
       val dupColumns = df.columns
       // Give every column a unique temporary name by appending its position
       val newColumns = dupColumns.zipWithIndex.map { case (name, idx) => s"$name$idx" }
       // Group the temporary names by the full original name (not only its first
       // character), then coalesce each group back into a single column.
       // groupBy returns a Map, so the output column order is not guaranteed.
       val columns = dupColumns.zip(newColumns)
                        .groupBy(_._1)
                        .map { case (name, pairs) => s"""coalesce(${pairs.map(_._2).mkString(",")}) as $name""" }
                        .toSeq
       df.toDF(newColumns:_*).selectExpr(columns:_*)
   }
}

// Exiting paste mode, now interpreting.

scala> df.show(false)
+----+----+----+----+----+----+
|a   |b   |a   |c   |a   |b   |
+----+----+----+----+----+----+
|4   |null|null|8   |null|21  |
|null|8   |7   |6   |null|null|
|96  |null|null|null|null|78  |
+----+----+----+----+----+----+

scala> df.printSchema
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- a: string (nullable = true)
 |-- c: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

scala> df.mergeColumns.show(false)
+---+---+----+
|b  |a  |c   |
+---+---+----+
|21 |4  |8   |
|8  |7  |6   |
|78 |96 |null|
+---+---+----+
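To make the coalesce semantics concrete, here is a small plain-Python model (no Spark required) that reproduces the merged values from the Scala session above. The helpers `coalesce` and `merge_rows` are hypothetical names written for this illustration; they only mimic "first non-null value, left to right" on tuples and are not Spark APIs. Unlike the Scala `groupBy`, this sketch keeps columns in first-seen order.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def merge_rows(columns, rows):
    """Merge same-named columns row by row, keeping first-seen column order."""
    positions = {}
    for idx, name in enumerate(columns):
        positions.setdefault(name, []).append(idx)
    merged = [
        tuple(coalesce(*(row[i] for i in idxs)) for idxs in positions.values())
        for row in rows
    ]
    return list(positions), merged


# Same data as the Scala example (all columns are strings there)
columns = ["a", "b", "a", "c", "a", "b"]
rows = [
    ("4", None, None, "8", None, "21"),
    (None, "8", "7", "6", None, None),
    ("96", None, None, None, None, "78"),
]

names, merged = merge_rows(columns, rows)
print(names)   # ['a', 'b', 'c']
print(merged)  # [('4', '21', '8'), ('7', '8', '6'), ('96', '78', None)]
```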

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68028695
