I have a data file that looks like this:
+-----------------------------+
| Item |
+-----------------------------+
|[[a,b,c], [d,e,f], [g,h,i]] |
+-----------------------------+
How can I transform it into the following table?
a b c
d e f
g h i
I tried using the explode and withColumn functions, but instead got:
a b c
a e c
a h c
d b c
d e c
d h c
... (many other combinations)
Posted on 2022-01-02 15:44:55
Just explode the first level of the array, then you can select the array elements as columns:
import pyspark.sql.functions as F
df = spark.createDataFrame(
[([["a","b","c"], ["d","e","f"], ["g","h","i"]],)],
["Item"]
)
df.withColumn("Item", F.explode("Item")).select(
*[F.col("Item")[i].alias(f"col_{i}") for i in range(3)]
).show()
#+-----+-----+-----+
#|col_0|col_1|col_2|
#+-----+-----+-----+
#| a| b| c|
#| d| e| f|
#| g| h| i|
#+-----+-----+-----+
Posted on 2022-01-02 21:26:24
@blackbishop, improving on your answer...
import pyspark.sql.functions as F
df = spark.createDataFrame(
[([["a","b","c"], ["d","e","f"], ["g","h","i", "j"]],)],
["data"]
)
df.show(20, False)
df = df.withColumn("data1", F.explode("data"))
df.select('data1').show()
# The widest inner array has 4 elements: Row(max(size(data1))=4) ---> 4
max_size = df.select(F.max(F.size('data1'))).collect()[0][0]
df.select(
*[F.col("data1")[i].alias(f"col_{i}") for i in range(max_size)]
).show()
+------------------------------------+
|data |
+------------------------------------+
|[[a, b, c], [d, e, f], [g, h, i, j]]|
+------------------------------------+
+------------+
| data1|
+------------+
| [a, b, c]|
| [d, e, f]|
|[g, h, i, j]|
+------------+
+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
| a| b| c| null|
| d| e| f| null|
| g| h| i| j|
+-----+-----+-----+-----+
https://stackoverflow.com/questions/70557310