假设我有这样的数据
val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")如何在一列中获取cid值,以及如何在array < struct < column_name, column_value > >中获取其馀值?
发布于 2019-05-10 14:04:48
唯一的困难是数组必须包含相同类型的元素。因此,在将所有列放入数组之前,都需要将它们转换为字符串(在您的示例中,age是一个int )。情况如下:
val cols = customer.columns.tail
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+
result.printSchema()
root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)发布于 2019-05-10 14:03:41
您可以使用数组和struct函数来完成这个任务:
customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))
将使:
 |-- cid: string (nullable = true)
 |-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- column_name: string (nullable = false)
 |    |    |-- column_value: string (nullable = true)发布于 2020-03-06 19:36:28
映射列可能是处理总体问题的更好方法。您可以在同一个映射中保留不同的值类型,而不必将其转换为string。
df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )或者,如果您想要的话,可以将它封装在一个数组中。
这样,您仍然可以对相关的键或值进行数值或字符串转换。例如:
df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )
df.select('*', 
      map_concat( col('cid'), create_map(lit('u_age'),when(col('map_col')['age'] < 18, True)))
)希望这是有意义的,在这里直接输入,所以如果某个地方缺少一个括号,请原谅
https://stackoverflow.com/questions/56078815
复制相似问题