I have a Spark DataFrame with a column of JSON data:
df = spark.createDataFrame(
    [
        (1, '{"a": "hello"}'),
        (2, '{"b": ["foo", "bar"]}'),
        (3, '{"c": {"cc": "baz"}}'),
        (4, '{"d": [{"dd": "foo"}, {"dd": "bar"}]}'),
    ],
    schema=['id', 'jsonData'],
)
df.show()
+---+--------------------+
| id| jsonData|
+---+--------------------+
| 1| {"a": "hello"}|
| 2|{"b": ["foo", "ba...|
| 3|{"c": {"cc": "baz"}}|
| 4|{"d": [{"dd": "fo...|
+---+--------------------+
The keys act as schema identifiers, i.e. the same key never appears with two different schemas.
I need to parse the JSON in this column and get the value out of each dict.
I ran the following commands:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df = df.withColumn("jsonParsedData", from_json("jsonData", json_schema))
df.show()
+---+--------------------+--------------------+
| id| jsonData| jsonParsedData|
+---+--------------------+--------------------+
| 1| {"a": "hello"}| [hello,,,]|
| 2|{"b": ["foo", "ba...| [, [foo, bar],,]|
| 3|{"c": {"cc": "baz"}}| [,, [baz],]|
| 4|{"d": [{"dd": "fo...|[,,, [[foo], [bar]]]|
+---+--------------------+--------------------+
As a result I got a jsonParsedData column with (null) values for the keys that are missing in each row.
Question: how can I parse the JSON from the jsonData column and get a jsonParsedData column without the missing keys' values? I think the jsonParsedData column should have a string type.
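(For reference, the nulls in jsonParsedData come from the schema-inference step: spark.read.json infers one schema over all rows and merges every key into a single struct, so each parsed row carries null for the keys it does not contain. A quick way to see this, using the json_schema variable defined above; the printed form is a sketch and may vary slightly by Spark version:)

# The inferred schema is the union of every key seen in the column,
# which is why each parsed row has empty slots for the keys it lacks.
print(json_schema.simpleString())
# roughly: struct<a:string,b:array<string>,c:struct<cc:string>,d:array<struct<dd:string>>>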
Expected result:
+---+--------------------+--------------------+
| id| jsonData| jsonParsedData|
+---+--------------------+--------------------+
| 1| {"a": "hello"}| hello|
| 2|{"b": ["foo", "ba...| [foo, bar]|
| 3|{"c": {"cc": "baz"}}| {"cc": "baz"}|
| 4|{"d": [{"dd": "fo...|[{"dd": "foo"}, {...|
+---+--------------------+--------------------+
Posted on 2021-03-02 03:12:47
You can try extracting the value from the JSON with regexp_extract:
import pyspark.sql.functions as F

# Capture everything between the single top-level key and the closing brace
df2 = df.withColumn('jsonParsedData', F.regexp_extract('jsonData', '\\{"[^"]+": (.*)\\}', 1))
df2.show(truncate=False)
+---+-------------------------------------+------------------------------+
|id |jsonData |jsonParsedData |
+---+-------------------------------------+------------------------------+
|1 |{"a": "hello"} |"hello" |
|2 |{"b": ["foo", "bar"]} |["foo", "bar"] |
|3 |{"c": {"cc": "baz"}} |{"cc": "baz"} |
|4 |{"d": [{"dd": "foo"}, {"dd": "bar"}]}|[{"dd": "foo"}, {"dd": "bar"}]|
+---+-------------------------------------+------------------------------+
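Note that with regexp_extract a plain string value keeps its surrounding quotes ("hello" rather than the bare hello in the expected result). One possible follow-up, not part of the original answer, is to strip a single pair of surrounding quotes afterwards:

import pyspark.sql.functions as F

# Hypothetical tweak: remove one leading/trailing double quote so plain string
# values match the expected output; arrays and objects are left untouched.
df2 = df2.withColumn('jsonParsedData', F.regexp_replace('jsonParsedData', '^"|"$', ''))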
Another, probably better, way is to use from_json with a map<string,string> schema:
import pyspark.sql.functions as F
# Parse each row as a single-entry map and take its (only) value
df2 = df.withColumn('jsonParsedData', F.map_values(F.from_json('jsonData', 'map<string,string>'))[0])
df2.show(truncate=False)
+---+-------------------------------------+---------------------------+
|id |jsonData |jsonParsedData |
+---+-------------------------------------+---------------------------+
|1 |{"a": "hello"} |hello |
|2 |{"b": ["foo", "bar"]} |["foo","bar"] |
|3 |{"c": {"cc": "baz"}} |{"cc":"baz"} |
|4 |{"d": [{"dd": "foo"}, {"dd": "bar"}]}|[{"dd":"foo"},{"dd":"bar"}]|
+---+-------------------------------------+---------------------------+
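With this approach jsonParsedData comes out as a plain string column (nested values are re-serialized as compact JSON text), which is the type the question asks for. If the original key is needed as well, map_keys can be used in the same way; a minimal sketch, not part of the original answer:

import pyspark.sql.functions as F

# Parse each row as a single-entry map once, then pull out both the key and the value.
parsed = F.from_json('jsonData', 'map<string,string>')
df3 = (df
       .withColumn('jsonKey', F.map_keys(parsed)[0])
       .withColumn('jsonParsedData', F.map_values(parsed)[0]))
df3.printSchema()  # jsonKey and jsonParsedData are both string columns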
https://stackoverflow.com/questions/66427633