I'm trying to read a CSV file with the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema causes the stops field to spill over into the following columns and corrupts the data.
If I supply my own schema, such as:
schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
the result is this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
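So how can I read this CSV correctly without hitting this failure?
Answered 2022-04-22 15:15:58
Because the CSV data source doesn't support arrays, you need to read the column as a string first and then convert it.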
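from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
# Set the escape option to ", since the default escape character is a backslash (\).
df = spark.read.csv('file.csv', header=True, escape='"')
# Parse the JSON text in 'stops' into an array<string> column.
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
Answered 2022-04-22 14:03:00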
I think this is what you're looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if this helps.
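As a Spark-free illustration of the first answer's idea (read every column as a string, then convert the stops text), here is a minimal sketch using only Python's standard csv and json modules. It is not the PySpark API: the RAW string and the parse_rows helper are hypothetical names introduced for this example, and the sample rows are copied from the question. The point is only to show that the doubled quotes ("") decode to a JSON array string, which can then be parsed in a second step.

```python
import csv
import io
import json

# Sample rows copied from the question (hypothetical in-memory stand-in for the file).
RAW = '''name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
'''


def parse_rows(text):
    """Read every column as a string, then convert 'stops' from JSON text."""
    rows = []
    # csv.DictReader treats "" inside a quoted field as a literal quote by
    # default (doublequote=True), a role similar to escape='"' in the Spark reader.
    for row in csv.DictReader(io.StringIO(text)):
        # An empty field means the array is missing; keep it as None.
        row["stops"] = json.loads(row["stops"]) if row["stops"] else None
        # float() tolerates the leading space before the number.
        row["cost"] = float(row["cost"])
        rows.append(row)
    return rows


rows = parse_rows(RAW)
```

With this two-step parse the commas inside the quoted array no longer split the row, so each record keeps exactly five fields.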
https://stackoverflow.com/questions/71969652