Q: How do I convert a string-type column to int in a PySpark DataFrame?

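Stack Overflow user

Asked 2017-10-26 21:43:44

3 answers · 194.3K views · 0 followers · 64 votes

I have a DataFrame in PySpark. Some of its numeric columns contain 'nan', so when I read the data and check the DataFrame's schema, those columns have type 'string'. How can I change them to int? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the following code: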

data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

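My data looks like this:

[screenshot of the data not preserved]

Here the columns "Plays" and "drafts" contain integer values, but because nan is present in these columns, they are treated as string type.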

python

dataframe

pyspark

3 Answers

Stack Overflow user

Answered 2017-10-27 04:27:51

from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You could run a loop over the columns, but this is the simplest way to convert a string column to integer.

Votes: 120

Stack Overflow user

Answered 2018-02-22 04:14:06

After replacing NaN with 0, you can cast the column to int:

data_df = data_df.withColumn("Plays", data_df["Plays"].cast("int"))

Votes: 17

Stack Overflow user

Answered 2018-11-04 14:02:11

Another approach, if several fields need to be changed, is to use StructField.

For example:

from pyspark.sql.types import StructField,IntegerType, StructType,StringType
newDF=[StructField('CLICK_FLG',IntegerType(),True),
       StructField('OPEN_FLG',IntegerType(),True),
       StructField('I1_GNDR_CODE',StringType(),True),
       StructField('TRW_INCOME_CD_V4',StringType(),True),
       StructField('ASIAN_CD',IntegerType(),True),
       StructField('I1_INDIV_HHLD_STATUS_CODE',IntegerType(),True)
       ]
finalStruct=StructType(fields=newDF)
df=spark.read.csv('ctor.csv',schema=finalStruct)

Output:

Before:

root
 |-- CLICK_FLG: string (nullable = true)
 |-- OPEN_FLG: string (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)

After:

root
 |-- CLICK_FLG: integer (nullable = true)
 |-- OPEN_FLG: integer (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)

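This is a slightly longer way to do the conversion, but the advantage is that all the required fields can be handled at once.

Note that if data types are assigned only to the required fields, the resulting DataFrame will contain only the fields that were changed.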

Votes: 5

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/46956026
