在Pyspark Dataframe中选择列

内容来源于 Stack Overflow,并遵循CC BY-SA 3.0许可协议进行翻译与使用

  • 回答 (2)
  • 关注 (0)
  • 查看 (334)

我正在寻找一种方法来在pyspark中选择我的数据帧的列。对于第一行,我知道我可以使用df.first()但不确定列,因为它们没有列名。

我有5列,想要遍历每一列。

+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|
提问于
用户回答回答于

数据集ss.csv包含我感兴趣的一些列:

ss_ = spark.read.csv("ss.csv", header= True, 
                      inferSchema = True)
ss_.columns
['Reporting Area', 'MMWR Year', 'MMWR Week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017, flag', 'Shiga toxin-producing Escherichia coli, Current week', 'Shiga toxin-producing Escherichia coli, Current week, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max, flag', 'Shiga toxin-producing Escherichia coli, Cum 2018', 'Shiga toxin-producing Escherichia coli, Cum 2018, flag', 'Shiga toxin-producing Escherichia coli, Cum 2017', 'Shiga toxin-producing Escherichia coli, Cum 2017, flag', 'Shigellosis, Current week', 'Shigellosis, Current week, flag', 'Shigellosis, Previous 52 weeks Med', 'Shigellosis, Previous 52 weeks Med, flag', 'Shigellosis, Previous 52 weeks Max', 'Shigellosis, Previous 52 weeks Max, flag', 'Shigellosis, Cum 2018', 'Shigellosis, Cum 2018, flag', 'Shigellosis, Cum 2017', 'Shigellosis, Cum 2017, flag']

但我只需要几个:

columns_lambda = lambda k: k.endswith(', Current week') or k == 'Reporting Area' or k == 'MMWR Year' or  k == 'MMWR Week'

过滤器返回所需列的列表,列表被评估:

sss = filter(columns_lambda, ss_.columns)
to_keep = list(sss)

将所需列的列表解压缩为dataframe select函数的参数,该函数返回仅包含列表中列的数据集:

dfss = ss_.select(*to_keep)
dfss.columns

结果:

['Reporting Area',
 'MMWR Year',
 'MMWR Week',
 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week',
 'Shiga toxin-producing Escherichia coli, Current week',
 'Shigellosis, Current week']

df.select()具有互补对:http//spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop

删除列的列表。

用户回答回答于

前两列五行

 df.select(df.columns[:2]).take(5)

扫码关注云+社区

领取腾讯云代金券