
PySpark: converting a list of strings to ArrayType()

Stack Overflow user
Asked on 2021-11-11 05:05:58
2 answers · 39 views · 0 followers · 1 vote

I'm still fairly new to PySpark and need some guidance. I'm working with some text data, and ultimately I want to get rid of words that appear either too rarely or too frequently across the corpus.

The data looks like this, with each row containing one sentence:

+--------------------+
|             cleaned|
+--------------------+
|China halfway com...|
|MCI overhaul netw...|
|script kiddy join...|
|look Microsoft Mo...|
|Americans appear ...|
|Oil Eases Venezue...|
|Americans lose be...|
|explosion Echo Na...|
|Bush tackle refor...|
|jail olympic pool...|
|coyote sign RW Jo...|
|home pc key Windo...|
|bomb defuse Blair...|
|Livermore   need ...|
|hat ring fast Wi ...|
|Americans dutch s...|
|Insect Vibrations...|
|Britain sleepwalk...|
|Ron Regan Jr Kind...|
|IBM buy danish fi...|
+--------------------+

So basically I split the strings with split() from pyspark.sql.functions, count the occurrences of each word, come up with some criteria, and create a list of words that need to be removed.
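A minimal sketch of that step, with placeholder frequency thresholds standing in for the actual criteria (df here stands for the input DataFrame holding the cleaned column):

import pyspark.sql.functions as F

# Split each cleaned sentence into word tokens
splitworddf = df.withColumn('split', F.split('cleaned', ' '))

# Count how often each word occurs across the whole corpus
word_counts = splitworddf.select(F.explode('split').alias('word')) \
                         .groupBy('word').count()

# Placeholder thresholds: collect words that are too rare or too common
list_of_words_to_get_rid = [
    row['word']
    for row in word_counts.filter(
        (F.col('count') < 5) | (F.col('count') > 1000)
    ).collect()
]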

Then I use the following functions:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import *


def remove_stop_words(list_of_tokens, list_of_stopwords):
    '''
    A very simple function that takes in a list of word tokens and then gets rid of words that are in the stopwords list
    '''
    return [token for token in list_of_tokens if token not in list_of_stopwords]

def udf_remove_stop_words(list_of_stopwords):
    '''
    creates a udf that takes in a list of stop words and passes them on to remove_stop_words
    '''
    return udf(lambda x: remove_stop_words(x, list_of_stopwords))

wordsNoStopDF = splitworddf.withColumn('removed', udf_remove_stop_words(list_of_words_to_get_rid)(col('split')))

where list_of_words_to_get_rid is the list of words I'm trying to remove. The input to this pipeline looks like this:

+--------------------+
|               split|
+--------------------+
|[China, halfway, ...|
|[MCI, overhaul, n...|
|[script, kiddy, j...|
|[look, Microsoft,...|
|[Americans, appea...|
|[Oil, Eases, Vene...|
|[Americans, lose,...|
|[explosion, Echo,...|
|[Bush, tackle, re...|
|[jail, olympic, p...|
+--------------------+
only showing top 10 rows

The output looks like this, along with the corresponding schema:

+--------------------+--------------------+
|               split|             removed|
+--------------------+--------------------+
|[China, halfway, ...|[China, halfway, ...|
|[MCI, overhaul, n...|[MCI, overhaul, n...|
|[script, kiddy, j...|[script, join, fo...|
|[look, Microsoft,...|[look, Microsoft,...|
|[Americans, appea...|[Americans, appea...|
|[Oil, Eases, Vene...|[Oil, Eases, Vene...|
|[Americans, lose,...|[Americans, lose,...|
|[explosion, Echo,...|[explosion, Echo,...|
|[Bush, tackle, re...|[Bush, tackle, re...|
|[jail, olympic, p...|[jail, olympic, p...|
|[coyote, sign, RW...|[coyote, sign, Jo...|
|[home, pc, key, W...|[home, pc, key, W...|
|[bomb, defuse, Bl...|[bomb, defuse, Bl...|
|[Livermore, , , n...|[Livermore, , , n...|
|[hat, ring, fast,...|[hat, ring, fast,...|
|[Americans, dutch...|[Americans, dutch...|
|[Insect, Vibratio...|[tell, Good, Time...|
|[Britain, sleepwa...|[Britain, big, br...|
|[Ron, Regan, Jr, ...|[Ron, Jr, Guy, , ...|
|[IBM, buy, danish...|[IBM, buy, danish...|
+--------------------+--------------------+

root
 |-- split: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- removed: string (nullable = true)

So my question is: how can I convert the removed column into an array like split? I want to use explode to count word occurrences, but I can't quite figure out how to get there. I tried using regexp_replace to strip the square brackets and then splitting the string with , as the pattern, but that only seemed to add square brackets to the removed column.

Is there some change I can make to the functions I'm using so that they return an array of strings, like the split column?

Any guidance here would be much appreciated!


2 Answers

Stack Overflow user

Posted on 2021-11-11 05:10:35

You haven't defined a return type for your UDF, which is StringType by default; that's why the removed column you get is a string. You can add the return type like this:

from pyspark.sql import types as T

udf(lambda x: remove_stop_words(x, list_of_stopwords), T.ArrayType(T.StringType()))
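Applied to the udf_remove_stop_words helper from the question, the fix would look roughly like this (a sketch; remove_stop_words, splitworddf, and list_of_words_to_get_rid are as defined in the question):

from pyspark.sql import types as T
from pyspark.sql.functions import udf, col

def udf_remove_stop_words(list_of_stopwords):
    # Declaring the return type makes Spark produce a real array column
    # instead of the string representation of a Python list
    return udf(lambda x: remove_stop_words(x, list_of_stopwords),
               T.ArrayType(T.StringType()))

wordsNoStopDF = splitworddf.withColumn(
    'removed', udf_remove_stop_words(list_of_words_to_get_rid)(col('split'))
)
wordsNoStopDF.printSchema()
# removed now shows up as array (element: string), so explode() works on it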
Votes: 0

Stack Overflow user

Posted on 2021-11-11 12:21:58

You can change the return type of the UDF. However, I'd suggest not using a UDF at all to remove the words in list_of_words_to_get_rid from the array column split, since you can simply use the Spark built-in function array_except.

Here's an example:

import pyspark.sql.functions as F

df = spark.createDataFrame([("a simple sentence containing some words",)], ["cleaned"])

list_of_words_to_get_rid = ["some", "a"]

wordsNoStopDF = df.withColumn(
    "split",
    F.split("cleaned", " ")
).withColumn(
    "removed",
    F.array_except(
        F.col("split"),
        F.array(*[F.lit(w) for w in list_of_words_to_get_rid])
    )
).drop("cleaned")

wordsNoStopDF.show(truncate=False)
#+----------------------------------------------+-------------------------------------+
#|split                                         |removed                              |
#+----------------------------------------------+-------------------------------------+
#|[a, simple, sentence, containing, some, words]|[simple, sentence, containing, words]|
#+----------------------------------------------+-------------------------------------+
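Once removed is a real array column, the word counting the question is aiming for becomes straightforward with explode (a short sketch continuing from the DataFrame above):

# Explode the array into one row per word, then count corpus-wide occurrences
word_counts = wordsNoStopDF.select(F.explode("removed").alias("word")) \
                           .groupBy("word").count()
word_counts.show()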
Votes: 0
Original content from Stack Overflow.
Original link: https://stackoverflow.com/questions/69923446
