I'm testing the MLlib tokenizers with pySpark (Python 3):
# -*- coding: utf-8 -*-
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.ml.feature import Tokenizer, RegexTokenizer
# Creating dataframe
sentenceData = spark.createDataFrame([
(["Eu acho que MLlib é incrível!"]),
(["Muito mais legal do que scikit-learn"])
], ["raw"])
# Putting sequential indexer on DataFrame
w = Window.orderBy('raw')
sentenceData = sentenceData.withColumn("id", row_number().over(w))
# Configuring regexTokenizer
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
# Applying Tokenizer to dataset
sentenceData = regexTokenizer.transform(sentenceData)
sentenceData.select(
'id','raw','words'
).show(truncate=False)
The result is:
+---+------------------------------------+--------------------------------------------+
|id |raw |words |
+---+------------------------------------+--------------------------------------------+
|1 |Eu acho que MLlib é incrível! |[eu, acho, que, mllib, incr, vel] |
|2 |Muito mais legal do que scikit-learn|[muito, mais, legal, do, que, scikit, learn]|
+---+------------------------------------+--------------------------------------------+
As you can see, the word "incrível" (Portuguese for "amazing") gets split into two "new words" because of the character "í". I couldn't find anything in the documentation that addresses this, so I'm stuck!
I've tried changing the "pattern" in the regexTokenizer configuration, including using "í" directly and other patterns built around the "\w" character (something like "\Wí\w+"), but nothing works! Is there a way to set 'Portuguese' in the pattern, or to somehow force Spark not to ignore the accents?
Thanks!
Posted on 2020-03-14 04:02:27
Try
pattern="[\\p{L}\\w]+"
It worked for me using Scala code, as follows:
val tokenizer = new RegexTokenizer().setGaps(false)
.setPattern("[\\p{L}\\w]+")
.setInputCol("raw")
.setOutputCol("words")
https://stackoverflow.com/questions/59628739