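I'm having some trouble phrasing this question, but I'll try to explain. I know how to explode a single array column, but I have several array columns whose elements are aligned with each other by index. In my dataframe, exploding each column basically just performs a useless cross join, producing dozens of invalid rows. So I'll start by showing the data.
This shows some results from SparkNLP: some text and four sets of features for that text. Each column from tr to nr contains an array, and each of these arrays is aligned with the others.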
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|ID| text| tr| lr| pr| nr|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
What I want is a new dataframe containing the ID and text, plus one row per index i holding the i-th item from every array, as shown below:
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|ID| text| tr| lr| pr| nr| token| lemma|pos| ner|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| thing| thing| NN| O|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| :| :| :| O|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]|MacKay| MacKay|NNP|I-PER|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| rolls| roll|NNS| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| thing| thing| NN| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| :| :| :| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|MacKay| MacKay|NNP|I-PER|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| roll| roll|NNS| O|
|11|...
...
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| *| *| NN| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| I| I|PRP| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| would| would| MD| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| like| like| VB| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| to| ...|...| O|
|12|...
...
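+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
I don't need the tr through nr columns in the output, but I left them in for clarity.
Is there a way to do this?
Also, is there a way to extract the array index at the same time (added to the output rows)?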
Posted on 2020-11-11 11:07:33
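One option in this scenario is to explode the individual columns with withColumn expressions. Suppose you have loaded the dataset as an initial dataframe df. You would then do something like the following.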
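import org.apache.spark.sql.functions.explode
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark
val df = <load initial dataset>
// Each withColumn replaces an array column with one row per element of that array.
val df1 = df.select($"ID", $"text", $"tr", $"lr", $"pr", $"nr")
  .withColumn("tr", explode($"tr"))
  .withColumn("lr", explode($"lr"))
  .withColumn("pr", explode($"pr"))
  .withColumn("nr", explode($"nr"))
This adds the array values to the records labelled with ID and text. The drawback of this approach is that it increases the record count and duplicates the non-array columns. Because each explode multiplies the rows, chaining four of them yields every combination of elements (the cross join described in the question) rather than index-aligned rows.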
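If the goal is one row per array index, with the index included, a possible alternative is to zip the arrays into a single array of structs with arrays_zip and explode that once with posexplode. The following is only a minimal sketch under some assumptions: Spark 2.4 or later (where arrays_zip is available), a toy one-row dataframe standing in for the question's data, and illustrative names (idx, z, exploded) that do not come from the original post. It is written as a script, for example for pasting into spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, col, posexplode}

// Local session for the sketch; in a real job, reuse the existing SparkSession.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("aligned-explode-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the Spark NLP output in the question (ID 10 only).
val df = Seq(
  (10, "thing: MacKay rolls",
    Seq("thing", ":", "MacKay", "rolls"), // tr: tokens
    Seq("thing", ":", "MacKay", "roll"),  // lr: lemmas
    Seq("NN", ":", "NNP", "NNS"),         // pr: part-of-speech tags
    Seq("O", "O", "I-PER", "O"))          // nr: NER tags
).toDF("ID", "text", "tr", "lr", "pr", "nr")

val exploded = df
  // arrays_zip builds one array whose i-th element is a struct of the i-th
  // items of tr, lr, pr and nr; posexplode then emits one row per element
  // together with its position in the array.
  .select(col("ID"), col("text"),
    posexplode(arrays_zip(col("tr"), col("lr"), col("pr"), col("nr")))
      .as(Seq("idx", "z")))
  // Flatten the struct and rename positionally, so the sketch does not
  // depend on how arrays_zip names the struct fields.
  .select(col("ID"), col("text"), col("idx"), col("z.*"))
  .toDF("ID", "text", "idx", "token", "lemma", "pos", "ner")

exploded.show(truncate = false)
```

Because the arrays are zipped before exploding, the row count grows with the array length rather than with the product of the four lengths, and idx answers the second part of the question. If the arrays can differ in length, arrays_zip pads the shorter ones with nulls, so those rows would need separate handling.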
https://stackoverflow.com/questions/60065085