Q: How do I convert HTML text to plain text with PySpark? Replacing HTML tags in a string

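Stack Overflow user

Asked 2019-11-11 15:21:20

Answers: 1 | Views: 687 | Followers: 0 | Votes: 0

I have a text file with a column 'descn' that contains text in HTML format. I want to convert that HTML text to plain text using PySpark. Please help me do this.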

File name:

mdcl_insigt.txt

Input:

PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>

It should be converted like this. Output:

PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.

pyspark

pyspark-sql

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-11-11 16:16:15

You can try regexp_replace():

from pyspark.sql.functions import regexp_replace

df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))

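The regex is not perfect and may fail on some input. It only removes tags, so HTML entities such as &#39; are left as they are. Do some more research to make it better.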
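One case where the pattern `<[^>]+>` breaks is a `>` inside a quoted attribute value, because the match stops at the first `>` it sees. As a hedged alternative, here is a minimal sketch that uses Python's standard-library `html.parser` instead of a regex. The names `strip_tags` and `with_parsed_descn` are made up for illustration and are not part of PySpark. The UDF wiring assumes `df` has a string column called `descn`, as in the question. Note that a Python UDF is generally slower than the built-in `regexp_replace`, and that this parser still passes the contents of `<script>` and `<style>` elements through as text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &#39;
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html_text):
    """Return the text content of html_text, or None for a null input."""
    if html_text is None:
        return None
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def with_parsed_descn(df, src="descn", dst="parsed_descn"):
    """Hypothetical helper: apply strip_tags to a DataFrame column via a UDF."""
    # Imported lazily so strip_tags can be tried without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return df.withColumn(dst, udf(strip_tags, StringType())(src))
```

Usage would look like `df = with_parsed_descn(df)`, after which `parsed_descn` holds the stripped text.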

When I tried it on regexr, it worked on your example string.

PySpark output:

from pyspark.sql import functions as F

df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()

[Row(parsed='PROTEUSÂ We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
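The output above still contains the entity `&#39;` and a stray `Â`, so it does not yet match the output the question asks for. The sketch below shows one possible post-processing step. It keeps the answer's regex, decodes entities with `html.unescape`, and tidies whitespace. It assumes the `Â` is encoding debris from a non-breaking space that was read with the wrong charset. That is a guess about this data, and the rule would damage text where `Â` is a real letter followed by a space. `clean_descn` and `with_clean_descn` are hypothetical names.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")               # same pattern as in the answer
_STRAY_A = re.compile("\u00c2(?=[\u00a0 ])")  # 'Â' directly before a (non-breaking) space
_WHITESPACE = re.compile(r"\s+")            # \s also matches U+00A0 for str patterns


def clean_descn(value):
    """Strip tags, decode entities, and normalise whitespace; None stays None."""
    if value is None:
        return None
    text = _TAG.sub("", value)         # remove tags first ...
    text = html.unescape(text)         # ... then decode entities, e.g. &#39; -> '
    text = _STRAY_A.sub("", text)      # assumption: this 'Â' is mojibake, not content
    return _WHITESPACE.sub(" ", text).strip()


def with_clean_descn(df, src="descn", dst="parsed"):
    """Hypothetical helper: run clean_descn over a DataFrame column as a UDF."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return df.withColumn(dst, udf(clean_descn, StringType())(src))
```

Decoding entities after removing tags matters: text such as `&lt;b&gt;` then survives as the literal characters `<b>` instead of being deleted as if it were a tag. If the `Â` really does come from a charset mismatch, reading the file with the correct encoding is the cleaner fix.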

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/58797064
