Q: Removing punctuation from a list in Python

Stack Overflow user

Asked 2016-12-02 00:59:13

2 answers · 4.8K views · 0 followers · 0 votes

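I know this is a common question, but I haven't found an answer that applies. I'm trying to remove punctuation from a list of words that I got by scraping an HTML page in an earlier function. This is what I have: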

import re
def strip_text():
    list_words = get_text().split()
    print(list_words)
    for i in range(len(list_words)):
        list_words = re.sub("[^a-zA-Z]"," ",list_words)
        list_words = list_words.lower()
    return list_words
print(get_text())
print(strip_text())

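I realize this doesn't work because re.sub is meant to be used on a string, not a list. Is there an equally efficient way to do this? Should I turn the word list back into a string?

Edit: As I said, the text is scraped from an HTML page. The code that precedes the code above is as follows: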

from bs4 import BeautifulSoup
import requests
from collections import Counter
import re
tokens = []
types= Counter(tokens)
#str_book = ""
str_lines = ""
import string

def get_text(): 
   # str_lines = ""
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p') #finds all of the text between <p>
    i=0
    for p in text:
        i+=1
        line = p.get_text()
        if (i<10):
            continue
        print(line)
    return line

So the word list will be a list of all the words in the Agatha Christie book I'm using. Hope this helps.

python

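2 Answers

Stack Overflow user

Answered 2016-12-02 01:11:28

You don't need regex at all. string.punctuation contains all the ASCII punctuation characters. Just iterate over each word and skip those characters.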

>>> import string
>>> ["".join( j for j in i if j not in string.punctuation) for i in  lst]
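As a minimal, self-contained sketch of the one-liner above: `lst` is not defined in the answer, so the sample list here is made up for illustration, and `strip_punctuation` is a hypothetical helper name.

```python
import string

# Hypothetical sample data standing in for the asker's word list.
lst = ["Hello,", "world!", "it's", "re-entered", "..."]

def strip_punctuation(words):
    """Return a new list with every string.punctuation character removed."""
    punct = set(string.punctuation)  # set lookup is faster than scanning a str
    return ["".join(ch for ch in word if ch not in punct) for word in words]

cleaned = strip_punctuation(lst)
print(cleaned)  # ['Hello', 'world', 'its', 'reentered', '']
```

Note that a token made up only of punctuation becomes an empty string; if that is unwanted, filtering with `[w for w in cleaned if w]` is one way to drop it.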

Votes: 3

Stack Overflow user

Answered 2016-12-02 03:39:22

Looking at get_text(), it seems we need to change a few things before we can remove any punctuation. I've added some comments here.

def get_text(): 
    str_lines = []  # create an empty list
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p') #finds all of the text between <p>
    i=0
    for p in text:
        i+=1
        line = p.get_text()
        if (i<10):
            continue
        str_lines.append(line)  # append the current line to the list
    return str_lines  # return the list of lines

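First, I uncommented the str_lines variable and set it to an empty list. Next, I replaced the print statement with code that appends the line to the list of lines. Finally, I changed the return statement to return the list of lines.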
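The counter in get_text() skips the first nine paragraphs (i is 1 for the first paragraph, and anything with i<10 is skipped). The same skip-then-collect pattern can be sketched without a manual counter using itertools.islice. This is only an illustration: `paragraphs` is made-up data standing in for the text of each <p> element, and `collect_lines` is a hypothetical helper, not part of the original code.

```python
from itertools import islice

# Hypothetical stand-ins for the text of each <p> element on the page.
paragraphs = [f"paragraph {n}" for n in range(1, 13)]

def collect_lines(paragraph_texts, skip=9):
    """Skip the first `skip` items and return the remaining ones as a list."""
    return list(islice(paragraph_texts, skip, None))

str_lines = collect_lines(paragraphs)
print(str_lines)  # ['paragraph 10', 'paragraph 11', 'paragraph 12']
```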

For strip_text(), we can reduce it to a few lines of code:

def strip_text():    
    list_words = get_text()
    list_words = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in list_words]
    return list_words

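There's no need to work word by word, because we can process each whole line and strip the punctuation from it at once, so I removed split(). Note that this pattern replaces every non-letter character (digits included) with a space rather than deleting it. With a list comprehension we can transform every element of the list in one line, and I also put the lower() call in there to condense the code.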
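To make the effect of that regex concrete, here is a small sketch on made-up sample lines (not the Gutenberg text). It also shows one way to get back to a list of words afterwards, by calling split() on each cleaned line.

```python
import re

# Hypothetical sample lines standing in for the scraped paragraphs.
lines = ["The Secret Adversary, 1922!", "Don't stop."]

# Every character that is not an ASCII letter becomes a space.
spaced = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in lines]

# split() with no arguments collapses the runs of spaces left behind.
words = [s.split() for s in spaced]
print(words)  # [['the', 'secret', 'adversary'], ['don', 't', 'stop']]
```

Because the apostrophe is turned into a space, "don't" ends up as two tokens; the string.punctuation approaches delete the character instead and would give "dont".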

To implement the answer provided by @AhsanulHaque, you just need to replace the second line of the strip_text() function with it, like so:

def strip_text():
    list_words = get_text()
    list_words = ["".join(j.lower() for j in i if j not in string.punctuation)
                  for i in list_words]
    return list_words

For fun, here is the translate method I mentioned earlier, implemented for Python 3.x as described here:

def strip_text():
    list_words = get_text()
    translator = str.maketrans({key: None for key in string.punctuation})
    list_words = [s.lower().translate(translator) for s in list_words]
    return list_words
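The same translation table can also be built with the three-argument form of str.maketrans, where the third argument lists the characters to delete. The sketch below runs on made-up sample lines rather than on the output of get_text(), so it can be tried without any network access.

```python
import string

# Build the table once; the third argument lists the characters to delete.
translator = str.maketrans("", "", string.punctuation)

# Hypothetical sample lines standing in for the scraped paragraphs.
lines = ["Hello, World!", "It's (almost) done..."]

cleaned = [s.lower().translate(translator) for s in lines]
print(cleaned)  # ['hello world', 'its almost done']
```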

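Unfortunately, I can't time these against your specific code, because Gutenberg has temporarily blocked me (too many runs of the code too quickly, I guess).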
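One way to compare the three approaches without hitting Gutenberg at all is to time them on local data with timeit. The following is only a sketch: the sample sentence, the repeat counts, and the function names are assumptions made for illustration, and the numbers will vary by machine and input.

```python
import re
import string
import timeit

# Made-up sample data; swap in real lines to get meaningful numbers.
lines = ["It was 2 p.m. on the afternoon of May 7th, 1915."] * 1000

translator = str.maketrans("", "", string.punctuation)
punct = set(string.punctuation)

def with_regex():
    # Replaces every non-letter with a space (so the output differs from the others).
    return [re.sub("[^a-zA-Z]", " ", s.lower()) for s in lines]

def with_join():
    # Deletes punctuation character by character.
    return ["".join(ch.lower() for ch in s if ch not in punct) for s in lines]

def with_translate():
    # Deletes punctuation via a precomputed translation table.
    return [s.lower().translate(translator) for s in lines]

timings = {}
for func in (with_regex, with_join, with_translate):
    timings[func.__name__] = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s")
```

Keep in mind that the regex version is not strictly equivalent to the other two, since it also blanks out digits and leaves spaces behind, so the timings compare slightly different operations.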

Votes: 1

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/40916263
