Filtering foreign stop words from a text file

Stack Overflow user

Asked 2014-08-27 06:02:20

3 answers, 576 views, 0 followers, 0 votes

I have a list of movie titles in English and several foreign languages, compiled into a text file with each title on its own line:

Kein Pardon
Kein Platz für Gerold
Kein Sex ist auch keine Lösung
Keine Angst Liebling, ich pass schon auf
Keiner hat das Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
La Prima Donna
La Primeriza
La Prison De Saint-Clothaire
La Puppe
La Pájara
La Pérgola de las Flores

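I have also compiled a short list of non-English stop words that I want to filter out of the text file, e.g. La, de, las, das. What can I do to read in my text, filter out the words, and then print the filtered list to a new text file in the original format? The desired output should look roughly like this: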

Kein Pardon
Kein Platz für Gerold
Kein Sex keine Lösung
Keine Angst Liebling, pass schon
Keiner hat Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
Prima Donna
Primeriza
Prison Saint-Clothaire
Puppe
Pájara
Pérgola Flores

python

stop-words

3 Answers

Stack Overflow user

Answered 2014-08-27 06:24:14

You can use the re module (https://docs.python.org/2/library/re.html#re.sub) to replace the unwanted strings with an empty string. Something like this should work:

    import re
    #save your undesired text here. You can use a different data structure
    #  if the list is big and later build your match string like below
    unDesiredText = 'abc|bcd|vas'

    #set your inputFile and outputFile appropriately
    fhIn = open(inputFile, 'r')
    fhOut = open(outputFile, 'w')

    for line in fhIn:
        line = re.sub(unDesiredText, '', line)
        fhOut.write(line)

    fhIn.close()
    fhOut.close()
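
One caveat with a plain alternation such as `'abc|bcd|vas'` is that it also matches inside longer words (a stop word `la` would hit `Platz`). Below is a minimal sketch of building the match string from a list, assuming whole-word, case-insensitive matching is what is wanted. `build_stop_word_pattern` and `strip_stop_words` are hypothetical helper names, not part of `re`, and the stop word list is only illustrative:

```python
import re

def build_stop_word_pattern(stop_words):
    """Compile a regex that matches any stop word as a whole word."""
    # re.escape guards against stop words containing regex metacharacters.
    alternation = '|'.join(re.escape(word) for word in stop_words)
    # \b on both sides restricts matches to whole words.
    return re.compile(r'\b(?:' + alternation + r')\b', re.IGNORECASE)

def strip_stop_words(line, pattern):
    """Remove stop words from one line and tidy the leftover whitespace."""
    without_stops = pattern.sub('', line)
    # split()/join collapses the runs of spaces left by removed words.
    return ' '.join(without_stops.split())

# Illustrative stop word list taken from the question text.
pattern = build_stop_word_pattern(['la', 'de', 'las', 'das'])
print(strip_stop_words('La Prison De Saint-Clothaire', pattern))
print(strip_stop_words('Kein Platz für Gerold', pattern))
```

The helper can be dropped into the loop above in place of the bare `re.sub` call, writing `strip_stop_words(line, pattern) + '\n'` for each line.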

Votes: 1

Stack Overflow user

Answered 2014-08-27 06:53:06

Another approach, if you are interested in exception handling and other related details:

import re

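stop_words = ['de', 'la', 'el']
# \b anchors match whole words only, so 'la' does not match inside 'Platz'
pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'
prog = re.compile(pattern, re.IGNORECASE)  # re.IGNORECASE to catch both 'La' and 'la'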

input_file_location = 'in.txt'
output_file_location = 'out.txt'

with open(input_file_location, 'r') as fin:
    with open(output_file_location, 'w') as fout:
        for l in fin:
            m = prog.sub('', l.strip())  # l.strip() to remove leading/trailing whitespace
            m = re.sub(' +', ' ', m)  # suppress multiple white spaces
            fout.write('%s\n' % m.strip())
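
If regular expressions feel heavy, the same filtering can be sketched by splitting each line on whitespace and dropping tokens found in a set. This assumes Python 3, that words are whitespace-separated, and that a word with punctuation attached (e.g. `Liebling,`) should be kept as-is. `filter_line`, `filter_titles`, the UTF-8 encoding, and the extra German stop words (inferred from the desired output in the question) are all assumptions:

```python
# Illustrative stop words; the German ones are inferred from the desired output.
STOP_WORDS = {'la', 'de', 'las', 'das', 'ist', 'auch', 'ich', 'auf'}

def filter_line(line, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [token for token in line.split() if token.lower() not in stop_words]
    return ' '.join(kept)

def filter_titles(in_path, out_path, encoding='utf-8'):
    """Copy in_path to out_path, one title per line, minus stop words."""
    with open(in_path, 'r', encoding=encoding) as fin, \
         open(out_path, 'w', encoding=encoding) as fout:
        for line in fin:
            fout.write(filter_line(line) + '\n')

print(filter_line('Keine Angst Liebling, ich pass schon auf'))
```

Calling `filter_titles('in.txt', 'out.txt')` would then process the whole file. Opening both files with an explicit encoding also helps keep characters such as `ü` and `é` intact, provided the file really is in that encoding.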

Votes: 1

Stack Overflow user

Answered 2014-08-27 06:38:57

Read in the file:

with open('file', 'r') as f:
    inText = f.read()

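I have a function to which you can pass a string you don't want in the text; it operates on the whole text at once rather than line by line. Also, since you want to work with the text globally, I suggest creating a class: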

class changeText( object ):
    def __init__(self, text):
        self.text = text
    def erase(self, badText):
        # str.replace returns a new string, so assign the result back
        self.text = self.text.replace(badText, '')

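However, when you replace a word with an empty string you end up with double spaces, as well as \n followed by a space, so add a method to clean up the resulting text.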

    def cleanup(self):
        self.text = self.text.replace('  ', ' ')
        self.text = self.text.replace('\n ', '\n')

Initialize the object:

textObj = changeText( inText )

Then iterate over your list of bad words (`badWords`) and clean up:

for bw in badWords:
    textObj.erase(bw)
textObj.cleanup()

Finally, write it out:

with open('newfile', 'w') as f:  # open for writing, not reading
    f.write(textObj.text)
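
Assembled into one piece, the whole-text approach might look like the sketch below. It keeps the class shape from this answer but, as a deliberate departure, swaps plain `str.replace` for a whole-word regex so that `la` does not eat part of `Platz`. The class name `ChangeText`, the `bad_words` list, and the sample text are illustrative assumptions:

```python
import re

class ChangeText(object):
    def __init__(self, text):
        self.text = text

    def erase(self, bad_word):
        # Whole-word, case-insensitive removal; str.replace would also hit substrings.
        self.text = re.sub(r'\b' + re.escape(bad_word) + r'\b', '', self.text,
                           flags=re.IGNORECASE)

    def cleanup(self):
        # Collapse runs of spaces/tabs, then trim a space at either edge of each line.
        self.text = re.sub(r'[ \t]+', ' ', self.text)
        self.text = re.sub(r'^ | $', '', self.text, flags=re.MULTILINE)

bad_words = ['la', 'de', 'las', 'das']  # illustrative list
text_obj = ChangeText('La Puppe\nLa Prison De Saint-Clothaire\nKein Platz\n')
for bad_word in bad_words:
    text_obj.erase(bad_word)
text_obj.cleanup()
print(text_obj.text)
```

Because the whole file is held in memory as one string, this suits short lists like the one in the question; for very large files a line-by-line loop is the safer choice.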

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25515881
