Q: Find the most common words from a website in Python 3

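Stack Overflow user

Asked 2014-06-24 21:13:35

Answers: 4, Views: 9.6K, Followers: 0, Votes: 5

I need to find and copy the words that appear more than 5 times on a given website using Python 3 code, and I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far: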

   from urllib.request import urlopen
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

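Does anyone have any advice on what to do? I have NLTK installed and I've looked into Beautiful Soup, but for the life of me I have no idea how to install it correctly (I'm very Python-green)! Since I'm learning, any explanation would also be much appreciated. Thank you :)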

web-crawler

nltk

python

beautifulsoup

4 Answers

Stack Overflow user

Accepted answer

Answered 2014-06-24 21:38:47

This is not perfect, but it is an idea of how to get you started using requests, BeautifulSoup and collections.Counter.

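import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Download the page
r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")

# Join the text nodes of each <p> element (a lazy generator, one string per paragraph)
text = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))

# Split on whitespace, strip trailing punctuation, lowercase, and count
c = Counter(x.rstrip(punctuation).lower() for y in text for x in y.split())
print(c.most_common())  # prints the words, starting with the most common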

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print([x for x in c if c[x] > 5])  # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
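Entries such as 'salzburg,' and 'needed]' in the list above still carry punctuation. As a minimal sketch of one way to tidy that up, the counting step could strip punctuation from both ends of each token and apply the threshold in a small helper. The function name frequent_words and the min_count parameter are made up for this illustration, and it works on any plain string, so no download is needed to try it:

```python
from collections import Counter
from string import punctuation


def frequent_words(text, min_count=5):
    """Return (word, count) pairs for words seen more than min_count times."""
    # Lowercase each whitespace-separated token and strip punctuation from
    # both ends, so "Music," and "music!" are counted as the same word.
    words = (token.strip(punctuation).lower() for token in text.split())
    counts = Counter(word for word in words if word)  # skip empty tokens
    # most_common() orders the pairs from most to least frequent.
    return [(word, n) for word, n in counts.most_common() if n > min_count]


sample = "Music, music! " * 6 + "opera " * 3
print(frequent_words(sample))  # [('music', 12)]
```

With the page text from the answer, the same helper could be called on the joined paragraph strings instead of the sample.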

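Votes: 10

Stack Overflow user

Answered 2014-06-24 21:37:59

So, this is coming from a newbie, but if you just need a quick answer, I think this might work. Please note that with this method you can't just feed a URL to the program; you have to paste the text into the code manually. (Sorry)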

text = '''INSERT TEXT HERE'''.split() #Where you see "INSERT TEXT HERE", that's where the text goes.
#also note the .split() method at the end. This converts the text into a list, splitting every word in between the spaces. 
#for example, "red dog food".split() would be ['red','dog','food']
overusedwords = [] #this is where the words that are used 5 or more times are going to be held.
for i in text: #this will iterate through every single word of the text
    if text.count(i) >= 5 and overusedwords.count(i) == 0: #(1. Read below)
        overusedwords.append(i) #this adds the word to the list of words used 5 or more times
if len(overusedwords) > 0: #if there are no words used 5 or more times, it doesn't print anything useless.
    print('The overused words are:')
    for i in overusedwords:
        print(i)
else:
    print('No words used 5 or more times.') #just in case there are no words used 5 or more times

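An explanation of the "text.count(i) >= 5" part: each time it goes through the for loop, it checks whether that particular word is used five or more times in the text. Then the "and overusedwords.count(i) == 0" part just makes sure the same word isn't added to the list of overused words twice. Hope I helped. I figure you probably wanted a way to get this information directly by typing in a URL, but this might help other beginners with a similar problem.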
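For longer texts, calling text.count(i) inside the loop rescans the whole list for every word. A rough sketch of the same idea with collections.Counter, which counts everything in one pass, might look like the following. The names overused_words and threshold are invented for this example, and it relies on Counter preserving insertion order (Python 3.7+):

```python
from collections import Counter


def overused_words(text, threshold=5):
    """Return words used at least `threshold` times, in first-seen order."""
    words = text.split()
    counts = Counter(words)  # one pass over the list of words
    # Counter keeps keys in insertion order (Python 3.7+), so each word is
    # listed once, in the order it first appears in the text.
    return [word for word, n in counts.items() if n >= threshold]


sample_text = "red dog food " * 5 + "cat"
for word in overused_words(sample_text):
    print(word)
```

Like the loop above, this keeps the "5 or more" rule and never lists a word twice.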

Votes: 3

Stack Overflow user

Answered 2014-06-24 21:45:27

I would do it like this:

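Install BeautifulSoup, as explained here.
You need these imports: from bs4 import BeautifulSoup, import re, from collections import Counter
Get the visible text on the site with BeautifulSoup, as explained on Stack Overflow here.
Get a list (lst) of words out of the visible text: lst = re.findall(r'\b\w+', visible_text_string)
Convert every word to lower case: lst = [x.lower() for x in lst]
Count the occurrences of each word and make a list of (word, count) tuples: counter = Counter(lst), then occs = [(word, count) for word, count in counter.items() if count > 5]
Sort occs by occurrence: occs.sort(key=lambda x: x[1])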
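Put together, the counting steps above could be sketched roughly as follows. This assumes the visible text has already been extracted (the BeautifulSoup step is left out), so a short stand-in string plays the role of visible_text_string; the wrapper function count_occurrences and its min_count parameter are added here only for illustration:

```python
import re
from collections import Counter


def count_occurrences(visible_text_string, min_count=5):
    """Return (word, count) tuples for words seen more than min_count times."""
    # Pull out runs of word characters, then normalise case.
    lst = re.findall(r"\b\w+", visible_text_string)
    lst = [x.lower() for x in lst]
    counter = Counter(lst)
    occs = [(word, count) for word, count in counter.items() if count > min_count]
    # Ascending sort by count; pass reverse=True to put the most common first.
    occs.sort(key=lambda x: x[1])
    return occs


# Stand-in for the visible page text (an assumption for this sketch).
sample_text = "The opera. The Opera! " * 3 + "A sonata."
print(count_occurrences(sample_text))  # [('the', 6), ('opera', 6)]
```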

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/24396406
