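I need to find and copy the words that appear more than 5 times on a given website, using Python 3 code, and I'm not sure how to do it. I've looked through the material here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far: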
from urllib.request import urlopen
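website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")
Does anyone have any advice on what to do? I have NLTK installed and I've looked into Beautiful Soup, but for the life of me I can't figure out how to install it correctly (I'm very green at Python)! Since I'm learning, any explanation would also be much appreciated. Thank you :)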
Posted on 2014-06-24 21:38:47
This isn't perfect, but it's an idea of how to get started using requests, BeautifulSoup and collections.Counter.
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
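r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid a warning
# join the text nodes of every <p> element into one string per paragraph
text = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
# split on whitespace, strip trailing punctuation, lowercase, and count each word
c = Counter((x.rstrip(punctuation).lower() for y in text for x in y.split()))
print(c.most_common())  # prints words starting with the most common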
[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............
print ([x for x in c if c.get(x) > 5]) # words appearing more than 5 times
['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
Posted on 2014-06-24 21:37:59
So, this is coming from a newbie, but if you just need a quick answer, I think this might work. Note that with this method you can't just hand the program a URL; you have to paste the text into the code manually. (Sorry.)
text = '''INSERT TEXT HERE'''.split() #Where you see "INSERT TEXT HERE", that's where the text goes.
#also note the .split() method at the end. This converts the text into a list, splitting every word in between the spaces.
#for example, "red dog food".split() would be ['red','dog','food']
overusedwords = [] #this is where the words that are used 5 or more times are going to be held.
for i in text: #this will iterate through every single word of the text
    if text.count(i) >= 5 and overusedwords.count(i) == 0: #(1. Read below)
        overusedwords.append(i) #this adds the word to the list of words used 5 or more times
if len(overusedwords) > 0: #if there are no words used 5 or more times, it doesn't print anything useless.
    print('The overused words are:')
    for i in overusedwords:
        print(i)
else:
    print('No words used 5 or more times.') #just in case there are no words used 5 or more times
An explanation of the "text.count(i) >= 5" part: each time it goes through the for loop, it checks whether that particular word is used five or more times in the text. Then the "and overusedwords.count(i) == 0" part just makes sure the same word isn't added to the list of overused words a second time. Hope I helped. I think you probably wanted a way to get this information directly by entering a URL, but this might help other beginners with a similar problem.
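On longer texts, calling text.count(i) inside the loop rescans the whole list for every word. Here is a minimal sketch of the same idea using collections.Counter, which tallies every word in a single pass. The function name overused_words, the min_count parameter and the sample text are made up for illustration and are not part of the answer above:

```python
from collections import Counter

def overused_words(text, min_count=5):
    """Return the words that occur at least min_count times, in first-seen order."""
    counts = Counter(text.split())  # one pass over the words
    # Counter keeps keys in insertion order (Python 3.7+), so the result
    # comes out in the same order the loop version would append them.
    return [word for word, n in counts.items() if n >= min_count]

sample = "la " * 5 + "de da " * 2  # hypothetical sample text
print(overused_words(sample))
```

To match the question's "more than 5 times" rather than "5 or more", pass min_count=6.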
Posted on 2014-06-24 21:45:27
I would do it like this:
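Get a list lst of the words in the visible text:
lst = re.findall(r'\b\w+', visible_text_string)
Count how often each word occurs:
counter = Counter(lst)
Build a list of (word, count) tuples for the words that occur more than 5 times:
occs = [(word, count) for word, count in counter.items() if count > 5]
Sort occs by count:
occs.sort(key=lambda x: x[1])
https://stackoverflow.com/questions/24396406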
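The re.findall, Counter and occs.sort steps from the last answer can be put together as a small runnable sketch. It does not cover fetching the page or extracting its visible text; the sample string stands in for that, and the function name frequent_words, the threshold parameter and the lowercasing step are assumptions added here for illustration:

```python
import re
from collections import Counter

def frequent_words(visible_text_string, threshold=5):
    """Return (word, count) pairs for words seen more than threshold times."""
    # Lowercase first so "The" and "the" are counted together (an added assumption).
    lst = re.findall(r'\b\w+', visible_text_string.lower())
    counter = Counter(lst)
    occs = [(word, count) for word, count in counter.items() if count > threshold]
    occs.sort(key=lambda x: x[1])  # ascending by count
    return occs

sample = "Mozart wrote music. " * 6 + "Vienna " * 3  # hypothetical stand-in for page text
print(frequent_words(sample))
```

Because \w+ only matches letters, digits and underscores, punctuation never ends up attached to a word, so no separate stripping step is needed. It also splits "mozart's" into "mozart" and "s", which may or may not be what you want.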