Dictionary task: select the required words, then order them

Stack Overflow user

Asked 2016-10-27 13:22:11

Answers: 2, Views: 133, Followers: 0, Score: 1

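Hello, fellow Stack Overflow users!

I am struggling with a task right now. Here it is:

Write a function sameword(u1, u2, enc, k) that:

1. Takes 2 URLs and enc = 'utf-8': u1 = 'http://...u1.' u2 = 'http://...u2.'
2. Finds the words of length k that occur on both pages, u1 and u2.
3. Counts how many times each of those words occurs on each page.
4. Returns a list of three-element groups: word (see point 2), occur1 (how many times the word occurs on page u1), occur2 (how many times the word occurs on page u2).
5. Returns the list sorted by the total number of occurrences on both pages.

Use this code to remove all non-alphabetic characters: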

import urllib.request as ul

def mywords(s):              # replace non-alphabetic characters with spaces
    for c in '''!?/-,():;--'.\\_[]"{}''':
        s = s.replace(c, ' ')
    return s.split()            # return a list of all words from the page

def myurl(u, enc):      # open the URL and return the words on the page
    p = ul.urlopen(u)
    t = p.read().decode(enc)    # read() returns bytes, so decode them with enc
    p.close()

    return mywords(t.lower())

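After that I ran into trouble with points 3-5 and got stuck (mainly because, when something does not work, I usually check my code online at pythontutor.com, but I cannot do that here because it does not support the urllib library).

Thanks!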

python

function

dictionary

web

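2 Answers

Stack Overflow user

Answered 2016-10-27 13:35:41

It seems to me that you need a test environment. I would install Python on your local machine and test the code in IDLE (the editor that ships with Python).

Some thoughts on your homework: the function myurl reads the content of the remote HTML file as if it were an ordinary file. You can put your own code between the read() and close() statements. You can iterate over the file line by line, and then word by word, looking for the words that match your requirements. You will probably want to strip the unwanted characters from the words first, using the function mywords.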
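The steps described above can be sketched roughly as follows. This is only a minimal illustration, not the required solution: count_words_of_length is a hypothetical helper name, the sample bytes stand in for whatever p.read() would return for a real page, and the punctuation list is a lightly cleaned copy of the one in the question.

```python
from collections import Counter


def mywords(s):
    # Same idea as the helper in the question: punctuation becomes spaces,
    # then the text is split into words.
    for c in '''!?/-,():;'.\\_[]"{}''':
        s = s.replace(c, ' ')
    return s.split()


def count_words_of_length(raw, enc, k):
    """Hypothetical helper: decode page bytes and count the words of length k."""
    text = raw.decode(enc).lower()   # read() returns bytes, so decode first
    return Counter(w for w in mywords(text) if len(w) == k)


# Stand-in for the bytes that p.read() would return for a real page.
sample = b"The cat sat; the dog ran. A cat!"
counts = count_words_of_length(sample, "utf-8", 3)
print(counts)
```

A Counter behaves like a dictionary that maps each word to its number of occurrences, which covers point 3 of the task for a single page.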

Hope this helps.

Score: 0

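Stack Overflow user

Answered 2016-10-29 00:35:10

It is still not clear what exactly your problem is, but I wrote a program that meets the specification, and it may serve as a basis for yours. Once you understand it, though, you should definitely implement it yourself.

OK, first: urllib changed drastically from Python 2 to Python 3. The HTMLParser class also lives in different places depending on the version. This means that if you want your code to be portable, you need to do some extra work. I chose to sidestep these problems by going in a different direction: fetching the pages with requests and parsing their content with BeautifulSoup. Obviously this is a trade-off, because you have to install third-party libraries:

In your shell: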

$ pip install requests beautifulsoup4

The benefit is that the code becomes very short:

webpage_words_counter_example.py:

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup


def get_words_from_url(url):
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()
    # returns a list of words.
    return re.sub(r'\W', ' ', soup.get_text()).lower().split()


def samewords(u1, u2, enc, k):
    # Retrieve content, filter words with specified 
    # length and initialize repetition counters.
    w1, w2 = map(Counter,
                 (filter(lambda word: len(word) == k, words)
                  for words in map(get_words_from_url,
                                   (u1, u2))))

    # Map all words to a list of tuples (word, count_w1, count_w2)
    # and sort this list by count_w1 and count_w2.
    return sorted(map(lambda x: (x, w1[x], w2[x]),
                      set(w1.keys()) | set(w2.keys())),
                  # disregard the word itself when sorting,
                  # considering only its occurrence in each text.
                  key=lambda x: x[1:],
                  # reversed array, so most frequent words come first.
                  reverse=True)


if __name__ == '__main__':
    word_count = samewords(
        'https://www.theguardian.com/environment/2016/oct/27/scheme-to-reopen-river-severn-to-fish-wins-almost-20m-in-funding',
        'https://www.theguardian.com/environment/2016/oct/27/world-on-track-to-lose-two-thirds-of-wild-animals-by-2020-major-report-warns',
        'utf-8',
        10
    )

    print('word count:', word_count)
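
Read strictly, the task asks for words that occur on both pages, ordered by their total count, whereas the program above keeps words found on either page (a set union) and sorts by the pair (count_w1, count_w2). A minimal sketch of the stricter reading is shown below. It works on plain word lists so it can be tried without network access. The name common_words, the descending order, and the alphabetical tie-break are assumptions made for illustration; the task does not specify them.

```python
from collections import Counter


def common_words(words1, words2, k):
    """Hypothetical helper: words of length k present in BOTH lists,
    returned as (word, occur1, occur2) tuples."""
    c1 = Counter(w for w in words1 if len(w) == k)
    c2 = Counter(w for w in words2 if len(w) == k)
    shared = c1.keys() & c2.keys()          # intersection, not union
    rows = [(w, c1[w], c2[w]) for w in shared]
    # Sort by combined count, highest first (assumed direction);
    # ties are broken alphabetically so the result is deterministic.
    return sorted(rows, key=lambda r: (-(r[1] + r[2]), r[0]))


page1 = "red fox red hen big cat".split()
page2 = "red hen hen hen old cat".split()
result = common_words(page1, page2, 3)
print(result)  # [('hen', 1, 3), ('red', 2, 1), ('cat', 1, 1)]
```

In the full program, the two word lists would come from get_words_from_url(u1) and get_words_from_url(u2).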

Score: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei engine for the IT domain.

Original link:

https://stackoverflow.com/questions/40285747
