Q: Recursive web crawler with BeautifulSoup

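Stack Overflow user

Asked 2018-03-05 22:12:59

Answers: 1 | Views: 1.8K | Followers: 0 | Votes: 2

I am trying to recursively crawl Wikipedia URLs and collect all the English article links. I want to perform a depth-first traversal to depth n, but for some reason my code does not seem to recurse properly on every pass. Any idea why?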

def crawler(url, depth):
    if depth == 0:
        return None
    links = bs.find("div",{"id" : "bodyContent"}).findAll("a" , href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))

    print ("Level ",depth," ",url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)

This is the call to the crawler:

url = "https://en.wikipedia.org/wiki/Harry_Potter"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
crawler(url,3)

web-crawler

python

beautifulsoup

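1 Answer

Stack Overflow user

Accepted answer

Answered 2018-03-06 04:31:39

You need to fetch the page source (send a request to the page) for every different URL. Your crawler() function is missing that part. Because those lines sit outside the function, they run only once and are not repeated on each recursive call, so every call searches the same parsed page.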
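To make the difference concrete, here is a minimal, network-free sketch. Everything in it is an assumption made for illustration: `PAGES` is a made-up link graph, and the hypothetical `fetch_links` stands in for the `urlopen` plus `BeautifulSoup` step. It only contrasts reading links from a page parsed once with reading links from the page currently being visited.

```python
# Made-up pages: each URL maps to the links found on that page.
PAGES = {
    "/wiki/A": ["/wiki/B"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],
}

def fetch_links(url):
    """Hypothetical stand-in for urlopen + BeautifulSoup: links on one page."""
    return PAGES[url]

# Parsed once, outside the function, like the global `bs` in the question.
shared_links = fetch_links("/wiki/A")

def crawl_shared(url, depth, seen):
    if depth == 0:
        return
    seen.append(url)
    for link in shared_links:        # always the links of /wiki/A
        crawl_shared(link, depth - 1, seen)

def crawl_per_page(url, depth, seen):
    if depth == 0:
        return
    seen.append(url)
    for link in fetch_links(url):    # links of the page being visited
        crawl_per_page(link, depth - 1, seen)

shared, per_page = [], []
crawl_shared("/wiki/A", 3, shared)
crawl_per_page("/wiki/A", 3, per_page)
print(shared)    # ['/wiki/A', '/wiki/B', '/wiki/B']
print(per_page)  # ['/wiki/A', '/wiki/B', '/wiki/C']
```

The shared version still recurses, but it never gets past the links of the first page, which matches the symptom described in the question.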

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

def crawler(url, depth):
    if depth == 0:
        return None

    html = urlopen(url)                        # You were missing 
    soup = BeautifulSoup(html, 'html.parser')  # these lines.

    links = soup.find("div",{"id" : "bodyContent"}).findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))

    print("Level ", depth, url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)

url = "https://en.wikipedia.org/wiki/Big_data"
crawler(url, 3)

Partial output:

Level  3 https://en.wikipedia.org/wiki/Big_data
Level  2 https://en.wikipedia.org/wiki/Big_Data_(band)
Level  1 https://en.wikipedia.org/wiki/Brooklyn
Level  1 https://en.wikipedia.org/wiki/Electropop
Level  1 https://en.wikipedia.org/wiki/Alternative_dance
Level  1 https://en.wikipedia.org/wiki/Indietronica
Level  1 https://en.wikipedia.org/wiki/Indie_rock
Level  1 https://en.wikipedia.org/wiki/Warner_Bros._Records
Level  1 https://en.wikipedia.org/wiki/Joywave
Level  1 https://en.wikipedia.org/wiki/Electronic_music
Level  1 https://en.wikipedia.org/wiki/Dangerous_(Big_Data_song)
Level  1 https://en.wikipedia.org/wiki/Joywave
Level  1 https://en.wikipedia.org/wiki/Billboard_(magazine)
Level  1 https://en.wikipedia.org/wiki/Alternative_Songs
Level  1 https://en.wikipedia.org/wiki/2.0_(Big_Data_album)
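Note that the output above lists the Joywave page twice, since the same article can be linked more than once from a page. If that is not wanted, one option is to track visited URLs. The following is only a sketch under stated assumptions, not part of the original answer: `get_links` is a hypothetical callable that returns the article links of a page (the role played by `urlopen` and `BeautifulSoup` in the answer), and `PAGES` is made-up data so the example runs without network access.

```python
def crawl(url, depth, get_links, visited=None, out=None):
    """Depth-limited, depth-first crawl that visits each URL at most once."""
    if visited is None:
        visited = set()
    if out is None:
        out = []
    if depth == 0 or url in visited:
        return out
    visited.add(url)
    out.append((depth, url))
    for link in get_links(url):
        crawl(link, depth - 1, get_links, visited, out)
    return out


# Made-up link graph standing in for real pages (note the repeated link).
PAGES = {
    "/wiki/A": ["/wiki/B", "/wiki/C", "/wiki/B"],
    "/wiki/B": ["/wiki/C", "/wiki/A"],
    "/wiki/C": [],
}

result = crawl("/wiki/A", 3, lambda u: PAGES.get(u, []))
for depth, url in result:
    print("Level", depth, url)
```

One trade-off of this design: a page first reached deep in the traversal is marked visited and is not expanded again if it is later reached at a shallower level, so some pages within the depth limit may be missed. If that matters, a breadth-first traversal is likely a better fit.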

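Votes: 1

Original page content provided by Stack Overflow; Chinese translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/49120376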
