Question: Web crawler using Python

Stack Overflow user

Asked 2018-02-13 12:12:15

Answers: 1, Views: 81, Followers: 0, Votes: 0

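Thank you for taking an interest in my question. I am currently studying computer science at university, and I believe I have a good grasp of Python programming. With that in mind, I am now learning full-stack development, and I would like to build a web crawler in Python (since I have heard it is well suited to this) that browses sites such as Manta and Tradesi looking for small businesses that have no website, so that I can contact their owners and do some pro bono work to start my web development career. The problem is that I have never built a web crawler in any language before, so I thought the helpful people on Stack Overflow could give me some insight into web crawlers, in particular how I should go about learning to build them and ideas on how to implement one for those specific sites.

Any input is appreciated. Thank you, and have a nice day/evening!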

python

web-applications

web-crawler

1 Answer

Stack Overflow user

Answered 2018-02-20 13:05:32

Here is one way to loop over a list of URLs and import data from each one.

import json
import urllib.request

base_dir = "C:/Users/rshuell001/Desktop/dates/"

# Read the list of dates, one per line.
# Assumption: each line has the form YYYY-MM-DD.
with open(base_dir + "dates.txt") as f:
    dateslist = f.read().split("\n")

for thedate in dateslist:
    thedate = thedate.strip()
    if not thedate:
        continue  # skip blank lines
    theyear, themonth, theday = thedate.split("-")

    url = ("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=" + themonth
           + "&day=" + theday + "&year=" + theyear)
    # Assumption: the response body is JSON with a "data_values" key.
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    datapoints = data["data_values"]

    # Write one comma-separated line per data point to <date>.txt.
    with open(base_dir + thedate + ".txt", "w") as myfile:
        for point in datapoints:
            myfile.write(thedate + "," + str(point[0]) + "," + str(point[1]) + "\n")
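
Building the query string by hand is easy to get wrong (a missing `+` or a stray space after `&`). As a minimal sketch, the standard library's `urllib.parse.urlencode` can assemble it instead. The helper name `build_daily_url` and the use of `datetime.date` are assumptions made for illustration; only the endpoint and the `month`, `day`, and `year` parameter names come from the script above.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://www.hockey-reference.com/friv/dailyleaders.cgi"


def build_daily_url(day):
    """Return the daily-leaders URL for a datetime.date (hypothetical helper)."""
    # urlencode joins the key/value pairs with "&" and percent-escapes them,
    # so the query string cannot pick up stray spaces.
    query = urlencode({"month": day.month, "day": day.day, "year": day.year})
    return BASE_URL + "?" + query


if __name__ == "__main__":
    print(build_daily_url(date(2018, 2, 13)))
```

Working from `date` objects also means the month, day, and year never have to be parsed out of a string by hand.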

import requests
from bs4 import BeautifulSoup

# Second example: walk a paginated listing and print each company entry.
# The page number is appended to base_url, so the URL must end with "page=".
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    # Each company listing sits in a <div class="jobs-item">.
    firme = zute_soup.find_all('div', {'class': 'jobs-item'})

    for title in firme:
        title1 = title.find_all('h6')[0].text
        # The first "description" div holds the address, the second the contact details.
        opisi = title.find_all('div', {'class': 'description'})
        adresa = opisi[0].text
        kontakt = opisi[1].text
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
        print(page_line)
        print('\n')
    current_page += 1
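
The loop above only prints what it finds. For the goal in the question (finding businesses that have no website), a rough sketch of the next step might filter the scraped records and save them to CSV. Everything here is an assumption for illustration: the record layout, the `website` field, and the helper names `without_website` and `write_csv` are hypothetical, and a real site would need its own parsing code to fill these fields.

```python
import csv
import io


def without_website(records):
    """Keep only records whose 'website' field is missing or blank."""
    return [r for r in records if not (r.get("website") or "").strip()]


def write_csv(records, out):
    """Write name/address/contact rows to an open text file object."""
    writer = csv.DictWriter(
        out,
        fieldnames=["name", "address", "contact"],
        extrasaction="ignore",   # drop keys such as 'website' that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


# Made-up sample data standing in for scraped results.
sample = [
    {"name": "Acme Bakery", "address": "1 Main St", "contact": "555-0100", "website": ""},
    {"name": "Bolt Repairs", "address": "2 Oak Ave", "contact": "555-0101", "website": "http://example.com"},
]

if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(without_website(sample), buffer)
    print(buffer.getvalue())
```

Passing a file opened with `open("leads.csv", "w", newline="")` instead of the `StringIO` buffer would write the same rows to disk.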

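Keep in mind that there are many, many ways to do this kind of thing, and every website is different, so whatever you end up with will be highly customized and specific to its intended use.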

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/48759311
