I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and the scraper works for about 80% of the pages, but whenever there is a picture at the top of the story the script stops pulling the time or the body text. What confuses me most is that soup.find and soup.find_all don't seem to produce different results. I've tried a variety of different tags to capture the text, as well as both 'lxml' and 'html.parser'.
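For what it's worth, here is a minimal sketch of the difference I expected to see between find and find_all (made-up HTML, not one of my actual pages): find returns the first matching tag, or None when nothing matches, while find_all always returns a list, which is simply empty when nothing matches.

from bs4 import BeautifulSoup

# toy snippet, not one of the real news pages
snippet = '<div><span class="time">2016-06-27</span></div>'
soup = BeautifulSoup(snippet, 'lxml')

print(soup.find('span', {'class': 'time'}))       # first match, or None if absent
print(soup.find_all('span', {'class': 'time'}))   # list of matches, [] if absent
print(soup.find('span', {'class': 'missing'}))    # None -> .get_text() would raise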
The code is below:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text()  # getting the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text()  # get the body with markup
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub markup
        bodies1.append(body)  # add to list for export
        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)

Here are some of the results I'm getting (the ones marked 'NoneType' are the ones where the title was pulled successfully but the body/time came back empty):
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
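As far as I can tell, this is roughly how the failure happens (a minimal sketch with made-up HTML, just my assumption about what the failing pages look like, not their real structure): on the pages with a photo at the top, soup.find() seems to return None for the time/body selectors, and calling .get_text() on None is what raises the error. A guard like the one below keeps the loop going, but it doesn't explain why those tags aren't found.

from bs4 import BeautifulSoup

# made-up page with a photo block and no <span class="time">
html = '<div id="title">Headline</div><div class="photo"><img src="x.jpg"/></div>'
soup = BeautifulSoup(html, 'lxml')

timetag = soup.find('span', {'class': 'time'})
if timetag is None:
    print('no time tag found on this page')  # what happens on the ~20% of pages
else:
    print(timetag.get_text())                # works on the other ~80%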
Any help would be greatly appreciated! Thanks.
EDIT: I don't have '10 reputation points', so I can't post any more links to test with, but I will comment with more page examples if you need them.
https://stackoverflow.com/questions/38099931