I want to do some NLP: classifying song lyrics by mood. Now, given a specific artist's lyrics page, for instance The Smiths, the front page lists all the song titles:
https://www.azlyrics.com/s/smiths.html
Reel Around The Fountain
You've Got Everything Now
...
Each title is a link to the actual lyrics page:
https://www.azlyrics.com/lyrics/smiths/reelaroundthefountain.html
https://www.azlyrics.com/lyrics/smiths/youvegoteverythingnow.html
Now, how can I scrape all the lyrics from https://www.azlyrics.com/lyrics/smiths/XXX.html, where XXX runs over the titles listed on the first page, https://www.azlyrics.com/s/smiths.html?
Thanks for your help! Either R or Python works; it doesn't really matter. Ideally, I'd like each song's lyrics saved in a separate *.txt file.
I tried this:
from bs4 import BeautifulSoup
import requests

titles = [title1, title2, .....]  # the song titles from the first page
for x in titles:
    url = "https://www.azlyrics.com/lyrics/smiths/{}.html".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for span in soup.findAll('span', attrs={'class': 'views-field views-field-created'}):
        print(span.get_text())
but it failed. However, it would work if the subsequent pages were numbered.
Posted on 2020-04-08 13:48:37
import requests
from bs4 import BeautifulSoup
# GET request to scrape the page for lyric links
r = requests.get('https://www.azlyrics.com/s/smiths.html')
# create soup
soup = BeautifulSoup(r.text, 'lxml')
# base url
url = 'https://www.azlyrics.com/'
# list comprehension to get all the links to the song lyrics
album_list = [url+a['href'].strip('..') for a in soup.find(id='listAlbum').findAll('a', href=True)]
for song in album_list:
    # do stuff with song, e.g.:
    # resp = requests.get(song)
    # song_soup = BeautifulSoup(resp.text, 'lxml')
    # etc.
    pass
Posted on 2020-04-08 14:00:58
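The commented-out loop can be completed with a small helper that pulls the lyrics text out of each song page. A minimal sketch, assuming the lyrics on azlyrics song pages sit in the first div that carries no attributes at all (the site does not label that div, so this selector is an assumption and may need adjusting if the markup changes):

```python
from bs4 import BeautifulSoup

def extract_lyrics(html):
    """Return the lyrics text from an azlyrics song page.

    Assumption: the lyrics live in the first <div> that has no
    attributes; adjust the selector if the site's markup differs.
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find(lambda tag: tag.name == 'div' and not tag.attrs)
    return div.get_text().strip() if div else ''

# Inside the loop above (network call, so shown commented out):
# resp = requests.get(song)
# lyrics = extract_lyrics(resp.text)
```

Extracting via "first attribute-less div" avoids hard-coding a positional index, but it is still tied to the current page layout.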
In R, we can use rvest.
First, we collect all the lyrics links.
library(rvest)
url <- "https://www.azlyrics.com/s/smiths.html"
all_links <- url %>%
read_html() %>%
html_nodes('div.listalbum-item a') %>%
html_attr('href') %>%
  {paste0('https://www.azlyrics.com/', sub('../', '', ., fixed = TRUE))}
Then fetch the lyrics from each page in all_links.
all_lyrics <- purrr::map(all_links, ~ .x %>%
  read_html() %>%
  html_nodes('div') %>%
  .[[20]] %>%
  html_text())
https://stackoverflow.com/questions/61101772
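Neither answer writes the results to disk, although the question asks for one *.txt file per song. A minimal Python sketch of that last step (the lyrics_filename/save_lyrics helpers and the lyrics/ output directory are my own names, not from the answers); the filename is derived from the last path component of the song URL:

```python
from pathlib import Path
from urllib.parse import urlparse

def lyrics_filename(url, out_dir='lyrics'):
    """Map a lyrics-page URL to a .txt path, e.g.
    .../lyrics/smiths/reelaroundthefountain.html -> lyrics/reelaroundthefountain.txt
    """
    stem = Path(urlparse(url).path).stem
    return Path(out_dir) / (stem + '.txt')

def save_lyrics(url, text, out_dir='lyrics'):
    """Write one song's lyrics to its own UTF-8 .txt file."""
    path = lyrics_filename(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return path
```

Calling save_lyrics(song_url, lyrics_text) once per scraped page then yields the separate per-song files the question asks for.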