I want to do some NLP: classifying song lyrics by mood. Now, given a specific artist's lyrics page, for instance The Smiths, the front page lists all the song titles:
https://www.azlyrics.com/s/smiths.html
Reel Around The Fountain
You've Got Everything Now
...
Each title is a link to the actual lyrics page:
https://www.azlyrics.com/lyrics/smiths/reelaroundthefountain.html
https://www.azlyrics.com/lyrics/smiths/youvegoteverythingnow.html
Now, how can I scrape all the lyrics from https://www.azlyrics.com/lyrics/smiths/XXX.html, where XXX runs over the titles listed on the first page, https://www.azlyrics.com/s/smiths.html?
Thanks for your help! Either R or Python works; it doesn't really matter. Ideally, I'd like each song's lyrics saved in a separate *.txt file.
I tried this:
from bs4 import BeautifulSoup
import requests

titles = [title1, title2, .....]  # the song titles from the first page
for x in titles:
    url = "https://www.azlyrics.com/lyrics/smiths/{}.html".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for span in soup.findAll('span', attrs={'class': 'views-field views-field-created'}):
        print(span.get_text())
but it failed. However, it would work if the subsequent pages were numbered.
Posted on 2020-04-08 13:48:37
import requests
from bs4 import BeautifulSoup
# GET request to scrape the page for lyric links
r = requests.get('https://www.azlyrics.com/s/smiths.html')
# create soup
soup = BeautifulSoup(r.text, 'lxml')
# base url
url = 'https://www.azlyrics.com/'
# list comprehension to get all the links to the song lyrics
album_list = [url+a['href'].strip('..') for a in soup.find(id='listAlbum').findAll('a', href=True)]
for song in album_list:
    # do stuff with song, e.g.:
    # resp = requests.get(song)
    # song_soup = BeautifulSoup(resp.text, 'lxml')
    # etc.
    pass
Posted on 2020-04-08 14:00:58
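The commented-out loop can be completed with a small helper that pulls the lyrics text out of each song page. A minimal sketch, assuming the lyrics on azlyrics song pages sit in the first div that carries no attributes at all (the site does not label that div, so this selector is an assumption and may need adjusting if the markup changes):

```python
from bs4 import BeautifulSoup

def extract_lyrics(html):
    """Return the lyrics text from an azlyrics song page.

    Assumption: the lyrics live in the first <div> that has no
    attributes; adjust the selector if the site's markup differs.
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find(lambda tag: tag.name == 'div' and not tag.attrs)
    return div.get_text().strip() if div else ''

# Inside the loop above (network call, so shown commented out):
# resp = requests.get(song)
# lyrics = extract_lyrics(resp.text)
```

Extracting via "first attribute-less div" avoids hard-coding a positional index, but it is still tied to the current page layout.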
In R, we can use rvest.
First, we collect all the lyrics links.
library(rvest)
url <- "https://www.azlyrics.com/s/smiths.html"
all_links <- url %>%
read_html() %>%
html_nodes('div.listalbum-item a') %>%
html_attr('href') %>%
  {paste0('https://www.azlyrics.com/', sub('../', '', ., fixed = TRUE))}
Then fetch the lyrics from each page in all_links.
all_lyrics <- purrr::map(all_links, ~ .x %>%
  read_html() %>%
  html_nodes('div') %>%
  .[[20]] %>%
  html_text())
https://stackoverflow.com/questions/61101772
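Neither answer writes the results to disk, although the question asks for one *.txt file per song. A minimal Python sketch of that last step (the lyrics_filename/save_lyrics helpers and the lyrics/ output directory are my own names, not from the answers); the filename is derived from the last path component of the song URL:

```python
from pathlib import Path
from urllib.parse import urlparse

def lyrics_filename(url, out_dir='lyrics'):
    """Map a lyrics-page URL to a .txt path, e.g.
    .../lyrics/smiths/reelaroundthefountain.html -> lyrics/reelaroundthefountain.txt
    """
    stem = Path(urlparse(url).path).stem
    return Path(out_dir) / (stem + '.txt')

def save_lyrics(url, text, out_dir='lyrics'):
    """Write one song's lyrics to its own UTF-8 .txt file."""
    path = lyrics_filename(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return path
```

Calling save_lyrics(song_url, lyrics_text) once per scraped page then yields the separate per-song files the question asks for.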