Web scraper in R with title and summary

Stack Overflow user
Asked on 2017-06-15 22:11:22
Answers: 1 · Views: 89 · Followers: 0 · Votes: 0

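I am trying to extract the links from here, along with the article title and a brief summary for each link. The output should contain the article title and a brief summary of each article, both of which appear on the same page.

I am able to get the links. Could you suggest how I can get the title and summary for each link? Please see my code below.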

# Install rvest once if it is not already available
# install.packages('rvest')

# Load the rvest package (xml2 provides read_html)
library('rvest')
library(xml2)

# URL of the website to be scraped
url <- 'http://money.howstuffworks.com/business-profiles.htm'

# Download and parse the page
pg <- read_html(url)

# Extract the href attribute of every link on the page
head(html_attr(html_nodes(pg, "a"), "href"))

1 Answer

Stack Overflow user

Answered on 2017-06-15 22:32:36

We can use purrr to iterate over each node and extract the relevant information:

library(rvest)
library(purrr)

url <- 'http://money.howstuffworks.com/business-profiles.htm'
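# Select one node per article card, then build a one-row data frame
# for each card; map_df() row-binds the results.
articles <- read_html(url) %>% 
    html_nodes('.infinite-item > .media') %>% 
    map_df(~{
        # article title
        title <- .x %>% 
            html_node('.media-heading > h3') %>% 
            html_text()

        # brief summary (first paragraph of the card)
        head <- .x %>% 
            html_node('p') %>% 
            html_text()

        # link to the full article
        link <- .x %>% 
            html_node('p > a') %>% 
            html_attr('href')

        data.frame(title, head, link, stringsAsFactors = FALSE)
    })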

head(articles)
#>                                                             title
#> 1                              How Amazon Same-day Delivery Works
#> 2              10 Companies That Completely Reinvented Themselves
#> 3                                10 Trade Secrets We Wish We Knew
#> 4                                           How Kickstarter Works
#> 5                          Can you get rich selling stuff online?
#> 6 Are the Golden Arches really supposed to be giant french fries?
#>                                                                                                                                                           head
#> 1                 The Amazon same-day delivery service aims to get your package to you in no time at all. Learn how Amazon same-day delivery works. See more »
#> 2 You might be surprised at what some of today's biggest companies used to do. Here are 10 companies that reinvented themselves from HowStuffWorks. See more »
#> 3              Trade secrets are often locked away in corporate vaults, making their owners a fortune. Which trade secrets are the stuff of legend? See more »
#> 4        Kickstarter is a service that utilizes crowdsourcing to raise funds for your projects. Learn about how Kickstarter works at HowStuffWorks. See more »
#> 5                                                   Can you get rich selling your stuff online? Find out more in this article by HowStuffWorks.com. See more »
#> 6     Are McDonald's golden arches really suppose to be giant french fries? Check out this article for a brief history of McDonald's golden arches. See more »
#>                                                                    link
#> 1           http://money.howstuffworks.com/amazon-same-day-delivery.htm
#> 2 http://money.howstuffworks.com/10-companies-reinvented-themselves.htm
#> 3                   http://money.howstuffworks.com/10-trade-secrets.htm
#> 4                        http://money.howstuffworks.com/kickstarter.htm
#> 5    http://money.howstuffworks.com/can-you-get-rich-selling-online.htm
#> 6                   http://money.howstuffworks.com/mcdonalds-arches.htm
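
The same per-node idea can be tried offline. The following is a minimal sketch, not the answer's own code: it parses a small inline HTML string instead of the live page, and it relies on `html_node()` returning one match per element when applied to a node set, so purrr is not needed. The markup and the `example.com` URLs are hypothetical and only mimic the structure that the selectors above assume; the column name `summary` and the regular expression that strips the trailing "See more" text are also illustrative choices.

```r
library(rvest)

# Hypothetical markup mimicking the structure the selectors assume;
# the real page may differ.
html <- '
<div class="infinite-item">
  <div class="media">
    <div class="media-heading"><h3>How Kickstarter Works</h3></div>
    <p>Kickstarter is a crowdfunding service. <a href="http://example.com/kickstarter.htm">See more</a></p>
  </div>
</div>
<div class="infinite-item">
  <div class="media">
    <div class="media-heading"><h3>10 Trade Secrets We Wish We Knew</h3></div>
    <p>Trade secrets are often locked away. <a href="http://example.com/10-trade-secrets.htm">See more</a></p>
  </div>
</div>'

# One node per article card
items <- read_html(html) %>% html_nodes(".infinite-item > .media")

# html_node() on a node set yields exactly one result per card,
# so the three columns stay aligned row by row.
articles <- data.frame(
  title   = items %>% html_node(".media-heading > h3") %>% html_text(trim = TRUE),
  summary = items %>% html_node("p") %>% html_text(trim = TRUE),
  link    = items %>% html_node("p > a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)

# Drop the trailing "See more" link text from each summary
articles$summary <- sub("\\s*See more.*$", "", articles$summary)

print(articles)
```

Because each column comes from the same node set, a card that lacks a title or link would get `NA` in that column rather than shifting the rows out of alignment.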

Obligatory note: in this case I did not see a disclaimer against harvesting in their Terms and conditions, but always check a site's terms before scraping it.

Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/44569777
