Q: Web scraping from an HTML table using rvest

Stack Overflow user

Asked 2018-06-16 01:29:11

Answers: 1 · Views: 574 · Followers: 0 · Votes: 0

I am new to web scraping, and I am trying to scrape the following table:

                    <table class="dp-firmantes table table-condensed table-striped">
                        <thead>
                            <tr>
                                <th>FIRMANTE</th>
                                <th>DISTRITO</th>
                                <th>BLOQUE</th>
                            </tr>
                        </thead>
                        <tbody>

                            <tr>
                                <td>ROMERO, JUAN CARLOS</td>
                                <td>SALTA</td>
                                <td>JUSTICIALISTA 8 DE OCTUBRE</td>
                            </tr>
                            <tr>
                                <td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td>
                                <td>SALTA</td>
                                <td>PARES</td>
                            </tr>
                            </tbody>
                    </table>

I am using the rvest package, with the following code:

link <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
table <- html_nodes(link, 'table.dp-firmantes table table-condensed table-striped')

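But when I inspect the 'table' object in R, I get the following error: {xml_nodeset (0)}

My hunch is that I am not actually capturing any HTML content from the table, but I don't know how to fix this or why it happens. I am not sure whether the mistake is in my R code, whether I am using the wrong CSS selector, or whether this might be JavaScript rather than HTML. Please tell me what I am doing wrong.

Edit: this is the link I am using: https://www.hcdn.gob.ar/proyectos/resultados-buscador.html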

Edited: screenshot of the search results table

javascript

html

web-scraping

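1 Answer

Stack Overflow user

Answered 2018-06-16 06:34:13

You can try the following code to parse those bills that contain the "Listado de Autores" table. For example, bill n. 820/18 (link = http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL) has this table, but I searched the first 500 bills online and did not find any other bill containing this kind of data.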

library(tidyverse)
library(rvest)

# Download and parse the page for bill 820/18
html_object <- read_html('http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL')

html_object %>%
  html_node(xpath = "//div[@id = 'Autores']/table") %>%  # This is the XPath address that worked for me. The CSS locator you provided did not work.
  html_table() %>%
  as_tibble() %>%  # Get the HTML table and store it in a tibble (as_tibble() replaces the deprecated as_data_frame())
  mutate(X1 = gsub("\\n|\\t|  ", "", X1))  # Remove the extra line breaks (\\n), tabs (\\t), and double spaces ("  ") present in the HTML table.

Result:

# A tibble: 2 x 2
  X1
  <chr>
1 Romero, Juan Carlos
2 Fiore Viñuales, María Cristina Del Valle
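
As a side note on the CSS selector from the question: a space in a CSS selector is the descendant combinator, so 'table.dp-firmantes table table-condensed table-striped' asks for elements named table-striped nested inside other elements, not for one table carrying four classes. The sketch below is only an illustration under assumptions: it parses an inline copy of the markup posted in the question (not the live site), and the variable names snippet, wrong, right, and firmantes are made up for the example.

```r
library(rvest)

# Inline copy of the markup from the question, so the example runs offline.
snippet <- '
<table class="dp-firmantes table table-condensed table-striped">
  <thead>
    <tr><th>FIRMANTE</th><th>DISTRITO</th><th>BLOQUE</th></tr>
  </thead>
  <tbody>
    <tr><td>ROMERO, JUAN CARLOS</td><td>SALTA</td><td>JUSTICIALISTA 8 DE OCTUBRE</td></tr>
    <tr><td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td><td>SALTA</td><td>PARES</td></tr>
  </tbody>
</table>'
page <- read_html(snippet)

# Spaces mean "descendant of": this looks for a <table-striped> element inside
# a <table-condensed> element inside a <table> inside table.dp-firmantes.
wrong <- html_nodes(page, "table.dp-firmantes table table-condensed table-striped")
length(wrong)  # no nodes match, as with the {xml_nodeset (0)} in the question

# Chain the classes with dots; "table.dp-firmantes" alone would also match.
right <- html_nodes(page, "table.dp-firmantes.table.table-condensed.table-striped")
firmantes <- html_table(right[[1]])
firmantes
```

Fixing the selector only helps if the table is actually present in the HTML that read_html() downloads; against the live page the selector may still return nothing for other reasons.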

Edit: screenshot of the HTML that R captures via read_html('https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2')

Votes: 1
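
The question also asked whether the table might be produced by JavaScript rather than plain HTML. read_html() only sees the markup the server sends and does not run scripts, so one rough way to check is to test whether the selector matches anything in the downloaded document. The following is a minimal sketch under assumptions: has_node is a hypothetical helper, and static_page and dynamic_page are invented stand-ins, not real captures of the site.

```r
library(rvest)

# Hypothetical helper: TRUE if `selector` matches at least one node in `page`.
has_node <- function(page, selector) {
  length(html_nodes(page, selector)) > 0
}

# Invented stand-in for a response where the table is in the served HTML.
static_page <- read_html(
  '<html><body><table class="dp-firmantes"><tr><td>x</td></tr></table></body></html>'
)

# Invented stand-in for a response where a script would build the table later.
dynamic_page <- read_html(
  '<html><body><div id="app"></div><script>/* table built in the browser */</script></body></html>'
)

has_node(static_page, "table.dp-firmantes")
has_node(dynamic_page, "table.dp-firmantes")
```

If the check fails for a real page even with a correct selector, the table is probably not part of the downloaded HTML (for instance, it is loaded after a search is submitted or a detail page is opened), and a different URL or tool would be needed.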

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50880250
