Using the following code:
library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
clear.text <- gettxt(url)
I get:
clear.text
[1] "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Turing\" redirects here. For other uses, see Turing (disambiguation).\n\nmathematician and computer scientist\n\nAlan Turing\n\nOBE FRS\n\nTuring aged 16\n\nBorn (1912-06-23)23 June 1912\n\nM...
I would like to store this data in a tidy object, for example:
tidy.text <- tidy(clear.text)
But I get this warning:
'tidy.character' is deprecated.
and the result is:
# A tibble: 1 x 1
x
<chr>
1 "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJum
So how can I convert plain text like this into a tidy format?
Thanks for any suggestions.
Posted on 2018-12-19 11:55:55
If you have a Wikipedia link or other HTML, the unnest_tokens() function in tidytext can parse and tidy it directly.
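library(tidytext)
library(tidyverse)
# Read the raw HTML of the page, one element per line
read_lines("https://en.wikipedia.org/wiki/Alan_Turing") %>%
  # tibble() replaces data_frame(), which is deprecated in current tibble releases
  tibble(text = .) %>%
  # format = "html" asks unnest_tokens() to parse the input as HTML, giving one word per row
  unnest_tokens(word, text, format = "html")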
#> # A tibble: 15,460 x 1
#> word
#> <chr>
#> 1 alan
#> 2 turing
#> 3 wikipedia
#> 4 this
#> 5 is
#> 6 a
#> 7 good
#> 8 article
#> 9 follow
#> 10 the
#> # ... with 15,450 more rows
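If the text has already been extracted, as with clear.text from gettxt() in the question, the same idea should work without the HTML step. The following is a minimal sketch, not the answer's own code: the string is a shortened, hypothetical stand-in for the real gettxt() result, and tidy.text is simply the name used in the question.

```r
library(dplyr)     # provides tibble() and the %>% pipe
library(tidytext)  # provides unnest_tokens()

# Hypothetical stand-in for the string returned by gettxt(url)
clear.text <- "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nmathematician and computer scientist"

# Wrap the character vector in a one-column tibble,
# then split it into one lowercase word per row
tidy.text <- tibble(text = clear.text) %>%
  unnest_tokens(word, text)

print(tidy.text)
```

Wrapping the string in tibble() by hand avoids the deprecated tidy.character method, and unnest_tokens() with its default settings lowercases the words and drops punctuation.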
https://stackoverflow.com/questions/-100006325