I want to download a text from Project Gutenberg, and I have written the following code:
setwd("D:\\sourceCode")
TEXTFILE = "pg100.txt"
if (!file.exists(TEXTFILE)) {
download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE)
}
shakespeare = readLines(TEXTFILE)
The problem is that I get the following messages:
Warning messages:
1: In readLines(TEXTFILE) : invalid or incomplete compressed data
2: In readLines(TEXTFILE) : incomplete final line found on 'pg100.txt'
I am actually following this tutorial:
https://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
Then, when I try to get the length of the document with the following command:
length(shakespeare)
the result I get is:
[1] 55
But according to the tutorial linked above, the result should be:
[1] 124787
What is going wrong? Thanks.
Answered 2018-05-02 05:49:41
The downloaded file is a gzip archive, not a plain txt file.
Either decompress it manually, or run:
TEXTFILE = "pg100.txt.gz"
if (!file.exists(TEXTFILE)) {
download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE)
}
shakespeare = readLines(gzfile(TEXTFILE))
head(shakespeare)
#[1] "The Project Gutenberg EBook of The Complete Works of William Shakespeare, by"
#[2] "William Shakespeare"
#[3] ""
#[4] "This eBook is for the use of anyone anywhere at no cost and with"
#[5] "almost no restrictions whatsoever. You may copy it, give it away or"
#[6] "re-use it under the terms of the Project Gutenberg License included"
length(shakespeare)
#[1] 124787
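One way to confirm this diagnosis before changing any code is to look at the first two bytes of the file: gzip data starts with the magic number 0x1f 0x8b. The sketch below is only an illustration. `is_gzip` is a hypothetical helper name, and the demo builds its own temporary files instead of touching pg100.txt.

```r
# Return TRUE if the file at `path` starts with the gzip magic number 0x1f 0x8b.
# `is_gzip` is a hypothetical helper, not a base R function.
is_gzip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  length(magic) == 2L && identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Demo on temporary files: one gzip-compressed, one plain text.
gz_path <- tempfile(fileext = ".txt.gz")
gz_con <- gzfile(gz_path, open = "wt")
writeLines(c("first line", "second line"), gz_con)
close(gz_con)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)

gz_is_gzip <- is_gzip(gz_path)    # compressed file
txt_is_gzip <- is_gzip(txt_path)  # plain text file
```

Calling `is_gzip("pg100.txt")` on the downloaded file would show whether it is compressed despite its .txt name.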
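Update
On Windows, it seems you need to set binary transfer mode explicitly (the file in question is not a text file but a binary archive, and writing it in text mode corrupts it):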
TEXTFILE = "pg100.txt.gz"
if (!file.exists(TEXTFILE)) {
download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE, mode = "wb")
}
shakespeare = readLines(gzfile(TEXTFILE))
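A damaged download only surfaces as warnings from readLines, which are easy to overlook, so it can help to turn those warnings into errors. The following is a minimal sketch under that assumption. `read_lines_strict` is a hypothetical helper, not part of base R, and the demo uses a temporary file rather than the real download. Note that it is deliberately strict: it also rejects a file whose last line has no final newline.

```r
# Read a (possibly gzip-compressed) text file and fail on any warning,
# for example "invalid or incomplete compressed data".
# `read_lines_strict` is a hypothetical helper name.
read_lines_strict <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  tryCatch(
    readLines(con),
    warning = function(w) {
      stop("could not read '", path, "' cleanly: ",
           conditionMessage(w), call. = FALSE)
    }
  )
}

# Demo: a small, valid gzip file is read back unchanged.
demo_path <- tempfile(fileext = ".txt.gz")
demo_con <- gzfile(demo_path, open = "wt")
writeLines(c("first line", "second line"), demo_con)
close(demo_con)

demo_lines <- read_lines_strict(demo_path)
```

With this in place, `shakespeare <- read_lines_strict(TEXTFILE)` would stop immediately on a broken file instead of silently returning 55 lines.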
Answered 2018-05-02 05:47:59
Just include the mode argument, as follows:
download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE, mode = "wb")
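Answered 2018-05-02 17:02:08
Since you are trying to read a file from http://www.gutenberg.org, you can use the gutenbergr package. The tutorial you refer to is quite old, and this package did not exist when it was written. The advantage of the package is that you do not run into download/readLines problems that depend on your operating system.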
library(gutenbergr)
# Use this if you know the Gutenberg ID (the EBook-No. on the Gutenberg website).
# Otherwise use other gutenbergr functions to find authors and works; see the vignette for more info.
# The complete works of Shakespeare is EBook-No. 100.
shakespeare <- gutenberg_download(100)
head(shakespeare)
# A tibble: 6 x 2
#   gutenberg_id text
#          <int> <chr>
# 1          100 Shakespeare
# 2          100 ""
# 3          100 *This Etext has certain copyright implications you should read!*
# 4          100 ""
# 5          100 <<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
# 6          100 SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
https://stackoverflow.com/questions/50124085