Q: Subsetting a corpus based on the content of its text files

Stack Overflow user

Asked on 2016-03-24 12:37:09

3 answers, 1.6K views, 0 followers, score 2

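I am using R and the tm package to do some text analysis. I am trying to build a subset of a corpus based on whether a certain expression is found in the content of the individual text files.

I created a corpus of 20 text files (thanks to lukeA for this example):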

reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

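I now want to select only those text files that contain the string "price reduction" and create a subset corpus from them.

From inspecting the first text file of the corpus, I know that at least one text file contains that string: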

writeLines(as.character(corp[[1]]))

How can I do this?

corpus

3 Answers

Stack Overflow user

Accepted answer

Posted on 2016-03-24 15:41:39

Here is one way, using tm_filter:

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

( corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE))) )
# <<VCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 1

cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
#     The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
#     "The price reduction today was made in the light of falling   # <=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
#     Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
#  Reuter

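How did I get there? By looking at the package vignette, searching for "subset", and then looking at the examples for tm_filter mentioned there (help: ?tm_filter). It may also be worth checking ?grep for the pattern-matching options.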
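As a small illustration of the pattern-matching options documented in ?grep, here is a base-R sketch. The strings in docs are made up for the example and are not taken from the crude data; only grepl() and its fixed and ignore.case arguments are used.

```r
# Hypothetical sample texts (assumption: not from the crude corpus)
docs <- c(
  "The price reduction today was made in light of falling prices.",
  "Price Reduction announced by the company.",
  "Prices were cut, a reduction of 1.50 dlrs."
)

# fixed = TRUE: literal, case-sensitive substring match
literal <- grepl("price reduction", docs, fixed = TRUE)

# ignore.case = TRUE: regular-expression match that ignores case
nocase <- grepl("price reduction", docs, ignore.case = TRUE)

print(literal)
print(nocase)
```

With fixed = TRUE only the first string matches, because the second one is capitalised differently; with ignore.case = TRUE the first two match. Which behaviour is wanted depends on the corpus.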

Score 2

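Stack Overflow user

Posted on 2016-03-24 21:27:50

Here is a simpler way using the quanteda package, one that is more consistent with reusing existing methods already defined for other R objects. quanteda has a subset method for corpus objects (corpus_subset() in the code below) that works like the subset method for a data.frame, but selects on logical vectors, including document variables defined in the corpus. Below, I use the texts() method for corpus objects to extract the texts from the corpus, and pass the result to grepl() to search for your pair of words.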

require(tm)
data(crude)

require(quanteda)
# corpus constructor recognises tm Corpus objects 
(qcorpus <- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub <- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.

# see the context
## kwic(qcorpus, "price reduction")
##                       contextPre         keyword             contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the

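Note: I separated the two words with \s+ (written "\\s+" inside an R string) in the regular expression, because they could be separated by some run of spaces, tabs, or newlines rather than by just a single space.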
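To make the difference concrete, here is a minimal base-R sketch. The two strings are invented for illustration: in one of them the phrase is broken across a line, as can happen in wrapped newswire text.

```r
# Hypothetical inputs (assumption: made up for this example)
wrapped <- "said the price\nreduction was made today"   # words split by a newline
single  <- "said the price reduction was made today"    # words split by one space
x <- c(wrapped, single)

# Literal match on a single space misses the wrapped text
by_literal <- grepl("price reduction", x, fixed = TRUE)

# \\s+ matches any run of whitespace, including the newline
by_regex <- grepl("price\\s+reduction", x)

print(by_literal)
print(by_regex)
```

The literal search finds only the second string, while the \\s+ pattern finds both.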

Score 4

Stack Overflow user

Posted on 2016-03-24 19:53:15

@lukeA's solution works. I want to give another solution that I prefer.

library(tm)

reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

# one TRUE/FALSE per document: does it contain the literal phrase?
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed = TRUE)))

# store the flag in each document's metadata under the tag "mySubset"
for (i in seq_along(corp))
  corp[[i]]$meta["mySubset"] <- corpTF[i]

# keep only the documents whose flag is TRUE
idx <- meta(corp, tag = "mySubset") == 'TRUE'
filtered <- corp[idx]

cat(content(filtered[[1]]))

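By using a metadata tag, every element of the corpus carries a selection tag, mySubset: the value 'TRUE' marks the documents we selected and the value 'FALSE' marks the rest.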
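If the flags are only needed for subsetting and not as stored metadata, the logical index can also be computed directly. The sketch below assumes tm is installed and uses tm_index(), which tm exports alongside tm_filter(). The three-document corpus and the helper name has_phrase are hypothetical and are not part of the answers above.

```r
library(tm)

# Hypothetical three-document corpus built in memory (not the crude data)
texts <- c("The price reduction today was made.",
           "Crude oil output rose.",
           "No change in posted prices.")
corp_demo <- VCorpus(VectorSource(texts))

# Hypothetical helper: TRUE if the document contains the literal phrase
has_phrase <- function(x) any(grepl("price reduction", content(x), fixed = TRUE))

# One logical value per document
idx_demo <- tm_index(corp_demo, has_phrase)

# Logical indexing keeps only the matching documents
filtered_demo <- corp_demo[idx_demo]

print(idx_demo)
print(length(filtered_demo))
```

The same logical vector could instead be written into the documents' metadata, as in the loop above, when the selection needs to travel with the corpus.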

Score 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/36200387
