
Here's one way using <code>tm_filter</code>: 

<pre><code>library(tm)
reut21578 &lt;- system.file("texts", "crude", package = "tm")
corp &lt;- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

( corp_sub &lt;- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE))) )
# &lt;&lt;VCorpus&gt;&gt;
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 1

cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
# The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
# "The price reduction today was made in the light of falling # &lt;=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
# Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
# Reuter
</code></pre>

How did I get there? By looking into the <a href="https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf" rel="nofollow">package's vignette</a>, searching for "subset", and then looking at the examples for <code>tm_filter</code> (help: <code>?tm_filter</code>), which is mentioned there. It is also worth reading <code>?grep</code> for the pattern-matching options.
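
As a small illustration of those pattern-matching options, here is a base-R sketch on a made-up character vector (<code>docs</code> is an assumption for demonstration, not part of the corpus). It shows that <code>fixed = TRUE</code> matches the literal string, how <code>ignore.case</code> changes the result, and that <code>grepl()</code> returns the logical vector directly while <code>grep()</code> returns indices:

<pre><code># Hypothetical example texts, not taken from the crude corpus
docs &lt;- c("The price reduction today was modest.",
          "Price Reduction announced for crude oil.",
          "Prices were unchanged.")

# Literal, case-sensitive match
literal &lt;- grepl("price reduction", docs, fixed = TRUE)

# Regular-expression match that ignores case
# (ignore.case is ignored, with a warning, when fixed = TRUE)
nocase &lt;- grepl("price reduction", docs, ignore.case = TRUE)

# grep() returns the indices of the matching elements instead
hits &lt;- grep("price reduction", docs, fixed = TRUE)

print(literal)
print(nocase)
print(hits)
</code></pre>

Inside the filter function, <code>any(grepl(...))</code> can therefore stand in for <code>any(grep(...))</code>.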


@lukeA's solution works. I want to give another solution I prefer.

<pre><code> library(tm)

 reut21578 &lt;- system.file("texts", "crude", package = "tm")
 corp &lt;- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

 corpTF &lt;- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))

 # tag every document with the result of the search
 for (i in seq_along(corp)) {
   corp[[i]]$meta["mySubset"] &lt;- corpTF[i]
 }

 idx &lt;- meta(corp, tag ="mySubset") == 'TRUE'
 filtered &lt;- corp[idx]

 cat(content(filtered[[1]]))
</code></pre>

The advantage of this solution is that, by using meta tags, every element of the corpus carries a selection tag <code>mySubset</code>: <code>TRUE</code> for the documents we selected and <code>FALSE</code> for the rest.
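
A related sketch, assuming that corpus-level indexed metadata is acceptable instead of per-document local metadata: <code>meta(corpus, tag) &lt;- value</code> stores one value per document in the corpus metadata data frame, which makes the flags easy to inspect in one place. The name <code>has_phrase</code> is mine, and the example uses the <code>crude</code> data set shipped with tm rather than the XML files:

<pre><code>library(tm)
data("crude")  # 20 Reuters documents shipped with tm

# One logical per document: does its text contain the phrase?
has_phrase &lt;- vapply(
  crude,
  function(doc) any(grepl("price reduction", content(doc), fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

# Store the flag as document-level (indexed) metadata on the corpus
meta(crude, "mySubset") &lt;- has_phrase

# The flag is now a column of the corpus metadata data frame
print(head(meta(crude)))

# Keep only the flagged documents
filtered &lt;- crude[meta(crude, "mySubset")$mySubset]
print(length(filtered))
</code></pre>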


Here's a simpler way using the quanteda package, one that is more consistent with how existing methods for other R objects are reused. quanteda has a <code>corpus_subset()</code> function for corpus objects that works just like <code>subset()</code> for a <code>data.frame</code>: it selects on a logical vector, which may include document variables defined in the corpus. Below, I extract the texts from the corpus with the <code>texts()</code> method for corpus objects and pass them to <code>grepl()</code> to search for your pair of words.

<pre><code>require(tm)
data(crude)

require(quanteda)
# corpus constructor recognises tm Corpus objects 
(qcorpus &lt;- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub &lt;- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.

# see the context
## kwic(qcorpus, "price reduction")
## contextPre keyword contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the
</code></pre>
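
In more recent quanteda releases (version 3 and later, as far as I know), <code>texts()</code> has been deprecated in favour of <code>as.character()</code>, and <code>kwic()</code> expects a tokens object. A hedged sketch of the same steps under that assumption, untested against every version:

<pre><code>library(tm)
data("crude")
library(quanteda)

# corpus constructor recognises tm Corpus objects
qcorpus &lt;- corpus(crude)

# as.character() extracts the texts from a quanteda corpus
qcorpussub &lt;- corpus_subset(qcorpus,
                            grepl("price\\s+reduction", as.character(qcorpus)))
print(ndoc(qcorpussub))

# keyword-in-context on a tokens object; phrase() marks a multi-word pattern
print(kwic(tokens(qcorpus), pattern = phrase("price reduction")))
</code></pre>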

Note: I joined the two words in your pattern with "\s+" because the text could contain some variation of spaces, tabs, or newlines between them instead of just a single space.
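
To make the difference concrete, here is a minimal base-R sketch on made-up strings (the vector <code>texts_demo</code> is hypothetical) in which the two words are separated by a space, a newline, and a tab:

<pre><code># Hypothetical texts with different whitespace between the two words
texts_demo &lt;- c("the price reduction today",
                "the price\nreduction today",
                "the price\treduction today")

# A fixed pattern only matches the single-space variant
single_space &lt;- grepl("price reduction", texts_demo, fixed = TRUE)

# "\\s+" matches one or more whitespace characters of any kind
any_space &lt;- grepl("price\\s+reduction", texts_demo)

print(single_space)
print(any_space)
</code></pre>

This matters for news text like the Reuters documents, where a phrase can be wrapped across two lines.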

I'm using R and the tm package to do some text analysis. 
I'm trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files. 

I create a corpus with 20 textfiles (thank you lukeA for this example): 

<pre><code>reut21578 &lt;- system.file("texts", "crude", package = "tm")
corp &lt;- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
</code></pre>

I now would like to select only those textfiles that contain the string "price reduction" to create a subset-corpus. 

Inspecting the first text file of the corpus, I know that there is at least one text file containing that string: 

<pre><code>writeLines(as.character(corp[[1]]))
</code></pre>

How would I best go about doing this?

Subsetting a corpus based on content of textfile
