Q: Is there any way to split quanteda tokens into n equal parts?

Stack Overflow user

Asked on 2020-11-24 15:41:05

Answers: 1 | Views: 134 | Followers: 0 | Votes: 1

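I am doing text analysis with the quanteda package in R.

I have tokenized a set of text documents. Each text consists of a different number of tokens. I would like to split the tokens of each text into N equal chunks (for example, 10 or 20 chunks per text, each containing the same number of tokens).

Suppose my data is called text_docs and looks like this: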

Text  | Tokens
Text1 | "this" "is" "an" "example" "this" "is" "an" "example"
Text2 | "this" "is" "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" "an" "example" "this" "is" "an" "example"

The result I want should look like this (with two chunks instead of twenty):

Text  | Chunk1                                 | Chunk2
Text1 | "this" "is" "an" "example"             | "this" "is" "an" "example"
Text2 | "this" "is"                            | "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" | "an" "example" "this" "is" "an" "example"

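I know about the tokens_chunk function in quanteda. However, that function only lets me create chunks of a fixed size (for example, each chunk consisting of two tokens), which leaves me with a different number of chunks per text. In addition, the size argument of tokens_chunk must be an integer, which is why I cannot simply run chunks <- tokens_chunk(text_docs, size = ntoken(text_docs)/20).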
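To make the mismatch concrete, here is a small base R sketch that only does the arithmetic and does not call quanteda. The token counts are those of the three example texts; `doc_lengths`, `chunk_size`, and `desired_chunks` are names chosen for illustration.

```r
# Token counts of the three example documents
doc_lengths <- c(Text1 = 8, Text2 = 4, Text3 = 12)

# A fixed chunk size of two tokens per chunk
chunk_size <- 2

# Number of chunks each document ends up with: longer texts get more chunks
n_chunks <- ceiling(doc_lengths / chunk_size)
print(n_chunks)

# Dividing by the desired number of chunks generally gives non-integer sizes
desired_chunks <- 20
print(doc_lengths / desired_chunks)
```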

Any ideas?

Thanks in advance.

nlp

quanteda

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-11-24 17:50:21

library("quanteda")
## Package version: 2.1.2

toks <- c(
  Text1 = "this is an example this is an example",
  Text2 = "this is an example",
  Text3 = "this is an example this is an example this is an example"
) %>%
  tokens()

toks
## Tokens consisting of 3 documents.
## Text1 :
## [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
## [8] "example"
## 
## Text2 :
## [1] "this"    "is"      "an"      "example"
## 
## Text3 :
##  [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
##  [8] "example" "this"    "is"      "an"      "example"

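There is a way to do what you want. We apply a function over the document names, slice out each document, and split it using tokens_chunk() with a size equal to half its length. I also use ceiling() here, so that if a document has an odd number of tokens, the first chunk gets one more token than the second. (Your examples all have an even number of tokens, but this handles the odd case too.)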

lis <- lapply(
  docnames(toks),
  function(x) tokens_chunk(toks[x], size = ceiling(ntoken(toks[x]) / 2))
)
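As a quick check of the ceiling() choice, this base R sketch computes the two chunk lengths that a chunk size of ceiling(len / 2) implies for a few document lengths. It does not call quanteda, and the lengths and variable names are made up for illustration.

```r
# Hypothetical document lengths, including odd ones
len <- c(4, 5, 7, 8)

# Chunk size used in the answer: half the length, rounded up
size <- ceiling(len / 2)

# The first chunk takes `size` tokens, the second takes whatever remains
first_chunk  <- size
second_chunk <- len - size

print(data.frame(len, first_chunk, second_chunk))
```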

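The lapply() call returns a list of chunked tokens objects, which you can recombine with c(), the function that concatenates tokens. Use do.call() to apply it to the whole list.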

do.call("c", lis)
## Tokens consisting of 6 documents.
## Text1.1 :
## [1] "this"    "is"      "an"      "example"
## 
## Text1.2 :
## [1] "this"    "is"      "an"      "example"
## 
## Text2.1 :
## [1] "this" "is"  
## 
## Text2.2 :
## [1] "an"      "example"
## 
## Text3.1 :
## [1] "this"    "is"      "an"      "example" "this"    "is"     
## 
## Text3.2 :
## [1] "an"      "example" "this"    "is"      "an"      "example"
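For more than two chunks, note that a size of ceiling(len / n) does not always produce exactly n chunks: 12 tokens with n = 5 gives a size of 3 and therefore only 4 chunks. One possible alternative, sketched here in base R, assumes the tokens are available as a named list of character vectors (as.list(toks) should return one for a quanteda tokens object). `split_equal` is a hypothetical helper, not a quanteda function, and `docs` stands in for the example data.

```r
# Hypothetical helper: split a vector into n contiguous, near-equal parts.
# The first (length(x) %% n) parts receive one extra element each.
split_equal <- function(x, n) {
  stopifnot(n >= 1, length(x) >= n)
  base   <- length(x) %/% n                 # minimum size of every part
  extra  <- length(x) %% n                  # leftover elements to distribute
  sizes  <- base + (seq_len(n) <= extra)    # part sizes, larger parts first
  groups <- rep(seq_len(n), times = sizes)  # group label for each position
  unname(split(x, groups))
}

# Stand-in for as.list(toks): a named list of character vectors
docs <- list(
  Text1 = rep(c("this", "is", "an", "example"), 2),
  Text2 = c("this", "is", "an", "example"),
  Text3 = rep(c("this", "is", "an", "example"), 3)
)

# Split every document into two chunks
chunks <- lapply(docs, split_equal, n = 2)
print(chunks$Text2)

# Part sizes for 12 elements split five ways
print(lengths(split_equal(1:12, 5)))
```

The helper stops with an error when a document has fewer than n tokens, since it cannot then form n non-empty chunks; in the example data, Text2 has only 4 tokens and could not be split into 20 parts.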

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/64989835
