Q: R: How to create clusters based on row strings

Stack Overflow user

Asked 2017-11-13 14:05:34

Answers: 1, views: 678, followers: 0, votes: 3

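I am trying to create clusters from my data based on the string value of each row. I am using R. What I call a "cluster" is a broad theme (= family) that can characterize each keyword. I imagine some kind of automatic generation based on the keywords, perhaps using lemmatization or n-grams.

For example, the keywords "cloud service" and "cloud services" should both end up in the "service" cluster.

This is my input vector: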

keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service",
                 "free cloud storage", "what is cloud computing", "best cloud storage", "cloud computing definition",
                 "amazon cloud services", "cloud service providers", "cloud services", "google cloud computing",
                 "cloud computing services", "benefits of cloud computing")

Here is the expected output:

| Keyword                     | Thematic  |
|-----------------------------|:---------:|
| cloud storage               | storage   |
| cloud computing             | computing |
| google cloud storage        | storage   |
| the cloud service           | service   |
| free cloud storage          | storage   |
| what is cloud computing     | computing |
| best cloud storage          | storage   |
| cloud computing definition  | computing |
| amazon cloud services       | service   |
| cloud service providers     | service   |
| cloud services              | service   |
| google cloud computing      | computing |
| cloud computing services    | service   |
| benefits of cloud computing | computing |

The goal is to clean the data in the Keyword column and automatically extract some kind of lemma or n-gram.
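To make "extracting n-grams" concrete, here is a minimal base-R sketch that counts unigrams and bigrams over a few of the keywords. The helper `make_bigrams` and the four-keyword subset are assumptions made for illustration; they are not part of the original post.

```r
keywords <- c("cloud storage", "cloud computing", "google cloud storage",
              "cloud computing services")

# Split each keyword into lower-case word tokens (unigrams)
tokens <- strsplit(tolower(keywords), "\\s+")

# Hypothetical helper: adjacent word pairs for one token vector;
# returns character(0) when there are fewer than two words
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Frequency tables, most frequent first
unigram_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
bigram_counts  <- sort(table(unlist(lapply(tokens, make_bigrams))), decreasing = TRUE)

unigram_counts
bigram_counts
```

In this subset "cloud" appears in every keyword, so it carries no information about the theme, which is why it is treated as a stop word in the steps that follow.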

Here is what I have done so far:

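Create the "Thematic" column from the Keyword column:

library(dplyr)
library(tm)

# Assumption: the vector above is first wrapped in a data frame with a Keyword column
keywords_df <- data.frame(Keyword = keywords_df, stringsAsFactors = FALSE)
keywords_df <- mutate(keywords_df, Thematic = Keyword)
keywords_df$Thematic <- as.character(keywords_df$Thematic)

Remove stop words:

stopwords_list <- c("cloud")  # remove the main word
stopwords <- stopwords(kind = "en")
stopwords <- append(stopwords, stopwords_list)
x <- keywords_df$Thematic
x <- removeWords(x, stopwords)
keywords_df$Thematic <- x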
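One possible next step, sketched in base R rather than with `tm`: after removing stop words, label each keyword with whichever of its remaining words is most frequent across the whole list. The stop-word list, the trailing-"s" stripping (a very crude stand-in for lemmatization), and the six-keyword subset are all assumptions for illustration.

```r
keywords <- c("cloud storage", "cloud computing", "the cloud service",
              "what is cloud computing", "cloud services",
              "cloud computing services")

# Hypothetical stop-word list: a few function words plus the shared word "cloud"
stop_words <- c("the", "what", "is", "of", "cloud")

# Tokenise, drop stop words, then strip a trailing "s" (crude plural handling)
tokens <- lapply(strsplit(tolower(keywords), "\\s+"), function(words) {
  words <- words[!words %in% stop_words]
  sub("s$", "", words)
})

# Frequency of each remaining token over all keywords
freq <- table(unlist(tokens))

# For each keyword, keep the remaining token that is most frequent overall
thematic <- vapply(tokens, function(words) {
  if (length(words) == 0) return(NA_character_)
  words[which.max(freq[words])]
}, character(1))

data.frame(Keyword = keywords, Thematic = thematic)
```

Ties go to the earlier word, so "cloud computing services" is labelled "computing" here rather than "service" as in the expected table. The heuristic alone does not settle which theme should win for such keywords.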

Tags: nlp, n-gram, lemmatization

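1 Answer

Stack Overflow user

Accepted answer

Answered 2017-11-13 15:47:58

You can check for the presence of specific words, here storage, computing, and service. That way you can test, for each entry in df, whether a given word occurs in it.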

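# df is assumed to hold the character vector of keywords from the question
fams   <- c("storage", "computing", "service")
family <- rep("empty_fam", length(df))

# Later families overwrite earlier ones when a keyword contains several of them
for (fam in fams) {
  family[grepl(fam, df)] <- fam
}
cbind(Keywords = df, family)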
#      Keywords                      family     
# [1,] "cloud storage"               "storage"  
# [2,] "cloud computing"             "computing"
# ...
#[13,] "cloud computing services"    "service"  
#[14,] "benefits of cloud computing" "computing"

Of course, there are better ways to do this.

Edit: a better way, using the stringr package

library(stringr)
# str_extract() returns the first (leftmost) match in each string
family <- str_extract(df, pattern = "storage|computing|service")
cbind(df, family)
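The loop and the single alternation pattern can disagree when a keyword contains more than one family name. A minimal base-R sketch of the difference, on a three-keyword subset; the names `leftmost` and `priority` are illustrative and not from the original answer:

```r
keywords <- c("cloud storage", "cloud computing services", "the cloud service")

# Leftmost match: what a single alternation pattern returns
m <- regexpr("storage|computing|service", keywords)
leftmost <- rep(NA_character_, length(keywords))
leftmost[m != -1] <- regmatches(keywords, m)

# Priority match: later entries in `fams` overwrite earlier ones
fams <- c("storage", "computing", "service")
priority <- rep(NA_character_, length(keywords))
for (fam in fams) {
  priority[grepl(fam, keywords, fixed = TRUE)] <- fam
}

data.frame(keywords, leftmost, priority)
```

For "cloud computing services" the leftmost match is "computing", while the loop ends on "service". If the loop's result is the one you want, the order of `fams` is what encodes that preference.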

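Edit 2: I saw your latest edit, which indicates that you are looking for family descriptions that are not specified in advance. In that case, the first approach that comes to mind is Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis).

LDA analyzes a corpus of documents and identifies latent topics as distributions over words (see terms(lda.output) below), and identifies which documents belong to which topic (see topics(lda.output) below):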

library(topicmodels)
library(tm)

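# Preliminary text mining: build a corpus with one document per keyword
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, removeWords, "cloud")   # drop the word shared by every keyword
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)           # stemming requires the SnowballC package

# Document-term matrix with raw term-frequency weighting
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))

# Fit LDA with 3 topics; initialisation is random, so results can vary between runs
lda.output <- LDA(TF, k = 3)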
terms(lda.output)
# Topic 1  Topic 2  Topic 3 
# "servic" "comput" "storag"

cbind(df, terms(lda.output)[topics(lda.output)])
#            df                                    
#Topic 3 "cloud storage"               "storag"
#Topic 2 "cloud computing"             "comput"
#Topic 3 "google cloud storage"        "storag"
#Topic 1 "cloud services"              "servic"
#Topic 3 "free cloud storage"          "storag"
#Topic 2 "what is cloud computing"     "comput"
#Topic 3 "best cloud storage"          "storag"
#Topic 1 "cloud computing definition"  "servic"
#Topic 1 "amazon cloud services"       "servic"
#Topic 3 "cloud service providers"     "storag"
#Topic 2 "google cloud services"       "comput"
#Topic 2 "google cloud computing"      "comput"
#Topic 1 "cloud computing services"    "servic"
#Topic 2 "benefits of cloud computing" "comput"

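Final note: if you want to get "computing" instead of "comput" and so on, change the stemming step of the text-mining preprocessing. You can also skip stemming altogether, but then "service" and "services" will not be treated as the same word. You could, however, manually replace "services" with "service", or vice versa.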
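Both manual options from the note can be sketched in base R. The lookup table `labels` is a hypothetical hand-written mapping, and `stems` only mimics the kind of values that come out of the stemmed LDA output above:

```r
# Option 1: skip stemming and collapse the plural onto the singular by hand.
# \\b restricts the replacement to the whole word "services".
keywords   <- c("cloud services", "the cloud service", "cloud computing services")
normalised <- gsub("\\bservices\\b", "service", keywords, perl = TRUE)

# Option 2: keep stemming and map each stem back to a readable label
stems    <- c("storag", "comput", "servic", "comput")
labels   <- c(storag = "storage", comput = "computing", servic = "service")
readable <- unname(labels[stems])

normalised
readable
```

The lookup table has to be maintained by hand, which is workable for three families but does not scale to an open-ended vocabulary.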

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/47266183
