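I am trying to create clusters from my data based on the string value of each row, using R. What I call a "cluster" is a broad theme (a family) that characterizes each keyword. I imagine some kind of automatic generation based on the keywords, perhaps using lemmatization or n-grams.
For example, the keywords "cloud service" and "cloud services" should both end up in the "service" cluster.
Here is my input vector: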
keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service",
"free cloud storage", "what is cloud computing", "best cloud storage","cloud computing definition",
"amazon cloud services", "cloud service providers", "cloud services", "google cloud computing", "cloud computing services", "benefits of cloud computing")
Here is the expected output:
| Keyword | Thematic |
|---------------------------|:---------:|
|cloud storage |storage |
|cloud computing |computing|
|google cloud storage |storage |
|the cloud service |service |
|free cloud storage |storage |
|what is cloud computing |computing|
|best cloud storage |storage |
|cloud computing definition |computing|
|amazon cloud services |service |
|cloud service providers |service |
|cloud services |service |
|google cloud computing |computing|
|cloud computing services |service |
|benefits of cloud computing|computing|
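The goal is to clean the data in the "Keyword" column and automatically extract some kind of lemma or n-gram.
Here is what I have done so far:
Posted on 2017-11-13 15:47:58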
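You can check each keyword for the presence of specific words: `storage`, `computing`, and `service`. That is, for each of these words you check whether it occurs in the entries of `df`.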
df <- keywords_df  # assumption: df is the input vector from the question
fams <- c("storage", "computing", "service")
family <- rep("empty_fam", length(df))  # placeholder for keywords that match no family
for (fam in fams) {
  # later entries in fams overwrite earlier matches
  family[grepl(fam, df)] <- fam
}
cbind(df, family)
#       df                            family
#  [1,] "cloud storage"               "storage"
#  [2,] "cloud computing"             "computing"
#  ...
# [13,] "cloud computing services"    "service"
# [14,] "benefits of cloud computing" "computing"
Of course, there are probably better ways to do this.
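One thing worth knowing about the loop approach: when a keyword contains more than one of the family words, the last matching entry in `fams` wins, because each pass overwrites the previous assignment. Here is a minimal base R sketch of that behavior. The keyword "cloud backup" is made up for illustration and is not part of the question's data, and `NA` is used as the placeholder instead of a string.

```r
# Small illustrative sample; "cloud backup" is a hypothetical keyword
keywords <- c("cloud storage", "cloud computing services", "cloud backup")
fams <- c("storage", "computing", "service")

# Start with NA so unmatched keywords are easy to spot
family <- rep(NA_character_, length(keywords))
for (fam in fams) {
  # fixed = TRUE treats fam as a literal substring, not a regex
  family[grepl(fam, keywords, fixed = TRUE)] <- fam
}

data.frame(keyword = keywords, family = family)
```

So if you care which family takes priority, put it last in `fams`.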
Edit: a better way, using the `stringr` package:
library(stringr)
family <- str_extract(df, pattern="storage|computing|service")
cbind(df, family)
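If you would rather not depend on `stringr`, something similar can be done in base R with `regexpr()` and `regmatches()`. This is only a sketch on a small sample (again, "cloud backup" is a made-up keyword). Note that a pattern with alternation returns the leftmost match in the string, so "cloud computing services" is labeled "computing" here.

```r
# Small illustrative sample; "cloud backup" is a hypothetical keyword
keywords <- c("cloud storage", "the cloud service",
              "cloud computing services", "cloud backup")
pattern <- "storage|computing|service"

# regexpr() gives the position of the first match, or -1 if there is none
m <- regexpr(pattern, keywords)
hit <- as.integer(m) > 0

# regmatches() returns the matched text only for the keywords that matched
family <- rep(NA_character_, length(keywords))
family[hit] <- regmatches(keywords, m)

data.frame(keyword = keywords, family = family)
```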
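Edit 2: I saw your recent edit indicating that you are looking for family descriptions that are not specified in advance. In that case, the first approach that comes to mind is Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis).
LDA analyzes a corpus of documents, identifies latent topics as distributions over words (see `terms(lda.output)` below), and identifies which documents belong to which topic (see `topics(lda.output)` below):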
library(topicmodels)
library(tm)  # stemDocument() below also needs the SnowballC package installed
# Preliminary textmining
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, removeWords, "cloud")
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
# Term Frequency matrix
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=3)
terms(lda.output)
# Topic 1 Topic 2 Topic 3
# "servic" "comput" "storag"
cbind(df, terms(lda.output)[topics(lda.output)])
# df
#Topic 3 "cloud storage" "storag"
#Topic 2 "cloud computing" "comput"
#Topic 3 "google cloud storage" "storag"
#Topic 1 "cloud services" "servic"
#Topic 3 "free cloud storage" "storag"
#Topic 2 "what is cloud computing" "comput"
#Topic 3 "best cloud storage" "storag"
#Topic 1 "cloud computing definition" "servic"
#Topic 1 "amazon cloud services" "servic"
#Topic 3 "cloud service providers" "storag"
#Topic 2 "google cloud services" "comput"
#Topic 2 "google cloud computing" "comput"
#Topic 1 "cloud computing services" "servic"
#Topic 2 "benefits of cloud computing" "comput"
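In case the `terms(lda.output)[topics(lda.output)]` indexing looks cryptic: the first part is a vector with one top term per topic, the second is the topic number assigned to each document, and indexing one by the other gives one label per document. Here is a sketch with plain vectors standing in for the two LDA results. The names `top_terms` and `doc_topics` are mine, and the values are just copied from the first four rows of the output above.

```r
# Stand-in for terms(lda.output): the most likely term of each topic
top_terms <- c("Topic 1" = "servic", "Topic 2" = "comput", "Topic 3" = "storag")

# Stand-in for topics(lda.output): the topic number assigned to each document
doc_topics <- c(3L, 2L, 3L, 1L)

# Indexing by topic number looks up the label for every document
labels <- top_terms[doc_topics]
unname(labels)
```

Keep in mind that the topic numbering from LDA can change from run to run, so the labels should always be looked up this way rather than hard-coded.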
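Final note: if you want to get "computing" rather than "comput", etc., you should change the stemming step in the text mining. You can also skip stemming altogether, but then "service" and "services" will not be treated as the same word. You could, however, manually replace "services" with "service", or vice versa.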
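The manual replacement could look something like this, done on the raw keywords before building the corpus. It is a rough sketch for this one plural only, not a general lemmatizer, and the variable names are mine.

```r
keywords <- c("amazon cloud services", "cloud service providers",
              "cloud services", "cloud computing services")

# \\b marks a word boundary, so only the whole word "services" is replaced
normalized <- gsub("\\bservices\\b", "service", keywords, perl = TRUE)
normalized
```

You would then pass `normalized` to `VectorSource()` and drop the `stemDocument` step.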
https://stackoverflow.com/questions/47266183