Q: Identifying nouns in a quanteda corpus

Stack Overflow user

Asked 2017-08-24 08:52:47

1 answer · 1.5K views · 0 followers · score 3

I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.

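My corpus contains texts with full German sentences, and I want to work only with the nouns of each text. One trick in German is to keep only the capitalised words, but this fails at the beginning of a sentence, where every word is capitalised.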
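To make the failure concrete, here is a minimal base-R sketch of the capitalisation heuristic applied to one of the example sentences. The regular expressions and variable names are illustrative assumptions and are not part of quanteda.

```r
# Capitalisation heuristic: keep every word that starts with an upper-case letter.
txt <- "In Hamburg regnet es immer, das ist also so wie in London."

# Split the sentence into alphabetic words (punctuation is dropped).
tokens <- regmatches(txt, gregexpr("[[:alpha:]]+", txt))[[1]]

# Keep only the words whose first character is upper case.
capitalised <- tokens[grepl("^[[:upper:]]", tokens)]
capitalised
# [1] "In"      "Hamburg" "London"
```

The sentence-initial preposition "In" is returned alongside the two place names, which is exactly the false positive described above.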

library("quanteda")

Text1 <- c("Halle an der Saale ist die grünste Stadt Deutschlands")
Text2 <- c("In Hamburg regnet es immer, das ist also so wie in London.")
Text3 <- c("James Bond trinkt am liebsten Martini")

myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = TRUE)

# Keep the original case so that capitalised words remain distinguishable
myDfm <- dfm(myCorpus, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE,
             remove_separators = TRUE)

topfeatures(myDfm, 20)

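From this minimal example, I would like to retrieve: "Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".

I assume I need a dictionary that defines verbs, nouns, etc. as well as proper nouns (James Bond, Hamburg, etc.), or is there a built-in function or dictionary for this?

Bonus question: does this solution also work for English texts?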

quanteda

spacy

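Stack Overflow user

Accepted answer

Answered 2017-08-24 14:14:27

You need some help from a part-of-speech tagger. Fortunately, there is a great one with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are on the spacyr page.

This code will do what you want: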

txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
         "In Hamburg regnet es immer, das ist also so wie in London.",
         "James Bond trinkt am liebsten Martini")

library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)

head(txtparsed, 20)
#    doc_id sentence_id token_id        token        lemma   pos   tag entity
# 1   text1           1        1        Halle        halle PROPN    NE  LOC_B
# 2   text1           1        1           an           an   ADP  APPR  LOC_I
# 3   text1           1        1          der          der   DET   ART  LOC_I
# 4   text1           1        1        Saale        saale PROPN    NE  LOC_I
# 5   text1           1        1          ist          ist   AUX VAFIN       
# 6   text1           1        1          die          die   DET   ART       
# 7   text1           1        1      grünste      grünste   ADJ  ADJA       
# 8   text1           1        1        Stadt        stadt  NOUN    NN       
# 9   text1           1        1 Deutschlands deutschlands PROPN    NE  LOC_B
# 10  text2           1        1           In           in   ADP  APPR       
# 11  text2           1        1      Hamburg      hamburg PROPN    NE  LOC_B
# 12  text2           1        1       regnet       regnet  VERB VVFIN       
# 13  text2           1        1           es           es  PRON  PPER       
# 14  text2           1        1        immer        immer   ADV   ADV       
# 15  text2           1        1            ,            , PUNCT    $,       
# 16  text2           1        1          das          das  PRON   PDS       
# 17  text2           1        1          ist          ist   AUX VAFIN       
# 18  text2           1        1         also         also   ADV   ADV       
# 19  text2           1        1           so           so   ADV   ADV       
# 20  text2           1        1          wie          wie  CONJ KOKOM    

(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle"        "Saale"        "Deutschlands" "Hamburg"      "London"      
# [6] "James"        "Bond"         "Martini"

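Here you can see that most of the nouns you wanted are marked in the simpler pos field as proper nouns ("PROPN"), while the common noun "Stadt" is marked as "NOUN". The tag field is a more detailed, German-language tagset that you could also select from.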
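As a sketch of selecting on the tag field instead, the snippet below filters on the two tag values that appear for nouns in the output above ("NN" and "NE"). To keep it runnable without spaCy installed, `parsed` is a hypothetical hand-built stand-in for a few rows of `txtparsed`, copied from that output. Whether these two tags cover every noun in a larger corpus is an assumption to check against the tagset documentation.

```r
# Hypothetical stand-in for the first nine rows of `txtparsed` (values copied
# from the printed output), so that the example does not need spaCy.
parsed <- data.frame(
  token = c("Halle", "an", "der", "Saale", "ist", "die",
            "grünste", "Stadt", "Deutschlands"),
  pos   = c("PROPN", "ADP", "DET", "PROPN", "AUX", "DET",
            "ADJ", "NOUN", "PROPN"),
  tag   = c("NE", "APPR", "ART", "NE", "VAFIN", "ART",
            "ADJA", "NN", "NE"),
  stringsAsFactors = FALSE
)

# Select common and proper nouns in one step via the detailed tag field.
all_nouns <- with(parsed, subset(token, tag %in% c("NN", "NE")))
all_nouns
# [1] "Halle"        "Saale"        "Stadt"        "Deutschlands"

# The same selection expressed with the coarser pos field.
all_nouns_pos <- with(parsed, subset(token, pos %in% c("NOUN", "PROPN")))
```

With the real `txtparsed` object, the same `subset()` calls apply unchanged, since it is a data frame with `token`, `pos`, and `tag` columns.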

The lists of selected nouns can then be used in quanteda.

library("quanteda")
myDfm <- dfm(txt,  tolower = FALSE, remove_numbers = TRUE, 
             remove = stopwords("german"), remove_punct = TRUE)

head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
#        features
# docs    Halle Saale grünste Stadt Deutschlands Hamburg
#   text1     1     1       1     1            1       0
#   text2     0     0       0     0            0       1
#   text3     0     0       0     0            0       0

head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
#        features
# docs    Halle Saale Deutschlands Hamburg London James
#   text1     1     1            1       0      0     0
#   text2     0     0            0       1      1     0
#   text3     0     0            0       0      0     1

Score: 6


Original content provided by Stack Overflow: https://stackoverflow.com/questions/45857121