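I have a text like the following.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize it using tidytext and the code below,
library(tidyverse)  # dplyr, stringr, purrr, tidyr
AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)
it gives me the tokens along with their start and end positions. How can I also get the POS tags while unnesting? Something like the following (the POS tags in the image below may not be correct).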
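For reference, the whitespace-based pattern in the question keeps punctuation attached to the preceding word, whereas a POS tagger normally treats punctuation marks as separate tokens. The sketch below compares the two patterns on the example sentence. It uses stringi directly instead of stringr, which is a substitution made here for illustration; the variable names are hypothetical.

```r
library(stringi)

text <- "If an infusion reaction occurs, interrupt the infusion."

# Pattern from the question: runs of non-whitespace characters.
ws_tokens <- stri_extract_all_regex(text, "[^\\s]+")[[1]]

# Pattern that splits punctuation marks off as their own tokens.
punct_tokens <- stri_extract_all_regex(text, "\\w+|[[:punct:]]")[[1]]

length(ws_tokens)     # 8 tokens
length(punct_tokens)  # 10 tokens
ws_tokens[5]          # "occurs," (comma still attached)
```

If the token boundaries need to line up with the tagger's output, the second pattern is the closer match for this sentence.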

Posted on 2018-08-15 23:28:55
You can use the udpipe package to get your POS data. udpipe tokenizes and tags punctuation automatically.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringsAsFactors = FALSE)
library(udpipe)
library(dplyr)
udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
df$Section)
x <- as.data.frame(x)
x %>% select(token, upos)
       token  upos
1         If SCONJ
2         an   DET
3   infusion  NOUN
4   reaction  NOUN
5     occurs  NOUN
6          , PUNCT
7  interrupt  VERB
8        the   DET
9   infusion  NOUN
10         . PUNCT
Now combine this with the results from your previous question. I took one of the answers from there.
library(stringr)
library(purrr)
library(tidyr)
df %>% mutate(
tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
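  unnest(tokens, locations) %>%  # with tidyr >= 1.0.0, write unnest(c(tokens, locations))
  # tag each token on its own; tokenizer = "vertical" treats the input as already tokenized
  mutate(POS = purrr::map_chr(tokens, function(x) as.data.frame(udpipe_annotate(udmodel, x = x, tokenizer = "vertical"))$upos))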
      tokens start end   POS
1         If     1   2 SCONJ
2         an     4   5   DET
3   infusion     7  14  NOUN
4   reaction    16  23  NOUN
5     occurs    25  30  NOUN
6          ,    31  31 PUNCT
7  interrupt    33  41  VERB
8        the    43  45   DET
9   infusion    47  54  NOUN
10         .    55  55 PUNCT
Edit: a better solution
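The best solution, though, is to start from udpipe and do the rest afterwards. Note that I am using the stringi package here rather than stringr. stringr is built on top of stringi, but stringi offers more options.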
x <- udpipe_annotate(udmodel, x = df$Section)
x %>%
as_data_frame %>%
select(token, POSTag = upos) %>% # select needed columns
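  # add start/end locations; note that stri_locate() returns only the first match,
  # so a token that occurs more than once gets the position of its first occurrence
  mutate(locations = map(token, function(x) data.frame(stringi::stri_locate(df$Section, fixed = x)))) %>%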
unnest
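One caveat about the lookup above: because each token is searched for independently from the start of the sentence, the second "infusion" in the output that follows (row 9) is reported at 7-14 rather than 47-54. A minimal sketch of one way around this is to walk through the tokens in order and resume each search after the previous match. The helper name `locate_tokens_in_order` is hypothetical and not part of udpipe or stringi, and the sketch assumes the tokens appear in the text in order and verbatim.

```r
library(stringi)

# Locate each token in order, resuming the search after the previous match,
# so that repeated tokens get distinct positions.
locate_tokens_in_order <- function(text, tokens) {
  start <- integer(length(tokens))
  end <- integer(length(tokens))
  offset <- 0L  # number of characters of `text` already consumed
  for (i in seq_along(tokens)) {
    rest <- stri_sub(text, from = offset + 1L)
    hit <- stri_locate_first_fixed(rest, tokens[i])
    if (is.na(hit[1, "start"])) {
      # token not found in the remaining text: leave its position missing
      start[i] <- NA_integer_
      end[i] <- NA_integer_
    } else {
      start[i] <- offset + hit[1, "start"]
      end[i] <- offset + hit[1, "end"]
      offset <- end[i]
    }
  }
  data.frame(token = tokens, start = start, end = end, stringsAsFactors = FALSE)
}

text <- "If an infusion reaction occurs, interrupt the infusion."
tokens <- c("If", "an", "infusion", "reaction", "occurs", ",",
            "interrupt", "the", "infusion", ".")
res <- locate_tokens_in_order(text, tokens)
res
```

With the tokens taken from the `token` column of the udpipe output, the resulting `start` and `end` columns could be bound to the POS tags instead of the `map()` lookup.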
# A tibble: 10 x 4
   token     POSTag start   end
   <chr>     <chr>  <int> <int>
 1 If        SCONJ      1     2
 2 an        DET        4     5
 3 infusion  NOUN       7    14
 4 reaction  NOUN      16    23
 5 occurs    NOUN      25    30
 6 ,         PUNCT     31    31
 7 interrupt VERB      33    41
 8 the       DET       43    45
 9 infusion  NOUN       7    14
10 .         PUNCT     55    55
Posted on 2018-09-25 20:59:46
FYI: as of udpipe version 0.7 on CRAN, you can do it as follows.
library(udpipe)
x <- data.frame(doc_id = c("doc1", "doc2"),
                text = c("If an infusion reaction occurs, interrupt the infusion.",
                         "Houston we have a problem"),
                stringsAsFactors = FALSE)
x <- udpipe(x, "english")
x
This gives you the following (note the start/end columns as well as the upos/xpos tags you are looking for):
doc_id paragraph_id sentence_id start end term_id token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 1 2 1 1 If if SCONJ IN <NA> 7 mark <NA> <NA>
doc1 1 1 4 5 2 2 an a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc1 1 1 7 14 3 3 infusion infusion NOUN NN Number=Sing 4 compound <NA> <NA>
doc1 1 1 16 23 4 4 reaction reaction NOUN NN Number=Sing 5 compound <NA> <NA>
doc1 1 1 25 30 5 5 occurs occur NOUN NNS Number=Plur 7 nsubj <NA> SpaceAfter=No
doc1 1 1 31 31 6 6 , , PUNCT , <NA> 7 punct <NA> <NA>
doc1 1 1 33 41 7 7 interrupt interrupt VERB VB Mood=Imp|VerbForm=Fin 0 root <NA> <NA>
doc1 1 1 43 45 8 8 the the DET DT Definite=Def|PronType=Art 9 det <NA> <NA>
doc1 1 1 47 54 9 9 infusion infusion NOUN NN Number=Sing 7 obj <NA> SpaceAfter=No
doc1 1 1 55 55 10 10 . . PUNCT . <NA> 7 punct <NA> SpacesAfter=\\n
doc2 1 1 1 7 1 1 Houston Houston PROPN NNP Number=Sing 0 root <NA> <NA>
doc2 1 1 9 10 2 2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 3 nsubj <NA> <NA>
doc2 1 1 12 15 3 3 have have VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 1 parataxis <NA> <NA>
doc2 1 1 17 17 4 4 a a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc2 1 1 19 25 5 5 problem problem NOUN NN Number=Sing 3 obj <NA> SpacesAfter=\\n
Posted on 2018-08-15 23:37:56
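As in the previous answer, I think udpipe is probably the easiest way to get POS tags. My favorite way of interacting with udpipe is through the cleanNLP package. After calling the initialization function, it takes only two lines of code to get the udpipe output.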
library(tidyverse)
library(cleanNLP)
cnlp_init_udpipe()
#> Loading required namespace: udpipe
df <- data_frame(id = 1,
text = c("If an infusion reaction occurs, interrupt the infusion."))
cnlp_annotate(df) %>%
cnlp_get_tif()
#> # A tibble: 10 x 19
#> id sid tid word lemma upos pos cid pid definite mood
#> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl> <int> <chr> <chr>
#> 1 1 1 1 If if SCONJ IN 0 1 <NA> <NA>
#> 2 1 1 2 an a DET DT 3 1 Ind <NA>
#> 3 1 1 3 infu… infu… NOUN NN 6 1 <NA> <NA>
#> 4 1 1 4 reac… reac… NOUN NN 15 1 <NA> <NA>
#> 5 1 1 5 occu… occur NOUN NNS 24 1 <NA> <NA>
#> 6 1 1 6 , , PUNCT , 30 1 <NA> <NA>
#> 7 1 1 7 inte… inte… VERB VB 32 1 <NA> Imp
#> 8 1 1 8 the the DET DT 42 1 Def <NA>
#> 9 1 1 9 infu… infu… NOUN NN 46 1 <NA> <NA>
#> 10 1 1 10 . . PUNCT . 54 1 <NA> <NA>
#> # ... with 8 more variables: number <chr>, pron_type <chr>,
#> # verb_form <chr>, source <int>, relation <chr>, word_source <chr>,
#> #   lemma_source <chr>, spaces <dbl>
Created on 2018-08-15 by the reprex package (v0.2.0).
https://stackoverflow.com/questions/51861346