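I'm trying to figure out why one of my regex calls works but the other doesn't. The newline junk left over from scraping is consistent, so I tried to take advantage of that as much as possible: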
"\n\tMenghe a'Nyam\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 215lb (196cm,
97kg) \n \n\n \n\n \n \n \n\n School: Canisius\n\n\n\n\n\n More player info\n\n\n\n\n\n"
"\n\tJordan Aaberg\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Guard\n\n\n\n 6-9, 225lb (206cm,
102kg) \n \n\n Hometown: Rothsay, MN\n\n\n\n \n\n High School: Rothsay\n\n\n\n \n \n \n\n
School: North Dakota State\n\n\n\n\n\n More player info\n\n\n\n\n\n"
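My goal is to extract data such as the position (Forward, Guard) and, most importantly, the height (6-5 and 6-9, respectively). I managed to get the position with the following: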
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)"))
However, when I add a second extraction for the height using a similar lookaround, it returns NA:
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)")) %>%
mutate(height = str_extract(player, "(?<=\\w+\n\n\n\n ).*?(?=, \\d{3}lb)"))
Here is the dput() output for the first 3 rows of my df after the call above, in case it helps:
structure(list(player = c("\n\tMenghe a'Nyam\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 215lb (196cm, 97kg) \n \n\n \n\n \n \n \n\n School: Canisius\n\n\n\n\n\n More player info\n\n\n\n\n\n" ,
"\n\tJordan Aaberg\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-9, 225lb (206cm, 102kg) \n \n\n Hometown: Rothsay, MN\n\n\n\n \n\n High School: Rothsay\n\n\n\n \n \n \n\n School: North Dakota State\n\n\n\n\n\n More player info\n\n\n\n\n\n" ,
"\n\tKarl Aaker\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 210lb (196cm, 95kg) \n \n\n Hometown: Reno, NV\n\n\n\n \n\n \n \n \n\n School: Portland\n\n\n\n\n\n More player info\n\n\n\n\n\n"
), position = c("Forward", "Forward", "Forward"), height = c(NA_character_,
NA_character_, NA_character_)), row.names = c(NA, 3L), class = "data.frame")
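Posted on 2020-05-30 21:11:54
You can remove the + after \w, because the ICU regex engine does not support lookbehind patterns of unbounded length, and use \s to match any whitespace character: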
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)")) %>%
mutate(height = str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)"))
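As a quick check of the bounded lookbehind, here is a minimal sketch on a single made-up string. The string x is an assumption: it has exactly two spaces after the four newlines, which is what \s{2} expects, and the whitespace in the real scraped strings may differ. The second call is a different technique, not part of the pattern above: it uses a capture group with str_match() instead of a lookbehind, so it does not depend on the exact whitespace count.

```r
library(stringr)

# Hypothetical input: two spaces after the four newlines (an assumption;
# the real scraped strings may differ in their exact whitespace).
x <- "Position:\n \n Forward\n\n\n\n  6-5, 215lb (196cm, 97kg)"

# Bounded lookbehind: one word char, four newlines, two whitespace chars.
height <- str_extract(x, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)")

# Alternative without a lookbehind: capture the height after any run of
# whitespace and keep only the first capture group (column 2 of the matrix).
height2 <- str_match(x, "\\w\\s+(\\d-\\d{1,2}),\\s+\\d{3}lb")[, 2]

print(height)
print(height2)
```

If the number of spaces before the height varies from row to row, the capture-group version is probably the safer of the two.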
Details
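(?<=\w\n{4}\s{2}) - immediately before the match, there must be a word char, then 4 newlines, then any 2 whitespace chars
.*? - any 0 or more chars other than line break chars, as few as possible
(?=,\s+\d{3}lb) - immediately after the match, there must be a comma, 1 or more whitespace chars, 3 digits and the lb substring.
Posted on 2020-05-30 20:50:47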
Here is an approach using stringr and tidyr. First, I removed all the \n and \t, because they really annoyed me.
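test <- df %>%
# strip newlines, carriage returns and tabs before matching
mutate(player = str_replace_all(player, "\n|\r|\t", ""),
position = str_extract(player, "(?<=Position:).+(?=\\s\\d-\\d)"),
# \\d{1,2} rather than \\d? so that heights such as 6-10 also match
height = str_extract(player, "\\d-\\d{1,2}(?=,\\s\\d{3}lb)"))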
We can do all of the mutations in a single mutate() step. Hope this solves your problem.
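A related variation, sketched here under assumptions: instead of deleting the \n and \t characters, str_squish() from stringr collapses every run of whitespace to a single space, so the patterns only need to account for single spaces. The input string x and the names clean, position and height are made up for illustration, and the patterns assume the height is written as feet-inches followed by a three-digit weight in lb.

```r
library(stringr)

# Hypothetical input modelled on the scraped strings in the question.
x <- "\n\tSome Player\n\t\n\n \n Position:\n \n Forward\n\n\n\n 6-10, 225lb (208cm, 102kg) \n \n\n School: Some College\n\n\n"

# Trim the ends and collapse each whitespace run (spaces, \n, \t) to one space.
clean <- str_squish(x)

# Everything between "Position: " and the space that precedes the height.
position <- str_extract(clean, "(?<=Position: ).+?(?= \\d-\\d)")

# Feet-inches value that is followed by a comma, a space and a 3-digit weight.
height <- str_extract(clean, "\\d-\\d{1,2}(?=, \\d{3}lb)")

print(position)
print(height)
```

The same three calls can be placed inside one mutate() on the player column.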
https://stackoverflow.com/questions/62107864