Given this data.table:
library(data.table)
dt <- data.table(f1 = c(
"stuffstuff-0000097125",
"stuffstuff.abc.0006496679",
"stuffstuff0007517235",
"stuffstuff_xyz.0007280719",
"stuffstuff0005995303",
"stuffstuff_a1b_0000143856",
"stuffstuff0009362407",
"stuffstuff.c44_0009735298"
))
I want to get the following result:
f1 parsed_val
1: stuffstuff-0000097125
2: stuffstuff.abc.0006496679 abc
3: stuffstuff0007517235
4: stuffstuff_xyz.0007280719 xyz
5: stuffstuff0005995303
6: stuffstuff_a1b_0000143856 a1b
7: stuffstuff0009362407
8: stuffstuff.c44_0009735298 c44
This is what I tried:
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"
dt[, `:=`(parsed_val = regmatches(f1, regexpr(pattern = rex_pattern, f1, perl = TRUE)))]
However, because of recycling, this is the result I get:
f1 parsed_val
1: stuffstuff-0000097125 abc
2: stuffstuff.abc.0006496679 xyz
3: stuffstuff0007517235 a1b
4: stuffstuff_xyz.0007280719 c44
5: stuffstuff0005995303 abc
6: stuffstuff_a1b_0000143856 xyz
7: stuffstuff0009362407 a1b
8: stuffstuff.c44_0009735298 c44
I tried using ifelse inside a function to return a fallback value when nothing is found:
getMmFromFilename <- function(my_file_name){
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"
nothing_found <- character(length = 0)
mm <- regmatches(my_file_name, regexpr(pattern = rex_pattern, my_file_name, perl = TRUE))
ifelse(identical(mm, nothing_found), "missing_Mm", mm)
}
dt[, .(parsed_val = getMmFromFilename(f1))]
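But this returned only a single value, abc. The documentation for regmatches says: "For vector match data (as obtained from regexpr), empty matches are dropped; for list match data, empty matches give empty components (zero-length character vectors)." I suspect the solution lies there, but I have not been able to find it yet.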
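The behaviour described in that passage can be reproduced with a minimal base R sketch. It uses four of the strings above and the same pattern. The variable names (x, m, hits, one) are only illustrative. It also shows why the ifelse attempt yields one value: the test built with identical() has length 1, so ifelse returns a length-1 result.

```r
# Four of the file names from the question
x <- c("stuffstuff-0000097125",
       "stuffstuff.abc.0006496679",
       "stuffstuff0007517235",
       "stuffstuff_xyz.0007280719")
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

m <- regexpr(rex_pattern, x, perl = TRUE)
as.integer(m)            # -1 marks an element without a match

# regmatches() drops the elements that did not match,
# so the result is shorter than the input
hits <- regmatches(x, m)
length(x)
length(hits)

# identical() returns a single TRUE/FALSE, so ifelse() returns one element
one <- ifelse(identical(hits, character(0)), "missing_Mm", hits)
one
```

A result that is shorter than the number of rows is consistent with the recycled column shown earlier.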
As for the solution, my workflow requires data.table, and a brief explanation of the solution would help a lot.
Thanks in advance.
Posted on 2018-06-09 05:25:35
dt[,parser_val:=sub(".*?[._](.*)[._].*|.*","\\1",f1)]
dt
f1 parser_val
1: stuffstuff-0000097125
2: stuffstuff.abc.0006496679 abc
3: stuffstuff0007517235
4: stuffstuff_xyz.0007280719 xyz
5: stuffstuff0005995303
6: stuffstuff_a1b_0000143856 a1b
7: stuffstuff0009362407
8: stuffstuff.c44_0009735298 c44
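Since an explanation was requested, here is a sketch of how the sub() pattern above can be read, run on a plain character vector rather than on the data.table. The names x and res are illustrative only. sub() always returns one string per input, so nothing gets dropped or recycled.

```r
x <- c("stuffstuff-0000097125",
       "stuffstuff.abc.0006496679",
       "stuffstuff_a1b_0000143856")

# First alternative:
#   .*?   lazy prefix, up to the first "." or "_"
#   [._]  opening delimiter
#   (.*)  capture group 1, the text to keep
#   [._]  closing delimiter
#   .*    the rest of the string
# Second alternative:
#   .*    matches any string, leaving group 1 empty
# Replacing the whole match with "\\1" keeps only group 1.
res <- sub(".*?[._](.*)[._].*|.*", "\\1", x)
res
```

Note that this pattern treats only "." and "_" as delimiters and does not require the captured part to be exactly three characters, so it is looser than the pattern in the question.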
If you want to use regmatches, you can use pattern="(?<=[._]).*(?=[._])|$" together with perl=TRUE:
dt[, parser_val := regmatches(f1, regexpr("(?<=[._]).*(?=[._])|$", f1, perl = TRUE))]
dt
f1 parser_val
1: stuffstuff-0000097125
2: stuffstuff.abc.0006496679 abc
3: stuffstuff0007517235
4: stuffstuff_xyz.0007280719 xyz
5: stuffstuff0005995303
6: stuffstuff_a1b_0000143856 a1b
7: stuffstuff0009362407
8: stuffstuff.c44_0009735298 c44
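The reason the "|$" alternative helps can be sketched in base R. When the lookaround part fails, "$" still produces a zero-length match at the end of the string, so regexpr never reports -1 and regmatches keeps one (possibly empty) element per input. The second half shows an alternative that is not part of the answer: a hypothetical helper, extract_tag, which keeps the stricter pattern from the question (rewritten with a character class, which is an assumption of equivalence on my part) and fills a pre-allocated vector only where a match exists.

```r
x <- c("stuffstuff-0000097125",
       "stuffstuff.abc.0006496679",
       "stuffstuff0007517235",
       "stuffstuff.c44_0009735298")

# "|$" is a fallback that always matches (with length 0) at the end
m <- regexpr("(?<=[._]).*(?=[._])|$", x, perl = TRUE)
with_fallback <- regmatches(x, m)
with_fallback

# Hypothetical helper: stricter pattern, explicit fill for non-matches
extract_tag <- function(x, fill = "") {
  pattern <- "(?<=[._-])[A-Za-z0-9]{3}(?=[._-][0-9]{3,})"
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.integer(m) != -1L        # TRUE where a match was found
  out <- rep(fill, length(x))        # one slot per input
  out[hit] <- regmatches(x, m)       # regmatches() returns matches only
  out
}
strict <- extract_tag(x)
strict
```

Because the helper returns one value per input, it could be applied inside the table as dt[, parsed_val := extract_tag(f1)].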
https://stackoverflow.com/questions/50768500