Question:
What is the best way in R to transform this data.table:
> input
   id value        node
1:  1   foo       node3
2:  2   bar   node[2,4]
3:  3   qux   node[2-4]
4:  4   foo node[1-2,4]
into something like this:
> output
   id value  node
1:  1   foo node3
2:  2   bar node2
3:  2   bar node4
4:  3   qux node2
5:  3   qux node3
6:  3   qux node4
7:  4   foo node1
8:  4   foo node2
9:  4   foo node4
Example input and output:
input <- data.table(id = c(1,2,3,4), value = c("foo", "bar", "qux", "foo"), node = c("node3","node[2,4]","node[2-4]","node[1-2,4]"))
output <- data.table(id = c(1,2,2,3,3,3,4,4,4), value = c("foo","bar","bar","qux","qux","qux","foo","foo","foo"), node = c("node3", "node2", "node4", "node2", "node3", "node4", "node1", "node2", "node4"))
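Background:
I am pulling job logs from a cluster of machines, and the logs look like the input above. id is the job id, value is a particular executable, and node is the machine (or machines) in the cluster that actually ran the job. The logs use a compressed format in the node column to record which machines a job ran on.
Using library(stringr), I wrote some ugly code that partially parses the node column. Perhaps it is a useful starting point: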
expand_node <- function(nodes)
{
  tokens <- str_match(nodes, "\\[([0-9,\\-]+)\\]")[ ,2]
  tokens <- str_replace_all(tokens, "\\-", ":")
  tokens <- paste0("c(",tokens,")")
  result <- lapply(tokens, function(expr) eval(parse(text = expr)))
  return(result)
}
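As a possible way to finish the partial parser above without eval(parse()), here is a minimal base R sketch. The helper name expand_node_spec is hypothetical (it is not from the original post), and it assumes that the node prefix contains no digits and that ranges are written low-high.

```r
# Hypothetical helper (not from the original post): expand one compressed
# node spec such as "node[1-2,4]" into c("node1", "node2", "node4").
# Assumes the prefix ("node") contains no digits.
expand_node_spec <- function(spec) {
  # Drop the square brackets, leaving e.g. "node1-2,4".
  cleaned <- gsub("[", "", spec, fixed = TRUE)
  cleaned <- gsub("]", "", cleaned, fixed = TRUE)

  # Everything before the first digit is the prefix; the rest is the list.
  prefix <- sub("[0-9].*$", "", cleaned)
  body <- sub("^[^0-9]*", "", cleaned)

  # Split on commas, then expand each "a-b" range (a single number "a"
  # is treated as the range a-a).
  parts <- strsplit(body, ",", fixed = TRUE)[[1]]
  numbers <- unlist(lapply(parts, function(part) {
    bounds <- as.integer(strsplit(part, "-", fixed = TRUE)[[1]])
    seq(bounds[1], bounds[length(bounds)])
  }))

  paste0(prefix, numbers)
}

expand_node_spec("node[1-2,4]")
```

With data.table loaded, something along the lines of input[, .(node = unlist(lapply(node, expand_node_spec))), by = .(id, value)] should then produce the long format, although that call is a suggestion under the same assumptions rather than code from the thread.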
Answered 2016-08-16 22:26:34
Here is a data.table option you can try; using a regular expression saves one step:
input[, .(node = unlist(lapply(sub("node\\[?([0-9,:]+)\\]?", "c(\\1)", gsub("-", ":", node)),
function(expr) paste("node", eval(parse(text = expr)), sep = "")))), .(id, value)]
#   id value  node
#1:  1   foo node3
#2:  2   bar node2
#3:  2   bar node4
#4:  3   qux node2
#5:  3   qux node3
#6:  3   qux node4
#7:  4   foo node1
#8:  4   foo node2
#9:  4   foo node4
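To make the regular expression step in this answer easier to follow, the sketch below runs the same gsub/sub calls on the example node values outside of data.table and keeps the intermediate strings. The variable names with_colons, exprs, and expanded are introduced here for illustration only.

```r
# The node column from the example input.
node <- c("node3", "node[2,4]", "node[2-4]", "node[1-2,4]")

# Step 1: turn range hyphens into R's colon operator.
with_colons <- gsub("-", ":", node)

# Step 2: strip "node" and the optional brackets, wrapping the rest in c().
exprs <- sub("node\\[?([0-9,:]+)\\]?", "c(\\1)", with_colons)
print(exprs)
# [1] "c(3)"     "c(2,4)"   "c(2:4)"   "c(1:2,4)"

# Step 3: evaluate each expression and put the "node" prefix back.
expanded <- lapply(exprs, function(expr) {
  paste0("node", eval(parse(text = expr)))
})
print(expanded[[4]])
# [1] "node1" "node2" "node4"
```

Because eval(parse(text = ...)) executes whatever text it is given, this approach is probably best kept to log files whose contents are trusted.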
Answered 2016-08-16 22:38:51
Here is an option that uses cSplit after modifying the node column:
library(stringr)
library(splitstackshape)
library(gsubfn)
input[, node := lapply(str_extract_all(gsubfn("(\\d+)-(\\d+)",
~seq(as.numeric(x), as.numeric(y), by = 1), node), "[0-9]+"),
function(x) paste0("node", x, collapse=","))]
cSplit(input, "node", ",", "long")
#   id value  node
#1:  1   foo node3
#2:  2   bar node2
#3:  2   bar node4
#4:  3   qux node2
#5:  3   qux node3
#6:  3   qux node4
#7:  4   foo node1
#8:  4   foo node2
#9:  4   foo node4
https://stackoverflow.com/questions/38976564