I have a dataset of numeric values, where each value represents a zone.
For example:
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
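I need to determine whether the data contain repeated subsequences, i.e. whether the subject repeatedly moves from zone 1 to zone 2 to zone 3. In the example above, 1,2,3 would give a value of 3. I don't know the subsequences in advance; I need R to find them for me.
I then need to count how many times each such subsequence occurs in the data.
I have only very basic knowledge of R, so please forgive my ignorance if this is a simple task!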
Answered 2018-08-13 01:14:14
Here is a way to find which subsequences of length n are repeated, and how many times each one occurs.
For n = 3:
library(tidyverse) # not necessary, see base version below
n <- 3
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% # get index of all subsequences
map_chr(~ paste(x[.], collapse = ',')) %>% # paste together as character
table %>% # get number of times each occurs
`[`(. > 1) # select sequences occurring > 1 time
# 1,2,3
# 3
For n = 2:
n <- 2
lapply(seq(0, length(x) - n), `+`, seq(n)) %>%
map_chr(~ paste(x[.], collapse = ',')) %>%
table %>%
`[`(. > 1)
# 1,2 2,3 5,9
# 3 3 2
Without tidyverse (base R only):
seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
tbl <- table(seqs.char)
tbl[tbl > 1]
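Since the question says the subsequence length is not known in advance, the base R steps above could be wrapped in a function and run over several lengths. The following is only a sketch under that assumption: the helper name repeated_subseqs and the range 2:4 are made up for illustration and are not part of the original answer.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# Hypothetical helper: counts of every length-n window that occurs more than once
repeated_subseqs <- function(x, n) {
  if (n < 1 || n > length(x)) return(integer(0))
  starts <- seq_len(length(x) - n + 1)            # start index of each window
  keys <- vapply(starts,
                 function(i) paste(x[i:(i + n - 1)], collapse = ","),
                 character(1))                    # one string key per window
  tbl <- table(keys)                              # occurrences of each key
  counts <- as.integer(tbl)
  names(counts) <- names(tbl)
  counts[counts > 1]                              # keep only repeated windows
}

# Try several lengths; 2:4 is an arbitrary range chosen for this example
lengths_to_try <- 2:4
res <- lapply(lengths_to_try, function(n) repeated_subseqs(x, n))
names(res) <- paste0("n=", lengths_to_try)
res
```

For the example vector this reports 1,2 and 2,3 (three times each) and 5,9 (twice) for n = 2, 1,2,3 (three times) for n = 3, and nothing for n = 4, in line with the results shown above.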
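I'll add a question of my own: does anyone know how to do this without converting to character first? For example, a function fun such that fun(list(1:2, 1:2, 2:3)) tells you that 1:2 occurs twice and 2:3 occurs once?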
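One possible direction for that follow-up, offered as a sketch rather than a definitive answer: keep each window as a numeric row and let aggregate() do the grouping, so the code never calls paste(). The helper name count_windows is hypothetical, and it takes the vector and the window length rather than a list of subsequences, so it is not a drop-in fun().

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# Hypothetical helper (not from the answer): count length-n windows as numeric rows
count_windows <- function(x, n) {
  # embed() puts one window per row, latest value first; reverse the columns
  windows <- embed(x, n)[, n:1, drop = FALSE]
  df <- as.data.frame(windows)   # columns V1..Vn hold the window values
  df$count <- 1L
  # group on all value columns and sum the 1s to get the occurrences
  res <- aggregate(count ~ ., data = df, FUN = sum)
  res[res$count > 1, , drop = FALSE]
}

count_windows(x, 3)
```

For n = 3 this returns a single row with V1 = 1, V2 = 2, V3 = 3 and count = 3. The values stay numeric in the result, which makes it easier to filter or join on them afterwards.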
Answered 2018-08-13 03:26:05
An alternative tidyverse approach, which builds one large data frame of results covering each subsequence length you want to investigate:
library(tidyverse)
# example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
# function that takes the number of consecutive elements in a subsequence (n)
# and returns a data frame of subsequences ordered by number of occurrences
# (note: it reads the vector x from the global environment)
f <- function(n) {
  data.frame(value = x) %>%                 # get the vector x
    slice(1:(nrow(.) - n + 1)) %>%          # drop trailing values that cannot start a full subsequence
    mutate(position = row_number()) %>%     # add position of each value
    rowwise() %>%                           # for each value/row
    mutate(vec = paste0(x[position:(position + n - 1)], collapse = ",")) %>% # create subsequences as a string
    ungroup() %>%                           # forget the grouping
    count(vec, sort = TRUE)                 # order by counts descending
}
2:5 %>% # specify how many values in your subsequences you want to investigate (let's say from 2 to 5)
  map_df(~ data.frame(NumElements = ., f(.))) %>% # apply your function and keep the number of values
  arrange(desc(n)) %>% # order by counts descending
  as_tibble() # (only for visualisation purposes; replaces the deprecated tbl_df())
# # A tibble: 88 x 3
# NumElements vec n
# <dbl> <chr> <int>
# 1 2 1,2 3
# 2 2 2,3 3
# 3 3 1,2,3 3
# 4 2 5,9 2
# 5 2 1,6 1
# 6 2 10,1 1
# 7 2 10,7 1
# 8 2 3,10 1
# 9 2 3,4 1
# 10 2 4,1 1
# # ... with 78 more rows
Answered 2018-08-14 21:36:54
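The approach below finds repeated sequences of any length k. The input vector is folded into a matrix with k columns, so that each row holds one subsequence. This is done k times, each time with 0:(k-1) NAs prepended, so that every starting offset is covered. Finally, all rows of these k matrices are counted (after pasting the elements of each row together):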
frs <- function(x, k=2){
padit <- function(.) c(.,rep(NA, k-length(.)%%k))
xx <- lapply(1:k, function(iii) padit(c(rep(NA,iii-1), x)))
xx <- do.call(rbind, lapply(xx, function(.) matrix(., ncol=k, byrow=TRUE)))
xx <- sapply(split(xx, 1:NROW(xx)), paste, collapse=",")
(function(x) x[x>1])(table(xx))
}
Output:
> frs(x,2)
xx
1,2 2,3 5,9
3 3 2
> frs(x,3)
1,2,3
3
> frs(x,4)
named integer(0)
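To see what the shifting and padding step inside frs() produces, it may help to look at the intermediate matrices for a tiny input. This is an illustrative sketch only: x_small is a made-up shortened vector, and pad mirrors the padit helper from the function above.

```r
x_small <- c(1, 6, 1, 2, 3)   # shortened vector, for illustration only
k <- 2

# Same rule as padit() in frs(): append NAs so the length becomes a multiple
# of k (a full extra row of NAs is added when it already is a multiple)
pad <- function(v) c(v, rep(NA, k - length(v) %% k))

# Prepend 0 and then 1 NA, pad, and fold each result into a k-column matrix
shifted <- lapply(1:k, function(i) pad(c(rep(NA, i - 1), x_small)))
mats <- lapply(shifted, function(v) matrix(v, ncol = k, byrow = TRUE))

mats[[1]]  # rows: (1,6), (1,2), (3,NA)
mats[[2]]  # rows: (NA,1), (6,1), (2,3), (NA,NA)
```

Between them, the two matrices contain every adjacent pair of x_small as a row, plus a few rows containing NA that come from the padding.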
https://stackoverflow.com/questions/51810821