I have a large data frame, df1, that looks like this:
Gene CB_1.1 CB_10.1 CB_10.2 CB_10.3
1 Gene1 10 0 0 0
2 Gene2 871 7 9 2
3 Gene3 490 2 5 8
4 Gene4 17 5 6 1
5 Gene5 75 1 1 1
6 Gene6 308 2 6 2
> dput(head(df1[,1:5]))
structure(list(X = c("Gene1", "Gene2", "Gene3",
"Gene4", "Gene5", "Gene6"), CB_1.1 = c(10L,
871L, 490L, 17L, 75L, 308L), CB_10.1 = c(0L, 7L, 2L, 5L, 1L,
2L), CB_10.2 = c(0L, 9L, 5L, 6L, 1L, 6L), CB_10.3 = c(0L, 2L,
8L, 1L, 1L, 2L)), row.names = c(NA, 6L), class = "data.frame")
A second data frame, df2, looks like this:
tissue_subcluster Class_2
1 CB_1.1 Neuron
2 CB_10.1 Neuron
3 CB_10.2 Non-Neuron
4 CB_10.3 Non-Neuron
> dput(head(df2[,c(7,9)]))
structure(list(tissue_subcluster = c("CB_1.1", "CB_10.1", "CB_10.2",
"CB_10.3", "CB_11.1", "CB_11.2"), Class_2 = c("Neuron", "Neuron",
"Non-Neuron", "Non-Neuron", "Non-Neuron", "Non-Neuron")), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
I want to average the values in df1 according to whether each column is classified as Neuron or Non-Neuron in df2, so that the result looks like this:
Gene Neuron_mean Non-Neuron_mean
1 Gene1 5 0
2 Gene2 439 5.5
3 Gene3 246 6.5
4 Gene4 11 3.5
5 Gene5 38 1
6 Gene6 155 4
How can I do this? Any help would be appreciated!
Posted on 2020-07-15 23:51:41
Using the reshape library:
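library(reshape)
# assumes the gene column of df1 is named Gene (it is X in the dput output)
# melt() stacks the sample columns into long format (Gene, variable, value)
out <- merge(melt(df1), df2, by.x = "variable", by.y = "tissue_subcluster")
# cast() spreads back to wide format, averaging value per Gene and Class_2
cast(out, Gene ~ Class_2, mean)
which gives: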
Gene Neuron Non-Neuron
1 Gene1 5 0.0
2 Gene2 439 5.5
3 Gene3 246 6.5
4 Gene4 11 3.5
5 Gene5 38 1.0
6 Gene6 155 4.0
Posted on 2020-07-16 01:46:50
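Here is an option in base R. Match the column names of 'df1' against the 'tissue_subcluster' column of 'df2' to get the corresponding 'Class_2' values, use those values to split 'df1' into a list of data.frames, loop over the list with sapply, and take the rowMeans of each element.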
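The steps described above can be unpacked one at a time. The following is only a sketch: it rebuilds the example data from the "Data" section of this answer, and the intermediate names (sample_cols, col_class, by_class, class_means, result) are illustrative and not part of the original answer.

```r
# Example data, equivalent to the structure() calls in the "Data" section
df1 <- data.frame(
  X = paste0("Gene", 1:6),
  CB_1.1 = c(10L, 871L, 490L, 17L, 75L, 308L),
  CB_10.1 = c(0L, 7L, 2L, 5L, 1L, 2L),
  CB_10.2 = c(0L, 9L, 5L, 6L, 1L, 6L),
  CB_10.3 = c(0L, 2L, 8L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  tissue_subcluster = c("CB_1.1", "CB_10.1", "CB_10.2",
                        "CB_10.3", "CB_11.1", "CB_11.2"),
  Class_2 = c("Neuron", "Neuron", "Non-Neuron",
              "Non-Neuron", "Non-Neuron", "Non-Neuron"),
  stringsAsFactors = FALSE
)

# Step 1: look up the class of each sample column of df1 in df2
sample_cols <- names(df1)[-1]
col_class <- df2$Class_2[match(sample_cols, df2$tissue_subcluster)]

# Step 2: split the columns (not the rows) of df1 into one data.frame per class
by_class <- split.default(df1[-1], col_class)

# Step 3: average across the columns of each piece, row by row
class_means <- sapply(by_class, rowMeans)

# Step 4: reattach the gene names; check.names = FALSE keeps "Non-Neuron"
# instead of letting data.frame() rewrite it to "Non.Neuron"
result <- data.frame(Gene = df1$X, class_means, check.names = FALSE)
print(result)
```

Rows of df2 with no matching column in df1 (here CB_11.1 and CB_11.2) are simply never looked up, so they do not affect the result. The compact form of the same steps is: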
data.frame(Gene = df1$X, sapply(split.default(df1[-1], with(df2,
Class_2[match(names(df1)[-1], tissue_subcluster)])), rowMeans))
# Gene Neuron Non.Neuron
#1 Gene1 5 0.0
#2 Gene2 439 5.5
#3 Gene3 246 6.5
#4 Gene4 11 3.5
#5 Gene5 38 1.0
#6 Gene6 155 4.0
Data
df1 <- structure(list(X = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6"), CB_1.1 = c(10L, 871L, 490L, 17L, 75L, 308L), CB_10.1 = c(0L,
7L, 2L, 5L, 1L, 2L), CB_10.2 = c(0L, 9L, 5L, 6L, 1L, 6L), CB_10.3 = c(0L,
2L, 8L, 1L, 1L, 2L)), row.names = c(NA, 6L), class = "data.frame")
df2 <- structure(list(tissue_subcluster = c("CB_1.1", "CB_10.1", "CB_10.2",
"CB_10.3", "CB_11.1", "CB_11.2"), Class_2 = c("Neuron", "Neuron",
"Non-Neuron", "Non-Neuron", "Non-Neuron", "Non-Neuron")), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
Posted on 2020-07-15 23:49:14
This may not be the best approach for large datasets, but you can use tidyr and dplyr:
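library(dplyr)
library(tidyr)
# assumes the gene column of df1 is named Gene (it is X in the dput output)
df1 %>%
  pivot_longer(cols = -Gene, names_to = "tissue_subcluster") %>%  # one row per gene and sample
  left_join(df2, by = "tissue_subcluster") %>%                    # attach Class_2 to each sample
  group_by(Gene, Class_2) %>%
  summarise(mean = mean(value)) %>%                               # mean per gene and class
  pivot_wider(names_from = "Class_2", names_glue = "{Class_2}_mean", values_from = "mean")
which returns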
# A tibble: 6 x 3
Gene Neuron_mean `Non-Neuron_mean`
<chr> <dbl> <dbl>
1 0610005C13Rik 5 0
2 0610007P14Rik 439 5.5
3 0610009B22Rik 246 6.5
4 0610009E02Rik 11 3.5
5 0610009L18Rik 38 1
6 0610009O20Rik 155 4
https://stackoverflow.com/questions/62918537