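I currently have a dataframe that states which gene clusters occur in which genomes. It is stored as a well-formed tab-delimited file and looks essentially like the following (example):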
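Gene Cluster    Genome
------------------------------------------
GCF3372         Streptomyces_hygroscopicus
GCF3450         Streptomyces_sp_Hm1069
GCF3371         Streptomyces_sp_MBT13
GCF3371         Streptomyces_xiamenensis

Based on this, I would like to create a presence/absence table (a contingency table) whose values are 0 or 1, depending on whether a specific gene cluster is absent from or present in a genome. The idea is to measure how often specific gene clusters occur across genomes, so I need a presence/absence matrix on which I can run statistical analyses.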
x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'),
                strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069',
                           'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
dput(head(x[, c(1,2)]))

Posted on 2020-02-03 11:03:51
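Here is one way to compute a contingency table from two categorical variables. For illustration I will use sex and height (which seem structurally similar to the two variables in your dataframe x):

Data: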
set.seed(300)
df <- data.frame(
  Height = sample(c("tall", "very tall", "small", "very small"), 20, replace = TRUE),
  Sex = sample(c("m", "f"), 20, replace = TRUE)
)
df
       Height Sex
1   very tall   f
2   very tall   m
3   very tall   m
4        tall   f
5  very small   m
6        tall   f
7        tall   m
8  very small   f
9       small   f
10       tall   m
11 very small   f
12       tall   m
13 very small   m
14      small   f
15 very small   m
16      small   m
17 very small   m
18 very small   m
19       tall   f
20       tall   m

First, as already pointed out in the comments, tabulate the data with table:
tbl <- table(df$Sex, df$Height); tbl
    small tall very small very tall
  f     2    3          2         1
  m     1    4          5         2

Then you can define the first row of tbl as a new vector female and the second row as male.
female <- tbl[1,]
male <- tbl[2,]

Finally, bind these two rows together into the matrix counts, which is your contingency table:
counts <- rbind(female, male)
counts
       small tall very small very tall
female     2    3          2         1
male       1    4          5         2

With the contingency table in hand you can run your test, most likely a chi-squared test:
test <- chisq.test(counts); test
	Pearson's Chi-squared test

data:  counts
X-squared = 1.3492, df = 3, p-value = 0.7175

https://stackoverflow.com/questions/60008606
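
Applied to the dataframe x from the question, the same table() approach might look like the sketch below. It assumes that genomes should be the rows and gene clusters the columns, and that a cluster listed more than once for a genome should still count as 1. The names counts_gc and pa are chosen here for illustration only and do not come from the original post.

```r
# Example data copied from the question
x <- data.frame(
  gc = c("GCF3372", "GCF3450", "GCF3371", "GCF3371", "GCF3371"),
  strain = c("Streptomyces_hygroscopicus", "Streptomyces_sp_Hm1069",
             "Streptomyces_sp_MBT13", "Streptomyces_xiamenensis",
             "Streptomyces_hygroscopicus")
)

# Cross-tabulate genomes (rows) against gene clusters (columns);
# each cell holds how many times that pair occurs in x
counts_gc <- table(x$strain, x$gc)

# Collapse the counts to 0/1 so any occurrence is recorded as "present"
# (assumption: duplicate rows should not raise the value above 1)
pa <- (counts_gc > 0) * 1L
pa
```

In this sketch the row for Streptomyces_hygroscopicus has a 1 under both GCF3371 and GCF3372 and a 0 under GCF3450, since that genome is listed with the first two clusters only.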