PERMANOVA原理解释:这个统计检验可用于判断PCA/PCoA等的分群效果是否显著!
配对检验:画一个带统计检验的PcOA分析结果 (再进一步,配对比较)
你的adonis用对了吗?不同因素的顺序竟然对结果有很大影响
非参数检验也不是什么都不需要关注,比如上面提到的因素顺序
和方差加和
方式是一个需要注意的点。除此之外,非参数多元方差分析应用时还有下面这些注意事项:
PERMANOVA
检验没有考虑变量之间的共线性关系,因此也不能够用于探索这种关系。Nested or hierarchical designs
)时需要考虑合适的置换策略。
需要明确哪些样品是真正可以交换的 (exchangeable
)。PERMANOVA
有个假设是balanced designs
(不同分组的样本数相等), 非平衡设计也能处理。前面我们用下面的代码检验了Managment
对物种组成差异影响的显著程度,获得P-value=0.002 < 0.05
,表示管理方式对物种组成有显著影响。但这一影响是否受到每个分组里面数据离散程度的影响呢?
library(vegan)
data(dune)
data(dune.env)
# 基于bray-curtis距离进行计算
set.seed(1)
dune.div <- adonis2(dune ~ Management, data = dune.env, permutations = 999, method="bray")
dune.div
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = dune ~ Management, data = dune.env, permutations = 999, method = "bray")
## Df SumOfSqs R2 F Pr(>F)
## Management 3 1.4686 0.34161 2.7672 0.002 **
## Residual 16 2.8304 0.65839
## Total 19 4.2990 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
我们还需要利用betadisper
评估下每组样本物种组成的多元一致性 (Multivariate homogeneity of groups dispersions (variances)
)。如下代码计算出P=0.168
表示不同分组样品检测指标的离散度(方差)没有显著差异。那么,adonis
检测出的差异就是因为每组数据在空间的中心点不同造成的,进一步说明Management
对物种组成有显著影响。
# 计算加权bray-curtis距离
dune.dist <- vegdist(dune, method="bray", binary=F)
# One measure of multivariate dispersion (variance) for a group of samples
# is to calculate the average distance of group members to the group centroid
# or spatial median in multivariate space.
# To test if the dispersions (variances) of one or more groups are different,
# the distances of group members to the group centroid are subject to ANOVA.
# This is a multivariate analogue of Levene's test for homogeneity of variances
# if the distances between group members and
# group centroids is the Euclidean distance.
dispersion <- betadisper(dune.dist, group=dune.env$Management)
permutest(dispersion)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 3 0.13831 0.046104 1.9506 999 0.159
## Residuals 16 0.37816 0.023635
从下面的图上也可以看出,4种管理方式下样品在空间的中心点相距较远。(也可以参考前面如何美化这个图)
plot(dispersion, hull=FALSE, ellipse=TRUE) ##sd ellipse
Q: When running adonis (vegan package) I got an
r2 = 0.45
, andp = 0.001
. When I ran the betadisper and ran a subsequent permutation test I got anF = 1
andp = 0.3
. A: A non-significant result inbetadisper
is not necessarily related to a significant/non-significant result inadonis
. The two tests are testing different hypothesis. The former testshomogeneity of dispersion
among groups (regions in your case), which is a condition (assumption) for adonis. The latter tests no difference in ‘location
’, that is, tests whethercomposition among groups
is similar or not. You may have thecentroids
of two groups in NMS at a very similar position in the ordination space, but if theirdispersions
are quite different, adonis will give you a significant p-value, thus, the result is heavily influenced not by thedifference in composition between groups
but bydifferences in composition within groups
(heterogeneous dispersion, and thus a measure of beta diversity). In short, your results are fine, you are meeting the ‘one assumption’ for adonis (homogeneous dispersion) and thus you are certain that results from adonis are ‘real’ and not an artifact of heterogeneous dispersions. For more information you can read Anderson (2006) Biometrics 62(1):245-253 and Anderson (2006) Ecology Letters 9(6):683-693. Hope this helps! https://stats.stackexchange.com/questions/212137/betadisper-and-adonis-in-r-am-i-interpreting-my-output-correctly
下面我们看一个模拟的例子,模拟出3套群体的物种丰度表,群体1、群体2、群体3的物种空间的中心点一致,而物种丰度的离散度依次变小,PERMANOVA
检验显著,betadisper
结果也显著,这时解释数据时就要小心。这个导致显著的原因是什么。
set.seed(1)
num <- 30
# 控制每个物种的均值
mean <- seq(10,120,by=10)
# 控制离散度
disp <- c(5,50,200)
# 模拟3组样品的数据;直接是转置后的物种丰度表
sites.a <- as.data.frame(mapply(rnbinom, n=num, size=disp[1], mu=mean))
rownames(sites.a) <- paste('site.a', 1:num, sep=".")
colnames(sites.a) <- paste('Species',letters[1:length(mean)], sep=".")
sites.b <- as.data.frame(mapply(rnbinom, n=num, size=disp[1:2], mu=mean))
rownames(sites.b) <- paste('site.b', 1:num, sep=".")
colnames(sites.b) <- paste('Species',letters[1:length(mean)], sep=".")
sites.c <- as.data.frame(mapply(rnbinom, n=num, size=disp, mu=mean))
rownames(sites.c) <- paste('site.c', 1:num, sep=".")
colnames(sites.c) <- paste('Species',letters[1:length(mean)], sep=".")
otu_table_t <- rbind(sites.a,sites.b,sites.c)
otu_table_t[sample(1:90,5),]
## Species.a Species.b Species.c Species.d Species.e Species.f Species.g Species.h Species.i Species.j
## site.c.22 13 15 43 29 49 72 24 102 75 96
## site.a.26 8 23 46 29 25 15 91 49 58 54
## site.a.13 14 30 47 56 18 77 111 128 90 53
## site.a.14 5 15 17 56 37 75 81 59 63 58
## site.b.21 15 24 8 33 28 42 108 74 76 64
## Species.k Species.l
## site.c.22 139 142
## site.a.26 87 129
## site.a.13 33 47
## site.a.14 164 183
## site.b.21 52 103
生成Metadata数据,包含样品的分组信息。目的就是检验不同组的物种构成是否有显著差异。
metadata <- data.frame(Sample=rownames(otu_table_t), Group=rep(c("A","B","C"), each=num))
rownames(metadata) <- metadata$Sample
metadata[sample(1:90,5),,drop=F]
## Sample Group
## site.a.28 site.a.28 A
## site.b.12 site.b.12 B
## site.a.20 site.a.20 A
## site.b.10 site.b.10 B
## site.a.10 site.a.10 A
统计分析前,先直观的看一下不同组样本在物种定义的空间上的分布。
为什么要画个图:参考 - 什么是安斯库姆四重奏?为什么统计分析之前必须要作图?
# 计算加权bray-curtis距离
otu_dist <- vegdist(otu_table_t, method="bray", binary=F)
otu_pcoa <- cmdscale(otu_dist, k=3, eig=T)
otu_pcoa_points <- as.data.frame(otu_pcoa$points)
sum_eig <- sum(otu_pcoa$eig)
eig_percent <- round(otu_pcoa$eig/sum_eig*100,1)
colnames(otu_pcoa_points) <- paste0("PCoA", 1:3)
otu_pcoa_result <- cbind(otu_pcoa_points, metadata)
从PCoA的结果上来看,A,B,C三个组在第一、第二、第三主坐标轴没有明显的区分开。
library(ggplot2)
library(patchwork)
ggplot(otu_pcoa_result, aes(x=PCoA1, y=PCoA2, color=Group)) +
labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
y=paste("PCoA 2 (", eig_percent[2], "%)", sep="")) +
geom_point(size=4) + stat_ellipse(level=0.9) +
theme_classic() + coord_fixed() +
ggplot(otu_pcoa_result, aes(x=PCoA1, y=PCoA3, color=Group)) +
labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
y=paste("PCoA 3 (", eig_percent[3], "%)", sep="")) +
geom_point(size=4) + stat_ellipse(level=0.9) +
theme_classic() + coord_fixed()
从NMDS结果看,A,B,C三组也区分不开。
otu_mds <- metaMDS(otu_table_t, k=5) #using all the defaults
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1131245
## Run 1 stress 0.1131233
## ... New best solution
## ... Procrustes: rmse 0.0003155417 max resid 0.001341899
## ... Similar to previous best
## Run 2 stress 0.1131243
## ... Procrustes: rmse 0.0009154324 max resid 0.00352237
## ... Similar to previous best
## Run 3 stress 0.1131238
## ... Procrustes: rmse 0.0002307456 max resid 0.001378836
## ... Similar to previous best
## Run 4 stress 0.1131239
## ... Procrustes: rmse 0.0002008885 max resid 0.0008441584
## ... Similar to previous best
## Run 5 stress 0.1131233
## ... Procrustes: rmse 0.0004594988 max resid 0.00248363
## ... Similar to previous best
## Run 6 stress 0.1136538
## Run 7 stress 0.1131231
## ... New best solution
## ... Procrustes: rmse 6.187922e-05 max resid 0.0002788433
## ... Similar to previous best
## Run 8 stress 0.1131234
## ... Procrustes: rmse 0.000457399 max resid 0.002017475
## ... Similar to previous best
## Run 9 stress 0.1131243
## ... Procrustes: rmse 0.0003620819 max resid 0.001329571
## ... Similar to previous best
## Run 10 stress 0.1131235
## ... Procrustes: rmse 0.0001788438 max resid 0.0008840311
## ... Similar to previous best
## Run 11 stress 0.1131248
## ... Procrustes: rmse 0.0004674201 max resid 0.001960981
## ... Similar to previous best
## Run 12 stress 0.1131231
## ... New best solution
## ... Procrustes: rmse 0.0003807188 max resid 0.001578129
## ... Similar to previous best
## Run 13 stress 0.1131238
## ... Procrustes: rmse 0.0004016239 max resid 0.002178598
## ... Similar to previous best
## Run 14 stress 0.113123
## ... New best solution
## ... Procrustes: rmse 0.0001931854 max resid 0.0007886561
## ... Similar to previous best
## Run 15 stress 0.1176584
## Run 16 stress 0.1131244
## ... Procrustes: rmse 0.000621146 max resid 0.002339344
## ... Similar to previous best
## Run 17 stress 0.1131237
## ... Procrustes: rmse 0.0004553297 max resid 0.0019548
## ... Similar to previous best
## Run 18 stress 0.1131236
## ... Procrustes: rmse 0.000454603 max resid 0.001894929
## ... Similar to previous best
## Run 19 stress 0.1131241
## ... Procrustes: rmse 0.0005855289 max resid 0.002455173
## ... Similar to previous best
## Run 20 stress 0.113124
## ... Procrustes: rmse 0.0005247607 max resid 0.001899271
## ... Similar to previous best
## *** Solution reached
otu_mds_scores <- as.data.frame(scores(otu_mds))
otu_mds_scores <- cbind(otu_mds_scores, metadata)
library(ggplot2)
ggplot(data=otu_mds_scores, aes(x=NMDS1,y=NMDS2,colour=Group)) +
geom_point(size=4) +
stat_ellipse(level = 0.9) +
theme_classic()
adonis
结果显示Pr(>F)
<0.05,统计显著;不同组之间的物种组成存在显著差异。这与PCoA
和NMDS
的结果还是有些不一致的。那这个统计差异是怎么来的呢?
library(vegan)
adon.results<-adonis(otu_dist ~ Group, data=metadata, perm=999)
print(adon.results)
##
## Call:
## adonis(formula = otu_dist ~ Group, data = metadata, permutations = 999)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## Group 2 0.10752 0.053760 2.4707 0.05375 0.001 ***
## Residuals 87 1.89300 0.021759 0.94625
## Total 89 2.00052 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
betadisper
检验Pr(>F)<0.05
表明不同组的数据在空间分布的离散度显著不同。这是导致adonis
结果显著的主要原因。不同分组之间物种的构成的显著不同不是体现在物种空间中心点的变化,而是物种空间离散度的变化。
mod <- betadisper(otu_dist, metadata$Group)
permutest(mod)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 2 0.157498 0.078749 80.188 999 0.001 ***
## Residuals 87 0.085439 0.000982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
把每组样本抽提出来,分别绘制下PCoA的样品分布,可以看出,每组样品在PCoA定义的空间上中心点是很相近的,而样品分散程度不同。也就是说分组内样品的多样性反应到了不同分组的物种构成差异上了,这个“显著”的差异是不是我们关注的,需要自己来判断了。
# extract the centroids and the site points in multivariate space.
centroids<-data.frame(grps=rownames(mod$centroids),data.frame(mod$centroids))
vectors<-data.frame(group=mod$group,data.frame(mod$vectors))
# to create the lines from the centroids to each point we will put it in a format that ggplot can handle
seg.data<-cbind(vectors[,1:3],centroids[rep(1:nrow(centroids),as.data.frame(table(vectors$group))$Freq),2:3])
names(seg.data)<-c("group","v.PCoA1","v.PCoA2","PCoA1","PCoA2")
# create the convex hulls of the outermost points
grp1.hull<-seg.data[seg.data$group=="A",1:3][chull(seg.data[seg.data$group=="A",2:3]),]
grp2.hull<-seg.data[seg.data$group=="B",1:3][chull(seg.data[seg.data$group=="B",2:3]),]
grp3.hull<-seg.data[seg.data$group=="C",1:3][chull(seg.data[seg.data$group=="C",2:3]),]
all.hull<-rbind(grp1.hull,grp2.hull,grp3.hull)
library(gridExtra)
panel.a<-ggplot() +
geom_polygon(data=all.hull[all.hull=="A",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
geom_segment(data=seg.data[1:30,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) +
geom_point(data=centroids[1,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=16) +
geom_point(data=seg.data[1:30,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=16) +
labs(title="A",x="",y="") +
coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
theme_classic() +
theme(legend.position="none")
panel.b<-ggplot() +
geom_polygon(data=all.hull[all.hull=="B",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
geom_segment(data=seg.data[31:60,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) +
geom_point(data=centroids[2,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=17) +
geom_point(data=seg.data[31:60,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=17) +
labs(title="B",x="",y="") +
coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
theme_classic() +
theme(legend.position="none")
panel.c<-ggplot() +
geom_polygon(data=all.hull[all.hull=="C",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
geom_segment(data=seg.data[61:90,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) +
geom_point(data=centroids[3,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=15) +
geom_point(data=seg.data[61:90,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=15) +
labs(title="C",x="",y="") +
coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
theme_classic() +
theme(legend.position="none")
panel.d<-ggplot() +
geom_polygon(data=all.hull,aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
geom_segment(data=seg.data,aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) +
geom_point(data=centroids[,1:3], aes(x=PCoA1,y=PCoA2,shape=grps),size=4,colour="red") +
geom_point(data=seg.data, aes(x=v.PCoA1,y=v.PCoA2,shape=group),size=2) +
labs(title="All",x="",y="") +
coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
theme_classic() +
theme(legend.position="none")
grid.arrange(panel.a,panel.b,panel.c,panel.d,nrow=1)
PERMANOVA
的作者对这个问题的看法Marti Anderson: “[…] Although there is also no explicit assumption regarding the homogeneity of spread within each group, PERMANOVA, like ANOSIM (Clarke 1993), will be sensitive to differences in spread (variability) among groups. Thus, if a significant difference between groups is detected using PERMANOVA, then this could be due to differences in
location
,differences in spread
, or acombination
of the two. Perhaps the best approach is to perform a separate test for homogeneity (e.g., using the program PERMDISP) including pair-wise comparisons, as well as examining the average within and between-group distances and associated MDS plots. This will help to determine the nature of the difference between any pair of groups, whether it be due to location, spread, or a combination of the two. […]”