欢迎大家关注全网生信学习者系列:
微生物数据具有一下的特点,这使得在做差异分析的时候需要考虑到更多的问题,
现在 Nearing, Douglas et al. Nature Comm. Microbiome differential abundance methods produce different results across 38 datasets.文章对常用的差异分析方法做了基准测试,本文将不同方法的核心代码记录下来。
作者这项工作的目标是比较一系列常见的差异分析(DA)方法在16S rRNA基因数据集上的表现。因此,在这项工作的大部分中,作者并不确切知道正确答案是什么:作者主要只能说哪些工具在性能上更相似于其他工具。然而,还包含了几项分析,以帮助评估这些工具在不同环境下来自同一数据集的假阳性率和一致性。 简单地说,该工作试图评估不同的微生物组差异丰度分析方法在多个数据集上的表现,并比较它们之间的相似性和一致性,同时尝试评估这些工具在不同数据集上产生假阳性结果的频率。
即使在同一环境中,不同样本的微生物出现概率或者丰度都是不一样的,大部分微生物丰度极低。又因为在测序仪的检测极限下,微生物丰度(相对或绝对丰度)为0的概率又极大增加了。除此之外,比对所使用的数据库大小也即是覆盖物种率也会对最终的微生物丰度表达谱有较大的影响。最后我们所获得的微生物丰度谱必然含有大量的零值,它有两种情况,一种是真实的零值,另一种是误差导致的零值。很多算法会针对这两个特性构建不同的处理零值策略。
零值数量的大小构成了微生物丰度谱稀疏性。在某次16s数据的OTU水平中,零值比例高达80%以上。Sparsity属性导致常用的数据分析方法如t-test/wilcox-test假设检验方法均不适合。为了解决sparsity对分析的影响,很多R包的方法如ANCOM的Zero划分,metagenomeSeq的ZIP/ZILN对Zero进行处理,处理后的矩阵再做如CLR等变换,CLR变换又是为了处理微生物数据另一个特点compositional (下一部分讲)。最后转换后的数据会服从常见的分布,也即是可以使用常见的如Wilcox/t-test之类(两分组)的方法做假设检验,需要说明的是ANCOM还会根据物种在样本内的显著性的差异比例区分差异物种,这也是为何ANCOM的稳健性的原因。
Compositional的数据特性是服从simplex空间,简而言之是指:某个样本内所有微生物的加和是一个常数(可以是1也可以是10,100)。符合该属性的数据内部元素之间存在着相关关系,即某个元素的比例发生波动,必然引起其他元素比例的波动,但在实际的微生物环境中,这种关联关系可能是不存在的。为了解决compositional的问题,有人提出了使用各种normalization方法(比如上文提到的CLR:$X{i}=log(\frac{x{i}}{GeametricMean(X)})$,我暂时只熟悉这个方法)。
Compositional数据不服从欧式空间分布,在使用log-ratio transformation后,数据可以一一对应到真实的多维变量的空间,方便后续应用标准分析方法。
Overdispersion的条件是 Variance >> mean,也就是说数据的方差要远远大于均值。常用的适合count matrix的Poisson分布是无法处理这样的数据的,因此现在很多方法都是用负二项分布去拟合数据。
使用一张自己讲过的PPT总结一下。
不同的差异分析方法识别到差异微生物可能会存在较大的区别,这是因为这些方法的原理是不一样的,但从微生物的数据特点而言,方法需要符合微生物数据特性。
ALDEx2(ANOVA-Like Differential Expression 2)是一种用于微生物组数据差异分析的方法,它特别适用于处理组成数据(compositional data),这类数据的特点是在每个样本中各部分的总和为一个固定值,例如微生物群落中各物种的相对丰度之和为1。ALDEx2方法的核心原理包括以下几个步骤:
#### Script to Run ALDEX2 differential abundance
deps = c("ALDEx2")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ALDEx2")
}
library(dep, character.only = TRUE)
}
library(ALDEx2)
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
results <- aldex(reads = ASV_table,
conditions = groupings[, 1],
mc.samples = 128,
test = "t",
effect = TRUE,
include.sample.summary = FALSE,
verbose = T,
denom = "all")
write.table(results, file=args[3], quote=FALSE, sep='\t', col.names = NA)
message("Results table saved to ", args[3])
ANCOM(Analysis of Composition of Microbiomes)是一种用于分析微生物组数据的统计方法,专门设计来识别和比较不同样本或处理组之间的微生物组成差异。其核心原理包括以下几个步骤:
ANCOM的优点包括能够处理稀疏数据、保持较低的误报率以及对异常值具有鲁棒性。然而,它也存在一些限制,例如对数据的分布假设敏感,对样本数目和特征维度的要求较高
deps = c("exactRankTests", "nlme", "dplyr", "ggplot2", "compositions")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
install.packages(dep)
}
library(dep, character.only = TRUE)
}
#args[4] will contain path for the ancom code
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 3) {
stop("At least three arguments must be supplied", call.=FALSE)
}
source(args[[4]])
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
groupings$Sample <- rownames(groupings)
prepro <- feature_table_pre_process(feature_table = ASV_table,
meta_data = groupings,
sample_var = 'Sample',
group_var = NULL,
out_cut = 0.05,
zero_cut = 0.90,
lib_cut = 1000,
neg_lb=FALSE)
feature_table <- prepro$feature_table
metadata <- prepro$meta_data
struc_zero <- prepro$structure_zeros
#run ancom
main_var <- colnames(groupings)[1]
p_adj_method = "BH"
alpha=0.05
adj_formula=NULL
rand_formula=NULL
res <- ANCOM(feature_table = feature_table,
meta_data = metadata,
struc_zero = struc_zero,
main_var = main_var,
p_adj_method = p_adj_method,
alpha=alpha,
adj_formula = adj_formula,
rand_formula = rand_formula)
write.table(res$out, file=args[3], quote=FALSE, sep="\t", col.names = NA)
Corncob 是一种用于微生物组数据分析的R包,它专门用于对微生物相对丰度进行建模并测试协变量对相对丰度的影响。其核心原理包括以下几个方面:
#### Run Corncob
library(corncob)
library(phyloseq)
#install corncob if its not installed.
deps = c("corncob")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if(dep=="corncob"){
devtools::install_github("bryandmartin/corncob")
}
else
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
#run corncob
#put data into phyloseq object.
colnames(groupings)
colnames(groupings)[1] <- "places"
OTU <- phyloseq::otu_table(ASV_table, taxa_are_rows = T)
sampledata <- phyloseq::sample_data(groupings, errorIfNULL = T)
phylo <- phyloseq::merge_phyloseq(OTU, sampledata)
my_formula <- as.formula(paste("~","places",sep=" ", collapse = ""))
my_formula
results <- corncob::differentialTest(formula= my_formula,
phi.formula = my_formula,
phi.formula_null = my_formula,
formula_null = ~ 1,
test="Wald", data=phylo,
boot=F,
fdr_cutoff = 0.05)
write.table(results$p_fdr, file=args[[3]], sep="\t", col.names = NA, quote=F)
write.table(results$p, file=paste0(args[[3]], "_uncor", sep=""), sep="\t", col.names = NA, quote=F)
DESeq2是一种用于微生物组数据差异分析的统计方法,特别适用于处理计数数据,如RNA-seq数据或微生物组的OTU(操作分类单元)计数。DESeq2的核心原理包括以下几个关键步骤:
#Run_DeSeq2
deps = c("DESeq2")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
}
library(dep, character.only = TRUE)
}
library(DESeq2)
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
colnames(groupings)[1] <- "Groupings"
#Run Deseq2
dds <- DESeq2::DESeqDataSetFromMatrix(countData = ASV_table,
colData=groupings,
design = ~ Groupings)
dds_res <- DESeq2::DESeq(dds, sfType = "poscounts")
res <- results(dds_res, tidy=T, format="DataFrame")
rownames(res) <- res$row
res <- res[,-1]
write.table(res, file=args[3], quote=FALSE, sep="\t", col.names = NA)
message("Results written to ", args[3])
EdgeR(Empirical Analysis of Digital Gene Expression in R)是一种用于分析计数数据的统计方法,特别适用于微生物组学、转录组学和其他高通量测序技术产生的数据。EdgeR的核心原理包括以下几个关键步骤:
deps = c("edgeR", "phyloseq")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(deps)
}
library(dep, character.only = TRUE)
}
### Taken from phyloseq authors at: https://joey711.github.io/phyloseq-extensions/edgeR.html
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
require("edgeR")
require("phyloseq")
# Enforce orientation.
if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
x = as(otu_table(physeq), "matrix")
# Add one to protect against overflow, log(0) issues.
x = x + 1
# Check `group` argument
if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
# Assume that group was a sample variable name (must be categorical)
group = get_variable(physeq, group)
}
# Define gene annotations (`genes`) as tax_table
taxonomy = tax_table(physeq, errorIfNULL=FALSE)
if( !is.null(taxonomy) ){
taxonomy = data.frame(as(taxonomy, "matrix"))
}
# Now turn into a DGEList
y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
# Calculate the normalization factors
z = calcNormFactors(y, method=method)
# Check for division by zero inside `calcNormFactors`
if( !all(is.finite(z$samples$norm.factors)) ){
stop("Something wrong with edgeR::calcNormFactors on this data,
non-finite $norm.factors, consider changing `method` argument")
}
# Estimate dispersions
return(estimateTagwiseDisp(estimateCommonDisp(z)))
}
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
OTU <- phyloseq::otu_table(ASV_table, taxa_are_rows = T)
sampledata <- phyloseq::sample_data(groupings, errorIfNULL = T)
phylo <- phyloseq::merge_phyloseq(OTU, sampledata)
test <- phyloseq_to_edgeR(physeq = phylo, group=colnames(groupings)[1])
et = exactTest(test)
tt = topTags(et, n=nrow(test$table), adjust.method="fdr", sort.by="PValue")
res <- tt@.Data[[1]]
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
Limma-Voom-TMM(Trimmed Mean of M-values)是一种用于微生物组数据差异分析的方法,它结合了limma、voom和TMM(Trimmed Mean of M-values)技术的优势,以处理和分析来自高通量测序技术(如RNA-seq)的数据。下面是Limma-Voom-TMM方法的基本原理:
deps = c("edgeR")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(deps)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
DGE_LIST <- DGEList(ASV_table)
### do normalization
### Reference sample will be the sample with the highest read depth
### check if upper quartile method works for selecting reference
Upper_Quartile_norm_test <- calcNormFactors(DGE_LIST, method="upperquartile")
summary_upper_quartile <- summary(Upper_Quartile_norm_test$samples$norm.factors)[3]
if(is.na(summary_upper_quartile) | is.infinite(summary_upper_quartile)){
message("Upper Quartile reference selection failed will use find sample with largest sqrt(read_depth) to use as reference")
Ref_col <- which.max(colSums(sqrt(ASV_table)))
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method = "TMM", refColumn = Ref_col)
fileConn<-file(args[[4]])
writeLines(c("Used max square root read depth to determine reference sample"), fileConn)
close(fileConn)
}else{
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method="TMM")
}
## make matrix for testing
colnames(groupings) <- c("comparison")
mm <- model.matrix(~comparison, groupings)
voomvoom <- voom(DGE_LIST_Norm, mm, plot=F)
fit <- lmFit(voomvoom,mm)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, n=nrow(DGE_LIST_Norm), sort.by="none")
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
Limma-Voom-TMMwsp(Trimmed Mean of M-values with Singleton Pairing)是一种用于处理和分析微生物组数据的差异分析方法,它结合了几种不同的统计技术来提高分析的准确性和可靠性。下面是Limma-Voom-TMMwsp方法的基本原理:
deps = c("edgeR")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(deps)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
DGE_LIST <- DGEList(ASV_table)
### do normalization
### Reference sample will be the sample with the highest read depth
### check if upper quartile method works for selecting reference
Upper_Quartile_norm_test <- calcNormFactors(DGE_LIST, method="upperquartile")
summary_upper_quartile <- summary(Upper_Quartile_norm_test$samples$norm.factors)[3]
if(is.na(summary_upper_quartile) | is.infinite(summary_upper_quartile)){
message("Upper Quartile reference selection failed will use find sample with largest sqrt(read_depth) to use as reference")
Ref_col <- which.max(colSums(sqrt(ASV_table)))
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method = "TMMwsp", refColumn = Ref_col)
fileConn<-file(args[[4]])
writeLines(c("Used max square root read depth to determine reference sample"), fileConn)
close(fileConn)
}else{
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method="TMMwsp")
}
## make matrix for testing
colnames(groupings) <- c("comparison")
mm <- model.matrix(~comparison, groupings)
voomvoom <- voom(DGE_LIST_Norm, mm, plot=F)
fit <- lmFit(voomvoom,mm)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, n=nrow(DGE_LIST_Norm), sort.by="none")
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
MaAsLin2是一种用于微生物组数据差异分析的统计方法,它专门设计用来处理微生物组数据的复杂性,包括噪声、稀疏性(零膨胀)、高维度、极端非正态分布以及通常以计数或组成性测量的形式出现的数据。以下是MaAsLin2方法的基本原理:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#BiocManager::install("Maaslin2")
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
library(Maaslin2)
ASV_table <- data.frame(t(ASV_table), check.rows = F, check.names = F, stringsAsFactors = F)
fit_data <- Maaslin2(
ASV_table, groupings,args[3], transform = "AST",
fixed_effects = c(colnames(groupings[1])),
standardize = FALSE, plot_heatmap = F, plot_scatter = F)
metagenomeSeq是一种用于分析微生物组测序数据的统计学方法,它可以帮助研究人员发现不同条件下微生物组的差异丰度。以下是metagenomeSeq方法的基本原理:
deps = c("metagenomeSeq")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(deps)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there is an argument supply
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
data_list <- list()
data_list[["counts"]] <- ASV_table
data_list[["taxa"]] <- rownames(ASV_table)
pheno <- AnnotatedDataFrame(groupings)
pheno
counts <- AnnotatedDataFrame(ASV_table)
feature_data <- data.frame("ASV"=rownames(ASV_table),
"ASV2"=rownames(ASV_table))
feature_data <- AnnotatedDataFrame(feature_data)
rownames(feature_data) <- feature_data@data$ASV
test_obj <- newMRexperiment(counts = data_list$counts, phenoData = pheno, featureData = feature_data)
p <- cumNormStat(test_obj, pFlag = T)
p
test_obj_norm <- cumNorm(test_obj, p=p)
fromula <- as.formula(paste(~1, colnames(groupings)[1], sep=" + "))
pd <- pData(test_obj_norm)
mod <- model.matrix(fromula, data=pd)
regres <- fitFeatureModel(test_obj_norm, mod)
res_table <- MRfulltable(regres, number = length(rownames(ASV_table)))
write.table(res_table, file=args[3], quote=F, sep="\t", col.names = NA)
T-test-rarefaction是一种用于微生物组数据分析的统计方法,它结合了两种不同的技术:t检验和稀释抽样(rarefaction)。这种方法特别适用于处理微生物群落的计数数据,尤其是在样本数量有限或样本之间测序深度不均一的情况下。以下是T-test-rarefaction方法的基本原理:
### Run T Test on rarified
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
colnames(groupings)
colnames(groupings)[1] <- "places"
#apply wilcox test to rarified table
pvals <- apply(ASV_table, 1, function(x) t.test(x ~ groupings[,1], exact=F)$p.value)
write.table(pvals, file=args[[3]], sep="\t", col.names = NA, quote=F)
Wilcox-CLR是一种结合了中心对数比(Centered Log-Ratio, CLR)转换和Wilcoxon秩和检验的微生物组数据分析方法。这种方法特别适用于处理组成性数据(compositional data),即数据中各部分的总和为一个固定值,例如微生物群落中各物种的相对丰度之和为1。以下是Wilcox-CLR方法的基本原理:
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
colnames(groupings)
colnames(groupings)[1] <- "places"
#add pseudo count
CLR_table <- data.frame(apply(ASV_table + 1, 2, function(x){log(x) - mean(log(x))}))
## get clr table
#apply wilcox test to rarified table
pvals <- apply(CLR_table, 1, function(x) wilcox.test(x ~ groupings[,1], exact=F)$p.value)
write.table(pvals, file=args[[3]], sep="\t", col.names = NA, quote=F)
Wilcox-rare是一种结合了稀释抽样(Rarefaction)和Wilcoxon秩和检验的微生物组数据分析方法。这种方法适用于处理微生物群落的计数数据,尤其是在样本数量有限或样本之间测序深度不均一的情况下。以下是Wilcox-rare方法的基本原理:
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arrange to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " from the ASV_table")
message("A total of ", grouping_num-length(rownames(groupings)), " from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
colnames(groupings)
colnames(groupings)[1] <- "places"
#apply wilcox test to rarified table
pvals <- apply(ASV_table, 1, function(x) wilcox.test(x ~ groupings[,1], exact=F)$p.value)
write.table(pvals, file=args[[3]], sep="\t", col.names = NA, quote=F)
该研究证实,上述不同差异分析方法识别出了截然不同的数量和显著ASV(扩增序列变体)集合,结果依赖于数据预处理。对于许多方法来说,识别出的特征数量与数据的某些方面相关,如样本大小、测序深度以及群落差异的效应大小。ALDEx2和ANCOM-II在不同研究中产生最一致的结果,并且与不同方法结果交集的一致性最好。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。