对单细胞数据进行亚群注释之后,我们往往想比较某亚群,例如CD8Tex,是倾向于分布在实验组还是对照组,例如癌组织,癌旁组织,转移癌组织,淋巴组织?这时候有很多策略去做这种多组间的比较。
第一种,是《Therapy-Induced Evolution of Human Lung Cancer Revealed by Single-Cell RNA Sequencing》这篇Cell的做法,这篇文章有三种处理组(TN,RD和PD),或许是因为用的Smart-seq2测序,每个样本得到的细胞数量确实不多,因此作者简单粗暴的把同一组内的样本细胞加和计算亚群细胞的频率,进行统计,绘图如下:
这种做法其实是存在一定问题的,它画不出所有样本分布的散点图,因为本质上作者把同一组的亚群看成一个样本。我在单细胞思考:Cell作者一定是对的吗?此文中复现了作者的处理思路和图表。
我也在如上推文中给出了第二种多组间亚群比较的策略,即按照每个样本的细胞总和进行百分比的校正,然后比较频率。这也是绝大多数文章采取的策略,例如《Pan-cancer single-cell landscape of tumor-infiltrating T cells》张泽民团队比较多癌种某单细胞亚群的频率分布差异,一个散点代表一个样本:
第三种策略也是出现在《Pan-cancer single-cell landscape of tumor-infiltrating T cells》张泽民老师的这篇文章中,即Fig1F的这幅图,利用OR比值比的统计学方法,比较血液,正常组织和肿瘤组织,各单细胞亚群的分布差异:
第四种策略也是张泽民团队经常使用的一个统计方法,Ro/e,这个指标是观察到的细胞数与期望细胞数的比值,用于量化每个亚群对组织的偏好程度。
本文针对第三种策略OR比值比做一介绍和图表复现。
见《Pan-cancer single-cell landscape of tumor-infiltrating T cells》方法部分:
Indexes characterizing meta-clusters and STARTRAC analysis To characterize the tissue distribution of meta-clusters, odds ratios (OR) were calculated and used to indicate preferences. Specifically, for each combination of meta-cluster i and tissue j, a 2 by 2 contingency table was constructed, which contained the number of cells of meta-cluster i in tissue j, the number of cells of meta-cluster i in other tissues, the number of cells of non-i meta-clusters in tissue j, the number of cells of non-i meta-clusters in other tissues. Then Fisher’s exact test was applied on this contingency table, thus OR and corresponding p-value could be obtained. P-values were adjusted using the BH method implemented in the R function p.adjust. We found that all ORs > 1.5 or ORs < 0.5 had adjusted p-values < 1e-10. Hence, a higher OR with a value > 1.5 indicated that meta-cluster i was more preferred to distribute in tissue j, a lower OR with a value < 0.5 indicated that meta-cluster i was preferred not to distribute in tissue j.
总结来说,OR>1.5越倾向在该组织分布,OR<0.5越不倾向在该组织中分布。
下面利用作者给的meta.data数据和代码计算OR值:
有一些R包需要提前安装一下:
library("sscVis")
library("data.table")
library("grid")
library("cowplot")
library("ggrepel")
library("readr")
library("plyr")
library("ggpubr")
library("ggplot2")
#设置图片输出目录
out.prefix <- "./Fig1"
读入数据:
meta.tb <- read_rds("../data/metaInfo/panC.freq.all.meta.tb.rds")
meta.tb文件实际上就是平时我们储存在Seurat单细胞对象里的meta.data:
然后使用的两个函数(改编自作者提供的代码,见文末)统计OR值和可视化:
OR.CD8.list <- do.tissueDist(cellInfo.tb=meta.tb[stype=="CD8" & treatment=="baseline",],
out.prefix=sprintf("%s.STARTRAC.dist.T.baseline.CD8",out.prefix),
pdf.width=4,pdf.height=6,verbose=1)
OR值储存在这个list里:
OR.CD8.list$OR.dist.mtx
还有p值:
OR.CD8.list$p.dist.tb
使用的函数改编自作者提供的代码,如下:
do.tissueDist <- function(cellInfo.tb = cellInfo.tb,
meta.cluster = cellInfo.tb$meta.cluster,
colname.patient = "patient",
loc = cellInfo.tb$loc,
out.prefix,
pdf.width=3,
pdf.height=5,
verbose=0){
##input data
library(data.table)
dir.create(dirname(out.prefix),F,T)
cellInfo.tb = data.table(cellInfo.tb)
cellInfo.tb$meta.cluster = as.character(meta.cluster)
if(is.factor(loc)){
cellInfo.tb$loc = loc
}else{cellInfo.tb$loc = as.factor(loc)}
loc.avai.vec <- levels(cellInfo.tb[["loc"]])
count.dist <- unclass(cellInfo.tb[,table(meta.cluster,loc)])[,loc.avai.vec]
freq.dist <- sweep(count.dist,1,rowSums(count.dist),"/")
freq.dist.bin <- floor(freq.dist * 100 / 10)
print(freq.dist.bin)
{
count.dist.melt.ext.tb <- test.dist.table(count.dist)
p.dist.tb <- dcast(count.dist.melt.ext.tb,rid~cid,value.var="p.value")
OR.dist.tb <- dcast(count.dist.melt.ext.tb,rid~cid,value.var="OR")
OR.dist.mtx <- as.matrix(OR.dist.tb[,-1])
rownames(OR.dist.mtx) <- OR.dist.tb[[1]]
}
sscVis::plotMatrix.simple(OR.dist.mtx,
out.prefix=sprintf("%s.OR.dist",out.prefix),
show.number=F,
waterfall.row=T,par.warterfall = list(score.alpha = 2,do.norm=T),
exp.name=expression(italic(OR)),
z.hi=4,
palatte=viridis::viridis(7),
pdf.width = 4, pdf.height = pdf.height)
if(verbose==1){
return(list("count.dist.melt.ext.tb"=count.dist.melt.ext.tb,
"p.dist.tb"=p.dist.tb,
"OR.dist.tb"=OR.dist.tb,
"OR.dist.mtx"=OR.dist.mtx))
}else{
return(OR.dist.mtx)
}
}
test.dist.table <- function(count.dist,min.rowSum=0)
{
count.dist <- count.dist[rowSums(count.dist)>=min.rowSum,,drop=F]
sum.col <- colSums(count.dist)
sum.row <- rowSums(count.dist)
count.dist.tb <- as.data.frame(count.dist)
setDT(count.dist.tb,keep.rownames=T)
count.dist.melt.tb <- melt(count.dist.tb,id.vars="rn")
colnames(count.dist.melt.tb) <- c("rid","cid","count")
count.dist.melt.ext.tb <- as.data.table(ldply(seq_len(nrow(count.dist.melt.tb)), function(i){
this.row <- count.dist.melt.tb$rid[i]
this.col <- count.dist.melt.tb$cid[i]
this.c <- count.dist.melt.tb$count[i]
other.col.c <- sum.col[this.col]-this.c
this.m <- matrix(c(this.c,
sum.row[this.row]-this.c,
other.col.c,
sum(sum.col)-sum.row[this.row]-other.col.c),
ncol=2)
res.test <- fisher.test(this.m)
data.frame(rid=this.row,
cid=this.col,
p.value=res.test$p.value,
OR=res.test$estimate)
}))
count.dist.melt.ext.tb <- merge(count.dist.melt.tb,count.dist.melt.ext.tb,
by=c("rid","cid"))
count.dist.melt.ext.tb[,adj.p.value:=p.adjust(p.value,"BH")]
return(count.dist.melt.ext.tb)
}
- END -