Introduction
TCGAbiolinks is an R package for integrative analysis of TCGA data.
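Background
The TCGA database is a go-to public resource for cancer research and integrates multi-omics data across many cancer types. Today we introduce a powerful tool for analyzing TCGA data: TCGAbiolinks!
TCGAbiolinks queries the National Cancer Institute (NCI) Genomic Data Commons (GDC) through the GDC application programming interface (API) to search for, download, and prepare data. The package also provides a range of functions for analyzing and visualizing the data in R.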
Installing the R package
# Install the stable release from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
# Install the development version from GitHub
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
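Using TCGAbiolinks
01
TCGA data analysis
Preprocessing gene expression data
We can search for a specific set of TCGA samples, download them, and prepare a gene expression matrix. (You can of course download the whole dataset instead!)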
library(TCGAbiolinks)
library(SummarizedExperiment)  # provides assay()
# Define a list of TCGA sample barcodes
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07", "TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07", "TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07", "TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07", "TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07", "TCGA-AQ-A0Y5-01A-11R-A14M-07")
# Build a query for the files that match these barcodes
# Note: legacy = TRUE targets the GDC legacy archive; newer GDC and
# TCGAbiolinks releases may no longer support this argument.
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples,
                  legacy = TRUE)
# Download the files described by the query
GDCdownload(query)
# Prepare the expression matrix: rows are gene IDs, columns are samples
BRCARnaseqSE <- GDCprepare(query)
BRCAMatrix <- assay(BRCARnaseqSE, "raw_count")
# Identify outlier samples with a correlation boxplot and an
# array-array intensity correlation (AAIC) plot
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseqSE)
Partial output of TCGAanalyze_Preprocessing
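To make the idea of correlation-based outlier detection concrete, here is a minimal base-R sketch on simulated data. It is not the implementation inside TCGAanalyze_Preprocessing: the toy matrix, the sample names S1 to S6, and the 0.6 cutoff are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical toy data: 200 genes x 6 samples sharing one expression profile
base <- rexp(200, rate = 0.01)
expr <- sapply(1:6, function(i) base * exp(rnorm(200, sd = 0.1)))
colnames(expr) <- paste0("S", 1:6)

# Turn S6 into an outlier by giving it an unrelated profile
expr[, "S6"] <- rexp(200, rate = 0.01)

# Sample-to-sample Spearman correlation; ignore self-correlation
corMat <- cor(expr, method = "spearman")
diag(corMat) <- NA

# Average correlation of each sample with all the others
meanCor <- colMeans(corMat, na.rm = TRUE)

# Flag samples whose average correlation falls below an assumed cutoff
corCut <- 0.6
outliers <- names(meanCor)[meanCor < corCut]
print(round(meanCor, 2))
print(outliers)
```

A sample that correlates poorly with every other sample is a candidate for removal before downstream analysis; in real data the cutoff should be chosen after inspecting the plots.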
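Differential expression analysis: TCGAanalyze_DEA & TCGAanalyze_LevelTab
TCGAanalyze_DEA identifies differentially expressed genes (DEGs). TCGAanalyze_LevelTab then builds a table that lists each DEG with its log fold change, its false discovery rate (FDR), and its expression level in the Cond1type and Cond2type samples.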
# Normalize the data
# (dataBRCA and geneInfo are taken here to be the example objects bundled with TCGAbiolinks)
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo)
# Filter out low-expression genes using a quantile cutoff
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut = 0.25)
# Select normal samples (NT)
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))
# Select tumor samples (TP)
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("TP"))
# Differential expression analysis
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[, samplesNT],
                            mat2 = dataFilt[, samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")
# Expression levels of the DEGs in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs, "Tumor", "Normal",
                                          dataFilt[, samplesTP], dataFilt[, samplesNT])
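The role of the fdr.cut and logFC.cut thresholds can be illustrated with a small base-R sketch on simulated log-scale values. This is only a stand-in: it uses a plain t-test rather than the edgeR likelihood ratio test that method = "glmLRT" refers to, and the gene names, group sizes, and effect size are all hypothetical.

```r
set.seed(42)

nGenes <- 100
geneNames <- paste0("gene", seq_len(nGenes))

# Hypothetical log-scale expression: 5 normal and 5 tumor samples
normal <- matrix(rnorm(nGenes * 5, mean = 8, sd = 0.5), nrow = nGenes,
                 dimnames = list(geneNames, paste0("N", 1:5)))
tumor <- matrix(rnorm(nGenes * 5, mean = 8, sd = 0.5), nrow = nGenes,
                dimnames = list(geneNames, paste0("T", 1:5)))

# Spike in 10 genes that are strongly up-regulated in tumor
tumor[1:10, ] <- tumor[1:10, ] + 6

# Log fold change (values are already on a log scale) and per-gene p-value
logFC <- rowMeans(tumor) - rowMeans(normal)
pval <- vapply(seq_len(nGenes), function(i) {
  t.test(tumor[i, ], normal[i, ], var.equal = TRUE)$p.value
}, numeric(1))

# Benjamini-Hochberg adjustment gives the FDR
fdr <- p.adjust(pval, method = "BH")

# A table in the spirit of TCGAanalyze_LevelTab: logFC, FDR, group means
degTable <- data.frame(logFC = logFC, FDR = fdr,
                       Normal = rowMeans(normal), Tumor = rowMeans(tumor),
                       row.names = geneNames)

# Apply the same style of cutoffs used above
degs <- degTable[abs(degTable$logFC) >= 1 & degTable$FDR < 0.01, ]
print(degs)
```

Requiring both a minimum effect size and a maximum FDR keeps genes whose change is both large and statistically supported.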
Enrichment analysis: TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot
TCGAanalyze_EAcomplete runs an enrichment analysis on a gene list, and TCGAvisualize_EAbarplot displays the results.
Genelist <- rownames(dataDEGsFiltLevel)
# Enrichment against GO terms and pathways
system.time(ansEA <- TCGAanalyze_EAcomplete(TFname = "DEA genes Normal Vs Tumor", Genelist))
# Visualize the enrichment results
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = Genelist,
                        nBar = 10)
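Enrichment analysis asks whether a gene list overlaps a gene set more than chance would predict. One common formulation is the hypergeometric test, sketched below in base R. Whether TCGAanalyze_EAcomplete uses exactly this test is not stated in this article, and the gene universe, pathway, and gene list here are invented for illustration.

```r
# Hypothetical universe of 1000 genes and a pathway containing 50 of them
universe <- paste0("G", 1:1000)
pathway <- universe[1:50]

# Hypothetical DEG list of 100 genes, 20 of which belong to the pathway
degList <- c(universe[1:20], universe[501:580])

overlap <- length(intersect(degList, pathway))

# P(X >= overlap) when drawing length(degList) genes at random
pval <- phyper(overlap - 1,
               m = length(pathway),                     # genes in the pathway
               n = length(universe) - length(pathway),  # genes outside it
               k = length(degList),                     # genes drawn
               lower.tail = FALSE)

# Overlap expected by chance, for comparison
expected <- length(degList) * length(pathway) / length(universe)
cat("observed:", overlap, "expected:", expected, "p-value:", pval, "\n")
```

In practice the test is repeated for every GO term or pathway, so the resulting p-values should be corrected for multiple testing.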
Survival analysis: TCGAanalyze_survival
Use TCGAanalyze_survival to plot survival curves.
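# First, fetch the clinical data that holds the survival information
clin.gbm <- GDCquery_clinic("TCGA-GBM", "clinical")
# Plot survival curves grouped by gender
TCGAanalyze_survival(clin.gbm,
                     "gender",
                     main = "TCGA Set\n GBM", height = 10, width = 10)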
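For readers who want to see what sits underneath a survival plot, here is a rough sketch using the survival package on a tiny made-up clinical table. The column names days_to_death, days_to_last_follow_up, and vital_status are assumptions about what the clinical data looks like, the values are invented, and this is not how TCGAanalyze_survival is implemented.

```r
library(survival)

# Hypothetical clinical table with 8 patients
clin <- data.frame(
  gender = c("male", "female", "male", "female",
             "male", "female", "male", "female"),
  vital_status = c("Dead", "Dead", "Alive", "Dead",
                   "Dead", "Alive", "Dead", "Alive"),
  days_to_death = c(120, 400, NA, 650, 90, NA, 300, NA),
  days_to_last_follow_up = c(NA, NA, 500, NA, NA, 800, NA, 720)
)

# Follow-up time: time to death for deceased patients, last follow-up otherwise
clin$time <- ifelse(clin$vital_status == "Dead",
                    clin$days_to_death,
                    clin$days_to_last_follow_up)

# Event indicator: 1 = death observed, 0 = censored
clin$event <- as.integer(clin$vital_status == "Dead")

# Kaplan-Meier estimate per gender and a log-rank test between the groups
fit <- survfit(Surv(time, event) ~ gender, data = clin)
logrank <- survdiff(Surv(time, event) ~ gender, data = clin)

print(fit)
print(logrank)
```

Calling plot(fit) would draw the two Kaplan-Meier curves; with only eight invented patients the log-rank result carries no meaning beyond showing the mechanics.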
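Differentially methylated region analysis: TCGAanalyze_DMR
# `data` is expected to be a DNA methylation object whose sample annotation
# includes a "methylation_subtype" column
data <- TCGAanalyze_DMR(data, groupCol = "methylation_subtype",
                        group1 = "CIMP.H",
                        group2 = "CIMP.L",
                        p.cut = 10^-5,
                        diffmean.cut = 0.25,
                        legend = "State",
                        plot.filename = "coad_CIMPHvsCIMPL_metvolcano.png")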
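The p.cut and diffmean.cut arguments can be understood through a small simulation. The sketch below assumes the approach is a per-probe difference in mean beta value plus a Wilcoxon test with Benjamini-Hochberg adjustment; check the package documentation for what TCGAanalyze_DMR actually does. The probe names, group sizes, and beta values are hypothetical.

```r
set.seed(7)

nProbes <- 50
nPer <- 20
probeNames <- paste0("probe", seq_len(nProbes))
group <- rep(c("CIMP.H", "CIMP.L"), each = nPer)

# Hypothetical beta values: most probes look the same in both groups
beta <- matrix(rnorm(nProbes * 2 * nPer, mean = 0.5, sd = 0.05),
               nrow = nProbes, dimnames = list(probeNames, NULL))

# Make the first 5 probes hypermethylated in CIMP.H relative to CIMP.L
beta[1:5, group == "CIMP.H"] <- rnorm(5 * nPer, mean = 0.8, sd = 0.05)
beta[1:5, group == "CIMP.L"] <- rnorm(5 * nPer, mean = 0.2, sd = 0.05)

# Beta values are bounded between 0 and 1
beta <- pmin(pmax(beta, 0), 1)

# Difference in mean methylation between the two groups
diffmean <- rowMeans(beta[, group == "CIMP.H"]) -
  rowMeans(beta[, group == "CIMP.L"])

# Per-probe Wilcoxon rank-sum test, then BH adjustment
pval <- vapply(seq_len(nProbes), function(i) {
  wilcox.test(beta[i, group == "CIMP.H"], beta[i, group == "CIMP.L"])$p.value
}, numeric(1))
padj <- p.adjust(pval, method = "BH")

res <- data.frame(diffmean = diffmean, p.adj = padj, row.names = probeNames)

# Same style of cutoffs as in the call above
hits <- res[abs(res$diffmean) >= 0.25 & res$p.adj < 1e-5, ]
print(hits)
```

A rank-based test needs reasonably large groups to reach a cutoff as strict as 10^-5, which is why the toy example uses 20 samples per group.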
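02
TCGA data visualization
Heatmap: TCGAvisualize_Heatmap
This function wraps the ComplexHeatmap package, which makes drawing heatmaps straightforward.
PCA plot of differentially expressed genes: TCGAvisualize_PCA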
# Normalize the data
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
# Filter out low-expression genes using a quantile cutoff
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut = 0.25)
# Select normal samples (NT)
group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
# Select tumor samples (TP)
group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
# Principal component analysis plot for the top ntopgenes DEGs
# (dataDEGsFiltLevel comes from the differential expression step above)
pca <- TCGAvisualize_PCA(dataFilt, dataDEGsFiltLevel, ntopgenes = 200, group1, group2)
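As a rough illustration of what a PCA on top-ranked genes does, the base-R sketch below runs prcomp on simulated data. It is not the code behind TCGAvisualize_PCA: the matrix, the sample names, and the ranking by absolute mean difference (a stand-in for a real DEG ranking) are assumptions.

```r
set.seed(3)

nGenes <- 50
group1 <- paste0("N", 1:5)  # hypothetical normal samples
group2 <- paste0("T", 1:5)  # hypothetical tumor samples

# Hypothetical log-scale expression matrix: genes in rows, samples in columns
expr <- matrix(rnorm(nGenes * 10), nrow = nGenes,
               dimnames = list(paste0("gene", seq_len(nGenes)),
                               c(group1, group2)))

# Make the first 20 genes differ between the groups
expr[1:20, group2] <- expr[1:20, group2] + 5

# Rank genes by absolute mean difference and keep the top 20
delta <- abs(rowMeans(expr[, group2]) - rowMeans(expr[, group1]))
topGenes <- names(sort(delta, decreasing = TRUE))[1:20]

# prcomp expects samples in rows, so transpose the gene-by-sample matrix
pca <- prcomp(t(expr[topGenes, ]), center = TRUE, scale. = TRUE)

# Sample scores on the first component and variance explained per component
pc1 <- pca$x[, "PC1"]
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(pc1, 2))
print(round(varExplained[1:2], 2))
```

When the selected genes really do differ between groups, the first component separates normal from tumor samples, which is what the PCA plot is meant to show.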
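Integrating gene expression and DNA methylation data: TCGAvisualize_starburst
The starburst plot combines the information from two volcano plots to study DNA methylation and gene expression together. The FDR-adjusted P value for DNA methylation is plotted on the x-axis and the FDR-adjusted P value for gene expression on the y-axis, both on a log10 scale. The black dashed lines mark an FDR-adjusted P value of 0.01.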
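# The first two arguments are placeholders: the DNA methylation object and
# the differential expression results for the same two groups
starburst <- TCGAvisualize_starburst(coad.SummarizeExperiment,
                                     different.experssion.analysis.data,
                                     group1 = "CIMP.H",
                                     group2 = "CIMP.L",
                                     met.platform = "450K",
                                     genome = "hg19",
                                     met.p.cut = 10^-5,
                                     exp.p.cut = 10^-5,
                                     names = TRUE)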
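To show how two sets of results can be combined into starburst-style coordinates, here is a small base-R sketch. The signed log10 axes, the category labels, and the four example genes are assumptions made for illustration, not the package's internal logic or output format.

```r
# Assumed significance cutoff, matching met.p.cut and exp.p.cut above
pCut <- 1e-5

# Hypothetical per-gene results from the methylation and expression analyses
genes <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  met.diffmean = c(0.40, -0.50, 0.30, 0.01),  # change in mean methylation
  met.fdr = c(1e-8, 1e-7, 0.2, 0.5),
  exp.logFC = c(-2.0, 3.0, -1.5, 0.1),        # change in expression
  exp.fdr = c(1e-6, 1e-9, 1e-7, 0.8)
)

# Signed -log10(FDR): the magnitude shows significance, the sign shows direction
genes$x <- -log10(genes$met.fdr) * sign(genes$met.diffmean)
genes$y <- -log10(genes$exp.fdr) * sign(genes$exp.logFC)

# A gene is of interest only if it is significant on both axes
sig <- genes$met.fdr < pCut & genes$exp.fdr < pCut

genes$status <- "Not significant"
genes$status[sig & genes$met.diffmean > 0 & genes$exp.logFC < 0] <-
  "Hypermethylated, down-regulated"
genes$status[sig & genes$met.diffmean < 0 & genes$exp.logFC > 0] <-
  "Hypomethylated, up-regulated"

print(genes[, c("gene", "x", "y", "status")])
```

Genes in the two labeled categories show opposite changes in methylation and expression, the pattern usually associated with epigenetic silencing or activation.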
R package reference:
http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
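Summary
TCGA is one of the most widely used public cancer databases and is often our first choice for data analysis. Many powerful analysis tools have grown up around it, and TCGAbiolinks, introduced today, is one of them. It covers data download, preprocessing, integration, analysis, and visualization, and it is easy to use. If you use TCGAbiolinks in your research, remember to cite it!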