前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >TCGA数据挖掘(四):表达差异分析(4)

TCGA数据挖掘(四):表达差异分析(4)

作者头像
DoubleHelix
发布2019-09-18 11:03:33
4.2K0
发布2019-09-18 11:03:33
举报
文章被收录于专栏:生物信息云生物信息云

在之前我们的文章:TCGA数据挖掘(三):表达差异分析中,我们利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数进行差异表达分析,我们也提到可以选择基于limma或edgeR包进行分析,TCGA数据挖掘(三):表达差异分析这一讲中我们利用的是edgeR包,之后我们在文章:TCGA数据挖掘(四):表达差异分析(2)TCGA数据挖掘(四):表达差异分析(3)中分别也介绍了其他方法的差异分析,包括edgeR和DESeq包,今天这一讲,我们就利用TCGAbiolinks包中的TCGAanalyze_DEA函数基于limma包进行差异分析。

数据下载

基因表达数据的下载

数据下载代码和之前的一样,这里再提供一次。避免出错不知道原因。

代码语言:javascript
复制
setwd("E:\\BioInfoCloud\\TCGABiolinks\\case_study1")
# 加载相应的包,可能会需要其他包,提示错误就安装缺少的包。
# 因为每个人已经安装的包不一样。
library(TCGAbiolinks)
library(plyr)
library(limma)
library(biomaRt)
library(SummarizedExperiment)
# 请求数据。
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")
# 从query中获取结果表,它可以选择带有cols参数的列,并使用rows参数返回若干行。
# 1222个barcode                 
samplesDown <- getResults(query,cols=c("cases"))
# samplesDown中筛选出TP(主要实体瘤)样本的barcode
# 1102个barcode
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")
# 113个barcode
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")

# 选择100个正常组织和100个肿瘤组织样本作为研究对象
dataSmTP_short <- dataSmTP[1:100]
dataSmNT_short <- dataSmNT[1:100]
# 根据前面的筛选,再次请求数据
queryDown <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))
#读取下载的数据并将其准备到R对象中,在工作目录生成brca_case1.rda文件,
#同时还产生Human_genes__GRCh38_p12_.rda文件
GDCdownload(query = queryDown)
dataPrep1 <- GDCprepare(query = queryDown, save = TRUE, save.filename = "brca_case1.rda")

数据处理

代码语言:javascript
复制
# 去除dataPrep1中的异常值,dataPrep数据中含有肿瘤组织和正常组织的数据
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")
#使用EDASeq进行标准化
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")
#根据分位数一处count较低的基因。
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                    qnt.cut =  0.25)
# 将计数数据转换为log2计数/百万(LogCPM),估计均值-方差关系并使用此关系计算适当的观测级别权重。
# 然后,数据即可用于线性建模。
v.dataFilt<-voom(dataFilt)
#in order to have 2 batches with multiple samples and avoid batches with one sample
#the user need to call first the get_IDs function on the top of this script if this has not been done already
c1<-which(get_IDs(dataPrep)$tss=="E9")
c2<-which(get_IDs(dataPrep)$tss=="E2")
#taking log transformed data for exploration of batch effects
TCGAbatch_Correction(tabDF = v.dataFilt$E[,c(c1,c2)], batch.factor="TSS", adjustment=NULL)
#if you get error in plot.new():figure margins too large - please use:
#par(mar=c(1,1,1,1))
#and then rerun the command above
代码语言:javascript
复制
#### After exploration of the batches we can continue to work on the original gene matrix for DEA ###
###############Tumor purity filtering###########
###vector containing all TCGA barcodes that hhave 60% tumor purity or more
Purity.BRCA<-TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6)$pure_barcodes
################DEA with Molecular subtypes############
####Molecular subtypes:
####Filtering data so all samples have a pam50 subtype for BRCA:

#diff contains TCGA samples that have an available molecular subtype
###Also Applying Tumor purity filtering by making a set difference
#documentation for TCGA_MolecularSubtype is available
##########Important########
####in this example no change happens because all the TCGA barcodes
####have available subtype info
###setdiff() is here used to remove samples with no subtype info from the TCGA BRCA samples that satisfy
###tumor purity criteria
diff<-setdiff(Purity.BRCA, TCGA_MolecularSubtype(colnames(dataPrep[,dataSmTP_short]))$filtered)
dataFilt.brca.cancer<-dataPrep[,diff]
dataFilt.brca.normal<-dataPrep[,dataSmNT_short]
dataFilt.brca<-cbind(dataFilt.brca.normal, dataFilt.brca.cancer)
mol_subtypes<-c(rep("Normal", 100), TCGA_MolecularSubtype(colnames(dataFilt.brca.cancer))$subtypes$subtype)
# All barcodes have available molecular subtype info
mol_subtypes<-make.names(mol_subtypes)
# to convert ENSEMBL id to gene name
rownames(dataFilt.brca)<-rowData(dataPrep1)$external_gene_name
# we redo here the normalization and filtering steps 
dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataFilt.brca,
                                           geneInfo = geneInfo,
                                           method = "gcContent")
# 将标准化后的数据再过滤,得到最终的数据
dataFilt.brca.final <- TCGAanalyze_Filtering(tabDF = dataNorm.brca,
                                       method = "quantile", 
                                       qnt.cut =  0.25)

表达差异分析

这里利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数,是基于limma包的差异分析。

代码语言:javascript
复制
DEG.brca.TSS <- TCGAanalyze_DEA(MAT=dataFilt.brca.final,
                            pipeline="limma",
                            batch.factors = c("TSS"),
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            voom = TRUE,
                            Condtypes = mol_subtypes,
                            contrast.formula = "Mycontrast = (BRCA.LumA+BRCA.LumB)/2 -Normal")
#in this DEA we use Normal as a reference, thus genes with LogFC > 1 are up regulated in the subtypes
#with respect to the normal samples and viceversa for the LogFC < -1.

write.csv(DEG.brca.TSS, "DEGsBRCA_limma_091018.csv", quote = FALSE)
#3241 genes identified

#to check how many with logFC > 5 (for plotting purposes)
DEG.brca.filt.TSS<-DEG.brca.TSS$Mycontrast[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6), ]
#47 genes identified

可视化 利用火山图进行可视化。 TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val, filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC", names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted", x.cut = 1, y.cut = 0.01, highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)], highlight.color = "orange")

在之前利用edgeR包获得的结果图如下,我们简单比较一下2中方法的差异。

很明显看到2中方法得到的结果是有差异的,这有时候我们会纠结那种方法好,这个就看你自己的研究啦,什么结果符合自己的用什么,当然这有点投机取巧的感觉,最好是2中方法得到后取交集。 不过下面我们对这2种方法得到的结果进行一个相关性评估。 limma和edgeR结果的相关性比较library(ggplot2) DEGs_limma <- read.csv("DEGsBRCA_limma_091018.csv", row.names = 1) DEGs_edgeR <- read.csv("DEGsBRCA_edgeR_091018.csv", row.names = 1) genes_toTest <- rownames(DEGs_limma)[1:1000] genes_common <- intersect(genes_toTest, rownames(DEGs_edgeR)) dataset <- subset(DEGs_limma, rownames(DEGs_limma) %in% genes_common, select = "Mycontrast.logFC") dataset <- cbind(DEGs_edgeR$Mycontrast.logFC[match(rownames(dataset),rownames(DEGs_edgeR))], dataset) colnames(dataset) <- c("logFC_edgeR", "logFC_limma") pdf("scatterplot_logFC_limma_edgeR_top1000.pdf", width=5, height=3) ggplot(dataset, aes(x=logFC_limma,y=logFC_edgeR))+geom_point()+ xlab("logFC_limma")+ylab("logFC_edgeR")+ theme(legend.title=element_blank(),axis.title=element_text(size=10),axis.text=element_text(size=10), legend.text=element_text(size=20)) dev.off() cor(dataset$logFC_edgeR, dataset$logFC_limma) #0.9936 这2种方法得到的结果虽然有差异,但很小得到的大多数差异基因是一样的。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-09-17,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 MedBioInfoCloud 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 数据下载
  • 数据处理
  • 表达差异分析
  • 可视化 利用火山图进行可视化。 TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val, filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC", names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted", x.cut = 1, y.cut = 0.01, highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)], highlight.color = "orange")
  • ?
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档