前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >TCGA的28篇教程-风险因子关联图-一个价值1000但是迟到的答案

TCGA的28篇教程-风险因子关联图-一个价值1000但是迟到的答案

作者头像
生信技能树
发布2018-08-16 17:13:28
1.2K0
发布2018-08-16 17:13:28
举报
文章被收录于专栏:生信技能树生信技能树

早在 2017-03-13 我就在生信技能树推出过绘图交易专区: [有偿专区]TCGA 预后作图 那个时候隐隐约约知道这是一个很大的市场,可惜人的精力是有限的, 我需要持续更新 10000+ 生物信息学基础教程,而且那个时候也没有一个愿意负责这块的小伙伴,只好搁置到了现在。

其实那个需求很简单:https://www.ncbi.nlm.nih.gov/pubmed/24893932 文章里面也说的很清楚,如下:

代码语言:javascript
复制
# 
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051943/figure/djw200-F2/
# https://www.ncbi.nlm.nih.gov/pubmed/24893932
# Sixteen of the 111 most significantly altered miRNAs were associated with OS across different clinical subclasses of the TCGA-derived LUAD cohort. 
# A linear prognostic model of eight miRNAs (miR-31, miR-196b, miR-766, miR-519a-1, miR-375, miR-187, miR-331 and miR-101-1) was constructed and weighted by the importance scores from the supervised principal component method to divide patients into high- and low-risk groups. 
# Patients assigned to the high-risk group exhibited poor OS compared with patients in the low-risk group (hazard ratio [HR]=1.99, P <0.001). 
# The eight-miRNA signature is an independent prognostic marker of OS of LUAD patients and demonstrates good performance for predicting 5-year OS (Area Under the respective ROC Curves [AUC] = 0.626, P = 0.003), especially for non-smokers (AUC = 0.686, P = 0.023).

恰好我马上要去做TCGA商业演讲,这个知识点在授课范围,也顺便解决一下这个一年前的问题。

首先下载好TCGA的LUAD的miRNA表达数据和临床数据

下载方式我就不多说了,大家看我以前的教程:

这里选取最方便的来举例说明咯:

代码语言:javascript
复制
rm(list=ls())
library(survival)
library(survminer)
if(F){
  library(RTCGA.miRNASeq) 
  ??RTCGA.miRNASeq
  s=rownames(LUAD.miRNASeq)[seq(1,nrow(LUAD.miRNASeq),by=3)]
  expr <- expressionsTCGA(LUAD.miRNASeq)
  dim(expr)
  expr[1:40,1:4]
  expr=as.data.frame(expr[seq(1,nrow(expr),by=3),3:ncol(expr)])
  mi=colnames(expr)
  expr=apply(expr,1,as.numeric) 
  colnames(expr)=s
  rownames(expr)=mi
  expr[1:4,1:4]
  expr=na.omit(expr)
  dim(expr)
  expr=expr[apply(expr, 1,function(x){sum(x>1)>10}),]
  dim(expr)

  library(RTCGA.clinical) 
  ??RTCGA.clinical
  meta <- LUAD.clinical
  tmp=as.data.frame(colnames(meta))
  meta[(grepl('patient.bcr_patient_barcode',colnames(meta)))]
  meta[(grepl('patient.days_to_last_followup',colnames(meta)))]
  meta[(grepl('patient.days_to_death',colnames(meta)))]
  meta[(grepl('patient.vital_status',colnames(meta)))]
  ## patient.race  # patient.age_at_initial_pathologic_diagnosis # patient.gender 
  # patient.stage_event.clinical_stage
  meta=as.data.frame(meta[c('patient.bcr_patient_barcode','patient.vital_status',
                            'patient.days_to_death','patient.days_to_last_followup',
                            'patient.race',
                            'patient.age_at_initial_pathologic_diagnosis',
                            'patient.gender' ,
                            'patient.stage_event.pathologic_stage')])
  #meta[(grepl('patient.stage_event.pathologic_stage',colnames(meta)))]

  save(expr,meta,file = 'TCGA-LUAD-miRNA-example.Rdata')
} 
load(file = 'TCGA-LUAD-miRNA-example.Rdata')

接着整理好数据格式

还是那句老话,数据分析师其实大部分时间花在数据整理上面:

代码语言:javascript
复制
group_list=ifelse(substr(colnames(expr),14,15)=='01','tumor','normal')
table(group_list)

if(!file.exists('TCGA-LUAD-survival_input.Rdata')){
  exprSet=na.omit(expr)
  exprSet=exprSet[,group_list=='tumor']

  head(meta)
  colnames(meta)
  meta[,3][is.na(meta[,3])]=0
  meta[,4][is.na(meta[,4])]=0
  meta$days=as.numeric(meta[,3])+as.numeric(meta[,4])
  meta=meta[,c(1:2,5:9)]
  colnames(meta)
  colnames(meta)=c('ID','event','race','age','gender','stage',"days") 
  library(survival)
  library(survminer)
  meta$event=ifelse(meta$event=='alive',0,1)
  meta$age=as.numeric(meta$age)
  library(stringr) 
  meta$stage=str_split(meta$stage,' ',simplify = T)[,2]
  table(  meta$stage)
  boxplot(meta$age)
  meta$age_group=ifelse(meta$age>median(meta$age,na.rm = T),'older','younger')
  table(meta$race)
  meta$time=meta$days/30
  phe=meta

  head(phe)
  phe$ID=toupper(phe$ID) 
  phe=phe[match(substr(colnames(exprSet),1,12),phe$ID),]
  head(phe)
  exprSet[1:4,1:4]

  save(exprSet,phe,file='TCGA-LUAD-survival_input.Rdata')
}

load(file='TCGA-LUAD-survival_input.Rdata')
head(phe)
exprSet[1:4,1:4]

清洗干净的数据如下:

代码语言:javascript
复制
> head(phe)
                    ID event race age gender stage days age_group      time
52.70.0   TCGA-05-4244     0 <NA>  70   male    iv    0     older  0.000000
52.70.0.2 TCGA-05-4249     0 <NA>  67   male    ib 1158     older 38.600000
52.70.0.3 TCGA-05-4250     1 <NA>  79 female  iiia  121     older  4.033333
58.73.0   TCGA-05-4382     0 <NA>  68   male    ib  607     older 20.233333
58.73.0.1 TCGA-05-4389     0 <NA>  70   male    ia 1369     older 45.633333
58.73.0.2 TCGA-05-4395     1 <NA>  76   male  iiib    0     older  0.000000
> exprSet[1:4,1:4]
             TCGA-05-4244-01A-01T-1108-13 TCGA-05-4249-01A-01T-1108-13
hsa-let-7a-1                         3985                         8916
hsa-let-7a-2                         7947                        17800
hsa-let-7a-3                         4128                         9079
hsa-let-7b                           9756                        32960

就很容易做分析了。

挑选感兴趣的基因构建coxph模型

后期我会录制视频讲解其中原理,现在大家就先看看吧,反正我写的代码都是可以运行的,不像其它垃圾教程。

代码语言:javascript
复制
# miR-31, miR-196b, miR-766, miR-519a-1, miR-375, miR-187, miR-331 and miR-101-1
# hsa-mir-31,hsa-mir-196b,hsa-mir-766,hsa-mir-519a-1,hsa-mir-375,hsa-mir-187,hsa-mir-331,hsa-mir-101-1
e=t(exprSet[c('hsa-mir-31','hsa-mir-196b','hsa-mir-766','hsa-mir-519a-1','hsa-mir-375','hsa-mir-187','hsa-mir-331','hsa-mir-101-1'),])
e=log2(e+1)
colnames(e)=c('miR31','miR196b','miR766','miR519a1','miR375','miR187','miR331','miR101')
dat=cbind(phe,e)

dat$gender=factor(dat$gender)
dat$stage=factor(dat$stage)

colnames(dat) 
s=Surv(time, event) ~ miR31+miR196b+miR766+miR519a1+miR375+miR187+miR331+miR101
model <- coxph(s, data = dat )
summary(model,data=dat)
options(scipen=1)
ggforest(model, data =dat, 
         main = "Hazard ratio", 
         cpositions = c(0.10, 0.22, 0.4), 
         fontsize = 1.0, 
         refLabel = "1", noDigits = 4)


# 对于生存数据预后模型评价很多采用C-index ,但c-index展示没有roc曲线图来的直观
new_dat=dat

fp <- predict(model,new_dat,type="risk")
fp <- predict(model,new_dat,type="expected") 
plot(fp,phe$days)
fp <- predict(model,new_dat) 
basehaz(model) 
library(Hmisc)
options(scipen=200)
with(new_dat,rcorr.cens(fp,Surv(time, event)  ))
# http://dni-institute.in/blogs/cox-regression-interpret-result-and-predict/

最后进行可视化

出图的代码往往很简单:

代码语言:javascript
复制
library(cowplot)
library(pheatmap)
# https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html
fp_dat=data.frame(s=1:length(fp),v=as.numeric(sort(fp )))
sur_dat=data.frame(s=1:length(fp),
                   t=phe[names(sort(fp )),'time'] ,
                   e=phe[names(sort(fp )),'event']  ) 
sur_dat$e=ifelse(sur_dat$e==0,'alive','death')
exp_dat=new_dat[names(sort(fp )),10:17]
plot.point=ggplot(fp_dat,aes(x=s,y=v))+geom_point()
plot.sur=ggplot(sur_dat,aes(x=s,y=t))+geom_point(aes(col=e))
mycolors <- colorRampPalette(c("black", "green", "red"), bias = 1.2)(100)
tmp=t(scale(exp_dat))
tmp[tmp > 1] = 1
tmp[tmp < -1] = -1
plot.h=pheatmap(tmp,col= mycolors,show_colnames = F,cluster_cols = F)
plot_grid(plot.point, plot.sur, plot.h$gtable,
          labels = c("A", "B","C"),
          align = 'v',ncol = 1)

但是出图后的美化,就考验人了,这里不赘述啦。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2018-08-08,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 首先下载好TCGA的LUAD的miRNA表达数据和临床数据
  • 接着整理好数据格式
  • 挑选感兴趣的基因构建coxph模型
  • 最后进行可视化
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档