A look at 5 breast cancer expression datasets

Author: 生信技能树
Published: 2018-07-27 14:29:52

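I recently needed to learn the genefu package (the Korean group's single-cell transcriptome study of breast cancer that I shared on 生信技能树 shows why) and then apply it to my own data. The package manual mentions 5 breast cancer expression datasets, which can be installed as follows: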

# Optional: use the USTC mirror of Bioconductor
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("genefu")

# The five experiment-data packages, installed without prompting or updating other packages
BiocManager::install(c("breastCancerMAINZ", "breastCancerTRANSBIG", "breastCancerUPP",
                       "breastCancerUNT", "breastCancerNKI"),
                     ask=FALSE, update=FALSE)

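All 5 datasets were published by earlier researchers. The Mainz, Transbig, UPP, and UNT datasets correspond to GSE11121, GSE7390, GSE3494, and GSE2990, respectively. The NKI dataset was never deposited in GEO; it was compiled from the authors' supplementary materials.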

Together they cover 1123 patients, and the clinical information is fairly complete.
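
The mapping and the patient total can be cross-checked with a small lookup table. This is only a sketch: the per-dataset sample counts are copied from the ExpressionSet summaries printed further down in this post, and the variable names (`geo`, `n.samples`, `overview`) are made up for illustration.

```r
# Dataset -> GEO accession (NKI was never deposited in GEO, hence NA)
geo <- c(mainz = "GSE11121", transbig = "GSE7390", upp = "GSE3494",
         unt = "GSE2990", nki = NA)

# Sample counts as reported by the ExpressionSet summaries in this post
n.samples <- c(mainz = 200, transbig = 198, upp = 251, unt = 137, nki = 337)

# One row per dataset, counts aligned to the accession vector by name
overview <- data.frame(dataset = names(geo),
                       geo     = unname(geo),
                       n       = unname(n.samples[names(geo)]))
print(overview)

total <- sum(overview$n)
print(total)
```

The counts add up to the 1123 patients quoted above.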

GSE11121

The paper that published this dataset is: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008 Jul 1;68(13):5405-13. PMID: 18593943

The platform is GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. In the authors' words: "we analyzed the gene expression patterns of 200 tumors of patients who were not treated by systemic therapy after surgery using a discovery approach."

For these patients the study focused on the following biological themes:

  • the biological process of proliferation
  • steroid hormone receptor expression
  • B cell and T cell infiltration.

GSE7390

The paper that published this dataset is: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007 Jun 1;13(11):3207-14. PMID: 17545524

The platform is GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. In the authors' words: "Gene expression profiling of frozen samples from 198 N- systemically untreated patients was performed at the Bordet Institute, blinded to clinical data and independent of Veridex."

GSE3494

The paper that published this dataset is: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321

The platform is GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The samples are described as "freshly frozen breast tumors from a population-based cohort of 315 women representing 65% of all breast cancers resected in Uppsala County, Sweden, from January 1, 1987 to December 31, 1989."

The patient information collected is fairly complete:

INDEX (ID)    
p53 seq mut status (p53+=mutant; p53-=wt)    
p53 DLDA classifier result (0=wt-like, 1=mt-like)    
DLDA error (1=yes, 0=no)    
Elston histologic grade    
ER status    
PgR status    
age at diagnosis    
tumor size (mm)    
Lymph node status    
DSS TIME (Disease-Specific Survival Time in years)    
DSS EVENT (Disease-Specific Survival EVENT; 1=death from breast cancer, 0=alive or censored )

GSE2990

The paper that published this dataset is: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006 Feb 15;98(4):262-72. PMID: 16478745

The platform is GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. In the authors' words: "We analyzed microarray data from 189 invasive breast carcinomas and from three published gene expression datasets from breast carcinomas."

This series reuses data from GSE3494, and it carries the following note: The patients coming from Uppsala Hospital have been also used in other studies as in GSE3494. You can find the common set of patients in removing the abbreviation "UPP_" from the sample names and compare the results with the "INDEX (ID)" from the GSE3494 series.
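
The matching rule in that note can be sketched in a few lines of base R. The helper name `strip_upp` is hypothetical, and the two identifier vectors are toy values standing in for the real GSE2990 sample names and the GSE3494 "INDEX (ID)" column; they are not actual data.

```r
# Hypothetical helper: drop the "UPP_" prefix so that GSE2990 sample names
# can be compared with the "INDEX (ID)" values of GSE3494
strip_upp <- function(sample.names) sub("^UPP_", "", sample.names)

# Toy identifiers for illustration only (assumed, not taken from GEO)
gse2990.names <- c("UPP_103B41", "UPP_104B91", "OXFU_104", "KIU_89A64")
gse3494.index <- c("103B41", "9B52", "104B91")

# Keep only the Uppsala samples, strip the prefix, then intersect
upp.only <- grep("^UPP_", gse2990.names, value = TRUE)
shared   <- intersect(strip_upp(upp.only), gse3494.index)
print(shared)
```

Anchoring the pattern with `^` makes sure that only a leading "UPP_" is removed, so identifiers that merely contain the string elsewhere are left alone.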

Loading the data into R

Because the genefu package ships these 5 datasets already processed, they can be loaded straight into R for inspection.

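library(genefu)   # provides data(breastCancerData) with the *7g subsets
library(Biobase)  # ExpressionSet class and pData()
library(breastCancerMAINZ)
library(breastCancerTRANSBIG)
library(breastCancerUPP)
library(breastCancerUNT)
library(breastCancerNKI)

data(breastCancerData)
# Collect the five ExpressionSets in a named list
data.all <- list("transbig7g"=transbig7g, "unt7g"=unt7g, "upp7g"=upp7g,
                 "mainz7g"=mainz7g, "nki7g"=nki7g)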

Printing the list shows each dataset clearly (note that each of these ExpressionSets carries only 7 features):

> data.all
$transbig7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 198 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: VDXGUYU_4002 VDXGUYU_4008 ... VDXRHU_5240 (198 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 17545524 
Annotation: hgu133a 
$unt7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 137 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: OXFU_104 OXFU_1065 ... KIU_89A64 (137 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 16478745 
Annotation: hgu133ab 
$upp7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 251 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: UPP_103B41 UPP_104B91 ... UPP_9B52 (251 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 16141321 
Annotation: hgu133ab 
$mainz7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 200 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: MAINZ_BC6001 MAINZ_BC6002 ... MAINZ_BC6232 (200 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 18593943 
Annotation: hgu133a 
$nki7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 337 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: NM_000125 NM_004448 ... NM_004346 (7 total)
  fvarLabels: probe EntrezGene.ID ... Description (10 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: rosetta 

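Because the last dataset comes from an Agilent platform while all the others are Affymetrix arrays, it is a good set to practise batch-effect correction algorithms on.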

dn <- c("transbig", "unt", "upp", "mainz", "nki")
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")

For background, http://genomicsclass.github.io/book/pages/svacombat.html and https://www.biostars.org/p/196430/ make it easy to understand what batch correction is.
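
To make the idea concrete, here is a minimal sketch on simulated data. It uses the simplest possible adjustment, centring every gene within each batch. This is not ComBat and not the method of either reference above, only a toy stand-in. The function name `center_by_batch` and the simulated matrix are assumptions made for illustration.

```r
# Toy batch adjustment: centre every gene within each batch.
# x: genes x samples matrix; batch: one label per sample (column)
center_by_batch <- function(x, batch) {
  stopifnot(ncol(x) == length(batch))
  out <- x
  for (b in unique(batch)) {
    idx <- batch == b
    # Subtract each gene's mean computed over the samples of this batch
    out[, idx] <- x[, idx, drop = FALSE] - rowMeans(x[, idx, drop = FALSE])
  }
  out
}

set.seed(1)
# Simulated data: 3 genes, 4 "affy" samples and 4 "agilent" samples,
# with a constant offset of +5 added to the second platform
batch <- rep(c("affy", "agilent"), each = 4)
x <- matrix(rnorm(3 * 8), nrow = 3)
x[, batch == "agilent"] <- x[, batch == "agilent"] + 5

adj <- center_by_batch(x, batch)

# Per-gene difference between platform means, before and after adjustment
diff.before <- rowMeans(x[, batch == "agilent"]) - rowMeans(x[, batch == "affy"])
diff.after  <- rowMeans(adj[, batch == "agilent"]) - rowMeans(adj[, batch == "affy"])
print(round(diff.before, 3))
print(round(diff.after, 3))
```

Plain centring also wipes out any genuine biological difference between cohorts, which is why dedicated methods such as ComBat in the sva package model the batch term alongside covariates. With the real data the probes would first have to be mapped to common gene identifiers, since the NKI set uses different feature names from the Affymetrix sets.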

More importantly, the clinical information of all 5 datasets has been re-curated into a common set of columns:

> cinfo <- colnames(pData(mainz7g))
> cinfo
 [1] "samplename"    "dataset"       "series"        "id"           
 [5] "filename"      "size"          "age"           "er"           
 [9] "grade"         "pgr"           "her2"          "brca.mutation"
[13] "e.dmfs"        "t.dmfs"        "node"          "t.rfs"        
[17] "e.rfs"         "treatment"     "tissue"        "t.os"         
[21] "e.os"  

A really great collection of datasets!
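
Because the clinical columns share the same names across datasets, the tables can be stacked and summarised together. The sketch below uses two invented data frames as stand-ins for the real `pData()` tables and only a handful of the 21 columns. All values are made up.

```r
# Toy stand-ins for two clinical tables; values are invented
pd.a <- data.frame(dataset = "A", er = c(1, 0, 1, NA),
                   t.os = c(1200, 800, 3000, 150), e.os = c(0, 1, 0, 1))
pd.b <- data.frame(dataset = "B", er = c(0, 0, 1),
                   t.os = c(500, 2500, 1000), e.os = c(1, 0, 0))

# Stack the tables on the columns they have in common
common <- intersect(colnames(pd.a), colnames(pd.b))
clin   <- rbind(pd.a[, common], pd.b[, common])

# ER status by dataset, keeping missing values visible
er.tab <- table(clin$dataset, clin$er, useNA = "ifany")
print(er.tab)

# Number of overall-survival events (e.os == 1) per dataset
os.events <- tapply(clin$e.os, clin$dataset, sum, na.rm = TRUE)
print(os.events)
```

With the real objects, something along the lines of `do.call(rbind, lapply(data.all, pData))` should give the combined table, assuming all five `pData()` tables really do share identical columns, as the `cinfo` listing suggests.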

Originally published 2018-05-15 on the 生信技能树 WeChat public account.
