前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >GEO 数据挖掘-数据获得

GEO 数据挖掘-数据获得

作者头像
火星娃统计
发布2020-09-15 15:29:50
1.8K0
发布2020-09-15 15:29:50
举报
文章被收录于专栏:火星娃统计火星娃统计

GEO 数据挖掘-数据获得

1. 概述

NCBI Gene Expression Omnibus(GEO)是各种高通量实验数据的公共存储库,这些数据包括测量mRNA、基因组DNA和蛋白质丰度的单通道和双通道微阵列实验,以及非阵列技术,如基因表达序列分析(SAGE)、质谱蛋白质组数据和高通量测序数据。相比较TCGA数据库,因为数据是用户上传,所以更新较快

需要知道四个单词 1. GEO Platform (GPL) 芯片平台

  1. GEO Sample (GSM) 样本ID号
  2. GEO Series (GSE) study的ID号
  3. GEO Dataset (GDS) 数据集的ID号

2. 配置包

代码语言:javascript
复制
# 设置镜像,用于下载
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

# 安装
# install.packages("BiocManager")
# BiocManager::install("GEOquery")

3. 下载数据

代码语言:javascript
复制
# 加载
library(GEOquery)
#使用getGEO函数获得基因信息
gds <- getGEO("GDS507")# 下载
# 同时支持从本地获得
# gds <- getGEO(filename=system.file("路径",package="GEOquery"))
# 下载gsm数据
gsm <- getGEO("GSM11805")

4. GEOquery数据结构

GEOquery数据结构实际上有两种形式。第一种,包括GDS、GPL和GSM,第二种是GSE是,由GSM和GPL对象组合而成的复合数据类型。

4.1 GDS, GSM, 和 GPL

代码语言:javascript
复制
# 通过meta查看gsm文件,显示样本的信息
head(Meta(gsm))
代码语言:javascript
复制
## $channel_count
## [1] "1"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
代码语言:javascript
复制
# 查看前5行数值
Table(gsm)[1:5,]
代码语言:javascript
复制
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
代码语言:javascript
复制
# 查看列名
Columns(gsm)
代码语言:javascript
复制
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065

对于gpl的格式和gsm是一样的,另外GDS的信息要多于上述两种

代码语言:javascript
复制
Columns(gds)[1:5,1:3]
代码语言:javascript
复制
##     sample disease.state individual
## 1 GSM11815           RCC        035
## 2 GSM11832           RCC        023
## 3 GSM12069           RCC        001
## 4 GSM12083           RCC        005
## 5 GSM12101           RCC        011

4.2 GSE格式

GSE格式较为混乱,它可以表示任意平台的样本,gse有一个元数据部分,它没有DataTable。而是包含两个列表,可以使用GPLList和GSMList方法访问。

代码语言:javascript
复制
gse <- getGEO("GSE781",GSEMatrix=FALSE)
# 查看meta
head(Meta(gse))
代码语言:javascript
复制
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
##
## $contact_fax
## [1] "617-414-1646"

查看样本id

代码语言:javascript
复制
names(GSMList(gse))
代码语言:javascript
复制
##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"

获取第一个gsm样本

代码语言:javascript
复制
GSMList(gse)[[1]]
代码语言:javascript
复制
## An object of class "GSM"
## channel_count
## [1] "1"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "mlenburg@bu.edu"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM11nnn/GSM11805/suppl/GSM11805.CEL.gz"
## taxid_ch1
## [1] "9606"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
## 22278 more rows ...

查看平台信息

代码语言:javascript
复制
names(GPLList(gse))
代码语言:javascript
复制
## [1] "GPL96" "GPL97"

5. 转换为bioconductor ExpressionSets 和 limma MALists

5.1 获得GSE Series Matrix数据

代码语言:javascript
复制
# 获得数据集,格式为gse
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
# 显示具体信息
show(gse2553)
代码语言:javascript
复制
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
##   varLabels: title geo_accession ... data_row_count (30 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 12600 (12600 total)
##   fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 16230383
## Annotation: GPL1977

5.2 将GDS转换为ExpressionSet

代码语言:javascript
复制
eset <- GDS2eSet(gds,do.log2=TRUE)
eset
代码语言:javascript
复制
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
##   varLabels: sample disease.state individual description
##   varMetadata: labelDescription
## featureData
##   featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
##   fvarLabels: ID Gene title ... GO:Component ID (21 total)
##   fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14641932
## Annotation:

此时eset就是一个表达矩阵

5.3 转换 GDS 为 MAList

代码语言:javascript
复制
Meta(gds)$platform
代码语言:javascript
复制
## [1] "GPL97"
代码语言:javascript
复制
# 选择
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
MA <- GDS2MA(gds,GPL=gpl)
class(MA)
代码语言:javascript
复制
## [1] "MAList"
## attr(,"package")
## [1] "limma"

5.4 GSE转换为 ExpressionSet

查看平台

代码语言:javascript
复制
gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)
代码语言:javascript
复制
## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"

通过fileter,筛选GPL96平台的数据

代码语言:javascript
复制
gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
代码语言:javascript
复制
## [1] 17

查看测序数据

代码语言:javascript
复制
Table(gsmlist[[1]])[1:5,]
代码语言:javascript
复制
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

查看描述

代码语言:javascript
复制
Columns(gsmlist[[1]])[1:5,]
代码语言:javascript
复制
##        Column
## 1      ID_REF
## 2       VALUE
## 3    ABS_CALL
## NA       <NA>
## NA.1     <NA>
##                                                                     Description
## 1
## 2                            MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3    MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## NA                                                                         <NA>
## NA.1                                                                       <NA>
代码语言:javascript
复制
# 探针
probesets <- Table(GPLList(gse)[[1]])$ID
# 提取每个样本的value
# wo ye kan bu dong 大致是通过lapply函数对每一个
# gsm进行匹配,然后返回value,最后cbind
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
                                      {tab <- Table(x)
                                       mymatch <- match(probesets,tab$ID_REF)
                                       return(tab$VALUE[mymatch])
                                     }))
# 转换为数字
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,1:5]
代码语言:javascript
复制
##       GSM11805  GSM11814  GSM11823  GSM11830  GSM12067
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376
## [2,]  5.749534  7.908092  7.093814  7.514122  7.901470
## [3,]  7.066089  7.750205  7.244126  7.962896  7.337176
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568
## [5,]  6.195741  6.061776  6.565293  6.583459  6.877744

对data进一步处理

代码语言:javascript
复制
require(Biobase)
# 行列命名
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
# 转换为data.frame
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
代码语言:javascript
复制
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
##   varLabels: samples
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

获取平台的信息

代码语言:javascript
复制
# gpl97 <- getGEO('GPL97')
# Meta(gpl97)$title

通过上述代码可以获得这个平台的所有信息,并通过id等查看,下载比较费时,不做展示

结束语

关于GEO数据挖掘的第一步,获得数据GEOquery的学习已经完成,没有学过关于测序的知识,这些信息获得之后还是懵逼的,2020-7-10更新 love&peace

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2020-07-10,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 火星娃统计 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • GEO 数据挖掘-数据获得
    • 1. 概述
      • 2. 配置包
        • 3. 下载数据
          • 4. GEOquery数据结构
            • 4.1 GDS, GSM, 和 GPL
            • 4.2 GSE格式
          • 5. 转换为bioconductor ExpressionSets 和 limma MALists
            • 5.1 获得GSE Series Matrix数据
            • 5.2 将GDS转换为ExpressionSet
            • 5.3 转换 GDS 为 MAList
            • 5.4 GSE转换为 ExpressionSet
          • 获取平台的信息
            • 结束语
            领券
            问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档