Data Mining: Time to Update Your TCGA Data

Author: 生信探索
Published: 2023-02-25 10:16:54
This article is part of the column: 生信探索


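Since last year's update, TCGA has provided mRNA expression data in three formats (counts, TPM, and FPKM), together with the Ensembl gene ID, gene name, and gene type of every gene. That makes it worth refreshing any local copy of the data.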

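Install the required R packages

Code language: R
install.packages("tidyverse")
install.packages("arrow")
install.packages("data.table")
install.packages("magrittr")
install.packages("pacman")
# BiocManager relies on the remotes package when installing from GitHub.
install.packages("remotes")
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install TCGAbiolinks and its companion data package from their GitHub repositories.
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")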

TCGA data release information

Code language: R
rm(list = ls())
library(pacman)
p_load(magrittr, tidyverse, TCGAbiolinks, data.table, arrow)
TCGAbiolinks::getGDCInfo()
# $commit
# [1] "4dd3680528a19ed33cfc83c7d049426c97bb903b"
# $data_release
# [1] "Data Release 36.0 - December 12, 2022"
# $status
# [1] "OK"
# $tag
# [1] "3.0.0"
# $version
# [1] 1
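
For provenance it can help to store the release number next to the downloaded files. The sketch below parses it from the `data_release` string shown above. The `info` list is a hand-copied stand-in for the `getGDCInfo()` output, and the regular expression assumes the string keeps the "Data Release <number> - <date>" layout.

```r
# Stand-in for the list returned by TCGAbiolinks::getGDCInfo() (values copied from the output above).
info <- list(
  data_release = "Data Release 36.0 - December 12, 2022",
  status = "OK"
)

# Capture the numeric part that follows "Data Release ".
release_number <- as.numeric(
  sub("^Data Release ([0-9.]+).*$", "\\1", info$data_release)
)

# Capture everything after the " - " separator as the release date text.
release_date <- sub("^.* - ", "", info$data_release)
```

With a live connection, `info` could be replaced by the real `getGDCInfo()` result, provided the field names match those printed above.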

Create the output folders

Code language: Shell
mkdir mRNA miRNA SNV CNV Protein
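
The same folders can also be created from within R, which keeps the whole workflow in one session. This is a minimal sketch: the folder names come from the shell command above, while `base_dir` is an assumption of this example (a temporary directory here; `"."` would target the working directory).

```r
# Folder names taken from the shell command above.
folders <- c("mRNA", "miRNA", "SNV", "CNV", "Protein")

# Assumption for this sketch: create the folders under a temporary directory.
base_dir <- tempdir()

for (folder in folders) {
  # showWarnings = FALSE keeps the call quiet when the folder already exists.
  dir.create(file.path(base_dir, folder), showWarnings = FALSE, recursive = TRUE)
}

# TRUE for every folder that now exists.
created <- dir.exists(file.path(base_dir, folders))
```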

Projects to download

Code language: R
# Keep only the TCGA projects and strip the "TCGA-" prefix from their IDs.
gdc_projects <- TCGAbiolinks::getGDCprojects() %>%
  pull(id) %>%
  grep(pattern = "^TCGA", x = ., value = TRUE) %>%
  str_remove("TCGA-")
gdc_projects
# [1] "CHOL" "LIHC" "DLBC" "BLCA" "ACC"  "CESC" "PCPG" "PAAD" "MESO" "TGCT"
# [11] "KIRP" "UVM"  "UCS"  "THYM" "COAD" "ESCA" "GBM"  "KICH" "HNSC" "PRAD"
# [21] "OV"   "LUSC" "LAML" "LGG"  "SARC" "BRCA" "READ" "LUAD" "STAD" "THCA"
# [31] "KIRC" "SKCM" "UCEC"
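
The filtering step can be tried offline on a small hand-written vector. The sketch below uses base R only. The `ids` vector is illustrative sample input, not the real `getGDCprojects()` output, which is assumed here to contain non-TCGA projects as well.

```r
# Illustrative project IDs; a real query would return many more.
ids <- c("TCGA-CHOL", "TARGET-AML", "TCGA-LIHC", "CPTAC-3")

# Keep the IDs that start with "TCGA".
tcga_ids <- grep("^TCGA", ids, value = TRUE)

# Drop the "TCGA-" prefix so only the cancer code remains.
codes <- sub("^TCGA-", "", tcga_ids)
codes
```

The resulting short codes are what the download functions later paste back together with the "TCGA-" prefix.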

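Download the mRNA expression data

Code language: R
downRNA <- function(cancer) {
  # Query the STAR - Counts gene expression files of one TCGA project.
  # Newer TCGAbiolinks releases may no longer accept `legacy`;
  # drop that argument if GDCquery reports it as unused.
  query <- TCGAbiolinks::GDCquery(
    project = paste0("TCGA-", cancer),
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts",
    legacy = FALSE
  )
  TCGAbiolinks::GDCdownload(query, files.per.chunk = 50)
  data <- TCGAbiolinks::GDCprepare(query, summarizedExperiment = FALSE)
  # Keep only rows whose gene_id starts with "EN" (Ensembl IDs).
  data %<>% dplyr::filter(str_detect(gene_id, "^EN"))
  dt <- data %>% dplyr::select(gene_id, gene_name, gene_type, starts_with("unstranded"), starts_with("tpm"), starts_with("fpkm_unstranded"))
  # Rename: unstranded_* -> count_*, tpm_unstranded_* -> tpm_*, fpkm_unstranded_* -> fpkm_*.
  colnames(dt) %<>% str_remove("_unstranded") %>% str_replace("unstranded", "count")
  # Write a zstd-compressed Arrow IPC file into the mRNA folder.
  arrow::write_ipc_file(dt, str_glue("mRNA/TCGA_{cancer}_mRNA.arrow"), compression = "zstd", compression_level = 1)
  invisible(NULL)
}
walk(gdc_projects, downRNA)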
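
The column renaming inside `downRNA` is easier to follow on a toy example. The sketch below reproduces the two string operations in base R (`sub` in place of `str_remove` and `str_replace`). The sample suffix `S1` and the tiny data frame are made up for illustration; real column names carry sample barcodes.

```r
# Toy column names mimicking the selected columns for one sample "S1".
cols <- c(
  "gene_id", "gene_name", "gene_type",
  "unstranded_S1", "tpm_unstranded_S1", "fpkm_unstranded_S1"
)

# Step 1: remove the first "_unstranded" (affects the tpm_ and fpkm_ columns).
step1 <- sub("_unstranded", "", cols, fixed = TRUE)

# Step 2: rename the remaining "unstranded" prefix to "count".
renamed <- sub("unstranded", "count", step1, fixed = TRUE)

# Toy table with the renamed columns, to show how one measure can be pulled out.
dt <- data.frame(
  "ENSG_A", "GeneA", "protein_coding", 10L, 1.5, 0.7,
  stringsAsFactors = FALSE
)
colnames(dt) <- renamed
count_cols <- grep("^count_", colnames(dt), value = TRUE)
```

After a real download, the saved table could be loaded again with `arrow::read_ipc_file()` and split into count, TPM, and FPKM matrices using the same prefixes.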

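A function for downloading the other data types

Code language: R
# Generic downloader: query, download, prepare, and save one data type for one project.
# The FALSE defaults are passed straight to GDCquery, which appears to treat
# FALSE as "not specified" for these filters.
download <- function(
    cancer,
    folder_name,
    data_category = FALSE,
    data_type = FALSE,
    workflow_type = FALSE,
    experimental_strategy = FALSE,
    legacy = FALSE) {
  # Newer TCGAbiolinks releases may no longer accept `legacy`;
  # drop that argument if GDCquery reports it as unused.
  query <- TCGAbiolinks::GDCquery(
    project = paste0("TCGA-", cancer),
    data.category = data_category,
    data.type = data_type,
    experimental.strategy = experimental_strategy,
    workflow.type = workflow_type,
    legacy = legacy
  )
  TCGAbiolinks::GDCdownload(query, files.per.chunk = 50)
  # Write a zstd-compressed Arrow IPC file into the folder for this data type.
  TCGAbiolinks::GDCprepare(query = query, summarizedExperiment = FALSE) %>%
    arrow::write_ipc_file(., str_glue("{folder_name}/TCGA_{cancer}_{folder_name}.arrow"), compression = "zstd", compression_level = 1)
}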
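
Because `walk` stops at the first error, a single project that fails (for example a network timeout, or a query that finds no files) would interrupt the whole run. One possible safeguard is to wrap the downloader so that errors are recorded instead of raised. The sketch below is an assumption of this example, not part of the original workflow: `safely_run` and `fake_download` are hypothetical names, and `fake_download` only imitates the `download` function so the example runs offline.

```r
# Wrap a per-project function so that an error becomes a status string.
safely_run <- function(f) {
  function(cancer, ...) {
    tryCatch(
      {
        f(cancer, ...)
        "ok"
      },
      error = function(e) paste("failed:", conditionMessage(e))
    )
  }
}

# Hypothetical stand-in for download(); it fails for one arbitrarily chosen project.
fake_download <- function(cancer, folder_name) {
  if (cancer == "LAML") stop("no files found")
  invisible(NULL)
}

# Run every project and collect one status per project, named by project code.
status <- vapply(
  c("CHOL", "LAML", "LIHC"),
  safely_run(fake_download),
  character(1),
  folder_name = "Protein"
)
```

In the real workflow, `safely_run(download)` could be passed to `walk` or `vapply` in the same way, and the projects whose status is not "ok" could be retried afterwards.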

Download the microRNA expression data

Code language: R
walk(
  gdc_projects, download,
  folder_name = "miRNA",
  data_category = "Transcriptome Profiling",
  data_type = "miRNA Expression Quantification",
  experimental_strategy = "miRNA-Seq"
)

Download the SNV data

Code language: R
walk(
  gdc_projects, download,
  folder_name = "SNV",
  data_category = "Simple Nucleotide Variation",
  data_type = "Masked Somatic Mutation"
)

Download the CNV data

Code language: R
walk(
  gdc_projects, download,
  folder_name = "CNV",
  data_category = "Copy Number Variation",
  data_type = "Masked Copy Number Segment"
)

Download the protein expression data

Code language: R
walk(
  gdc_projects, download,
  folder_name = "Protein",
  data_category = "Proteome Profiling",
  data_type = "Protein Expression Quantification"
)
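
After all downloads have run, it may be useful to check which output files are actually present. The sketch below rebuilds the expected paths from the naming pattern used by the functions above (`{folder}/TCGA_{cancer}_{folder}.arrow`) and lists the ones that are missing. `expected_files` is a hypothetical helper, and the two projects and two folders are just a small sample.

```r
# Build the expected output path for every project/folder combination.
expected_files <- function(projects, folders) {
  grid <- expand.grid(cancer = projects, folder = folders, stringsAsFactors = FALSE)
  file.path(grid$folder, sprintf("TCGA_%s_%s.arrow", grid$cancer, grid$folder))
}

paths <- expected_files(c("CHOL", "LIHC"), c("mRNA", "SNV"))

# Paths that do not exist relative to the current working directory.
missing_paths <- paths[!file.exists(paths)]
```

Passing `gdc_projects` and all five folder names would give the full checklist, and the missing entries indicate which project and data type to run again.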

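Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to request removal.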

