GTEx,The Genotype-Tissue Expression (GTEx) project,首次被提出来是2013年,上百位科学家联名在Nature Genetics杂志发表的文章首次介绍了“基因型-组织表达工程”,并成立了“基因型-组织表达研究联盟”。The GTEx has catalogued gene expression in >9,000 samples across 53 tissues from 544 healthy individuals.
TCGA,The cancer genome altas,https://cancergenome.nih.gov/ ,是由National Cancer Institute ( NCI, 美国国家癌症研究所) 和 National Human Genome Research Institute (NHGRI, 国家人类基因组研究所) 合作建立的癌症研究项目,通过收集整理癌症相关的各种组学数据。The Cancer Genome Atlas (TCGA) has quantified gene expression levels in >12000 samples from >33 cancer types.
其实是没办法简单的回答是否可以整合TCGA和GTEx数据库,或者说该如何结合,这背后的统计学略微有点复杂,不仅仅是批次效应。发表在Sci Data. 2018; 的文章:Unifying cancer and normal RNA sequencing data from different sources 就比较详细的说明了TCGA和GTEx数据库的转录组数据的天然差异:
sequencing platform and chemistry, personnel, details in the analysis pipeline, etc
基因表达量范围:4-10 (log2 of normalized_count) for TCGA, and 0-4 (log2 of RPKM) for GTEx
最近一篇发表在SR,17 February 2020 的文章:Variability in estimated gene expression among commonly used RNA-seq pipelines 比较了常见转录组测序数据分析流程对定量拿到的表达矩阵的影响:
We compared gene expression values from common samples (4,800 tumor samples from TCGA and 1,890 normal-tissue samples from GTEx) processed by the pipelines to understand how gene expression quantification is impacted by differences in data processing.
很多简陋的数据挖掘,比如发表在PeerJ的 BIOINFORMATICS AND GENOMICS杂志的文章:Identification of four hub genes associated with adrenocortical carcinoma progression by WGCNA 也会涉及到TCGA数据库和GTEx的整合。
首先下载TCGA和GTEx数据库的TPM表达矩阵:
Gene transcripts per million (TPM) data were downloaded from the UCSC Xena database, which included ACC (The Cancer Genome Atlas, n = 77) and normal samples (Genotype Tissue Expression, n = 128).
然后差异分析流程是:
Of the 60,498 genes in each sample, we removed genes with a mean TPM ≤ 2.5 (>1 is a common cutoff for determining if an isoform is expressed or not in the cancer and normal samples and thus retained 13,987 genes.
For those genes in the samples that showed significant changes, we used analysis of variance (ANOVA) in R to determine the variance in genes between the two groups. ANOVA is a collection of statistical models useful for DEG analysis.
We obtained 2,953 significant DEGs (Table S2) in ACC with a p < 0.001 and |log2 (fold-change)| > 1 cutoff.
差异分析结果是:1,181 up-regulated and 1,772 down-regulated genes.