不同物种的的10x单细胞转录组参考数据文件构建

生信技能树

发布于 2022-03-03 14:09:59

1.1K0

发布于 2022-03-03 14:09:59

文章被收录于专栏：生信技能树

除夕夜我们介绍了 一个10x单细胞转录组项目从fastq到细胞亚群，演示的10x单细胞转录组项目是人类数据，但是也有不少朋友表示他们的物种比较特殊，需要额外构建index而不能从10x的官网下载，我们也演示一下，争取可以帮助大家。

10x单细胞转录组数据分析所需要的参考数据文件主要是基因组的fasta文件和基因注释gtf文件，其官网有详细的例子：https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

官网有两个例子：

Reference build instructions: Rhesus macaque
Reference build instructions: Norwegian rat

下面我们以大鼠和猪为例子来演示：

咦 rat ftp.ensembl.org 为关键词组合，可以进入不同物种的官网，比如这个大鼠：https://asia.ensembl.org/Rattus_norvegicus/Info/Index

可以看到其基因组的fasta文件和基因注释gtf文件的不同下载地址：

http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/
http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/

规律非常明显，对猪来说也是如此，不同物种有不同的拉丁名而已。前单细胞转录组以10X公司为主流，我们也是在单细胞天地公众号详细介绍了cellranger全部使用细节及流程，大家可以自行前往学习，如下：

但是这个两年前的系列笔记是基于V2,V3版本的cellranger，在2020的7月我看到了其更新到了V4，也里面写了一个总结，见：cellranger更新到4啦（全新使用教程）

如果是从头开始构建index，每个物种的两个文件（基因组的fasta文件和基因注释gtf文件）都需要下载，代码如下所示：

nohup wget http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz & 
nohup wget http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.105.gtf.gz & 
# 两个gz文件都需要 解压， 命令是 gunzip

如果物种是猪，下载基因组及注释文件的代码如下所示：

wget -c http://ftp.ensembl.org/pub/release-105/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
wget -c http://ftp.ensembl.org/pub/release-105/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.105.gtf.gz
wget -c http://ftp.ensembl.org/pub/release-105/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.105.chr.gtf.gz

gunzip Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
gunzip Sus_scrofa.Sscrofa11.1.105.gtf.gz
gunzip Sus_scrofa.Sscrofa11.1.105.chr.gtf.gz

其它物种类似，都是去 ftp.ensembl.org 搜索各个物种对应的两个文件（基因组的fasta文件和基因注释gtf文件），下载并且解压即可。

首先过滤gtf文件

参考官网的的标准代码即可，如下所示：

cellranger=/home/data/bylin/cellranger-6.1.2/cellranger
$cellranger mkgtf \
  Rattus_norvegicus.mRatBN7.2.105.gtf Rattus_norvegicus.mRatBN7.2.105.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lincRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene

可以看到这个过滤步骤前后其实gtf文件变化并不多。

然后构建参考基因组index

标准代码即可，需要使用上一步过滤好的gtf文件，加上前面下载的参考基因组fasta文件哦，如下所示：

$cellranger mkref \
  --genome=mRatBN7 \
  --fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa \
  --genes=Rattus_norvegicus.mRatBN7.2.105.filtered.gtf \
  --ref-version=1.0.0

如下所示的log日志：

['/home/data/bylin/cellranger-6.1.2/bin/rna/mkref', '--genome=mRatBN7', '--fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa', '--genes=Rattus_norvegicus.mRatBN7.2.105.filtered.gtf', '--ref-version=1.0.0']
Creating new reference folder at /home/data/x9/zhao/rat_ref/mRatBN7
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Feb 01 09:40:54 ..... started STAR run
Feb 01 09:40:54 ... starting to generate Genome files
Feb 01 09:42:08 ... starting to sort Suffix Array. This may take a long time...
Feb 01 09:42:14 ... sorting Suffix Array chunks and saving them to disk... 
Feb 01 10:17:13 ... completed Suffix Array index
Feb 01 10:17:13 ..... processing annotations GTF
Feb 01 10:17:23 ..... inserting junctions into the genome indices 
Feb 01 10:24:00 ... writing Genome to disk ...
Feb 01 10:24:03 ... writing Suffix Array to disk ...
Feb 01 10:24:10 ... writing SAindex to disk
Feb 01 10:24:12 ..... finished successfully
...done.

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

>>> Reference successfully created! <<<

You can now specify this reference on the command line:
cellranger --transcriptome=/home/data/x9/zhao/rat_ref/mRatBN7

是不是很简单啊，下次我们跑cellranger流程的时候就可以直接使用 /home/data/x9/zhao/rat_ref/mRatBN7 这个index的路径地址哦。参考前面的 一个10x单细胞转录组项目从fastq到细胞亚群，

查看构建好的index

如下所示：

 tree -h /home/data/x9/zhao/rat_ref/mRatBN7
/home/data/x9/zhao/rat_ref/mRatBN7
├── [4.0K]  fasta
│   ├── [2.5G]  genome.fa
│   └── [6.5K]  genome.fa.fai
├── [4.0K]  genes
│   └── [ 19M]  genes.gtf.gz
├── [ 456]  reference.json
└── [4.0K]  star
    ├── [1.1K]  chrLength.txt
    ├── [3.6K]  chrNameLength.txt
    ├── [2.5K]  chrName.txt
    ├── [1.9K]  chrStart.txt
    ├── [ 17M]  exonGeTrInfo.tab
    ├── [7.5M]  exonInfo.tab
    ├── [970K]  geneInfo.tab
    ├── [2.5G]  Genome
    ├── [ 815]  genomeParameters.txt
    ├── [7.1G]  SA
    ├── [1.5G]  SAindex
    ├── [6.2M]  sjdbInfo.txt
    ├── [6.1M]  sjdbList.fromGTF.out.tab
    ├── [4.9M]  sjdbList.out.tab
    └── [3.0M]  transcriptInfo.tab

3 directories, 19 files

大家也可以去看看自己的物种index文件构建情况哦！

如果是猪，全部的流程如下所示:

bin=/home/x10/pipeline/cellranger-6.0.0/bin/cellranger
$bin mkref \
--genome=Sus_scrofa_genome \
--fasta=Sus_scrofa.Sscrofa11.1.dna.toplevel.fa \
--genes=Sus_scrofa.Sscrofa11.1.105.chr.gtf

输出结果logo日志如下所示：

['/home/data/x10/pipeline/cellranger-6.0.0/bin/rna/mkref', '--genome=Sus_scrofa_genome', '--fasta=Sus_scrofa.Sscrofa11.1.dna.toplevel.fa', '--genes=Sus_scrofa.Sscrofa11.1.105.chr.gtf']
Jan 23 23:09:57 ..... started STAR run
Jan 23 23:09:57 ... starting to generate Genome files
Jan 23 23:11:08 ... starting to sort Suffix Array. This may take a long time...
Jan 23 23:11:13 ... sorting Suffix Array chunks and saving them to disk...
Jan 23 23:36:05 ... loading chunks from disk, packing SA...
Jan 23 23:36:29 ... finished generating suffix array
Jan 23 23:36:29 ... generating Suffix Array index
Jan 23 23:38:43 ... completed Suffix Array index
Jan 23 23:38:43 ..... processing annotations GTF
Jan 23 23:38:55 ..... inserting junctions into the genome indices
Jan 23 23:45:22 ... writing Genome to disk ...
Jan 23 23:45:24 ... writing Suffix Array to disk ...
Jan 23 23:45:31 ... writing SAindex to disk
Jan 23 23:45:33 ..... finished successfully
Creating new reference folder at /home/data/x10/run/pig/database/Sus_scrofa_genome
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

>>> Reference successfully created! <<<

You can now specify this reference on the command line:
cellranger --transcriptome=/home/data/x10/run/pig/database/Sus_scrofa_genome ...

大同小异的，所以这个数据分析流程主要是取决于物种的两个文件（基因组的fasta文件和基因注释gtf文件），如果物种都没有比较好的注释，就麻烦了。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2022-02-02，如有侵权请联系 cloudcommunity@tencent.com 删除

ftp