前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >单细胞转录组分析RNA velocity-STAR solo

单细胞转录组分析RNA velocity-STAR solo

原创
作者头像
旋转木马
发布2023-12-01 14:34:45
8730
发布2023-12-01 14:34:45
举报
文章被收录于专栏:生信小白的专栏

STAR较CellRanger有着更快的运行速度和更广泛的运用场景。单细胞转录组的比对可以通过STAR-solo来实现,在solo Features 可以同时获取Gene expression和RNA velocity 信息,帮助我们进行拟时序分析。

本文将记录从单细胞转录组Raw data(fastq.files)到用于下游分析的基因表达矩阵和RNA velocity的剪切信息

1.0 构建Index & 下载whitelist

构建基因检索是每个比对软件的第一步

STAR构建检索同样需要下载特定物种的FASTQ基因组文件和GTF注释文件,这里以mm10为例

代码语言:shell
复制
STAR --runThreadN 6 \ #设置线程
--runMode genomeGenerate \ #构建参考index
--genomeDir /data/users/Alignment/index/mm10 \ #index输出路径
--genomeFastaFiles /data/users/Alignment/mm10/mm10.fa \ #参考基因组Fasta文件路径
--sjdbGTFfile /data/users/Alignment/mm10/mm10.gtf 、 #参考基因组注释GTF文件路径
--sjdbOverhang 149 #reads长度最大值减一,默认100,我已知我的reads最大长度150 该参数也可以不设置

构建index后就可以放心比对了 哈哈!

10X - Whitelist 获取

whitelist 是10X 单细胞测序中barcodes 的白名单需要在后续比对中添加,以免识别到异常的细胞,根据不同版本的建库检测方法有着不同的白名单,要明确对应,切记!!

10X白名单相关参数
10X白名单相关参数

UMILEN 和 STR也是非常重要的参数后续进行比对时,也要一一设定。

1. whitelist 可以在10XGENOMICSCellRanger下载

2.在CellRanger的本地文件中也可以找到whitelist文件

代码语言:shell
复制
cd /software/cellranger-7.1.0/lib/python/cellranger/barcodes
ll *.gz
cp *.gz STAR/BC
gunzip *.gz
white list
white list

2.0 STAR 比对

STAR 较Cell ranger有着更加个性化的参数,根据需要自行调整

代码语言:shell
复制
# 官网例子
# Set the number of threads (CPUs) to be used
STAR --runThreadN $CPUS 
# Specify the directory containing the indexed genome files
--genomeDir $REF 
# Specify the input read files for mapping
--readFilesIn $R1 $R2 
# Set directory permissions for output files (All_RWX) and specify file formats (GZIP, BAM)
--runDirPerm All_RWX $GZIP $BAM 
# Enable solo (single-cell) analysis options
--soloBarcodeMate 1 
--clip5pNbases 39 0 
--soloType CB_UMI_Simple 
--soloCBwhitelist $BC 
--soloCBstart 1 
--soloCBlen $CBLEN 
--soloUMIstart $((CBLEN+1)) 
--soloUMIlen $UMILEN 
--soloStrand Forward 
--soloUMIdedup 1MM_CR 
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts 
--soloUMIfiltering MultiGeneUMI_CR 
--soloCellFilter EmptyDrops_CR 
# Set minimum alignment score threshold
--outFilterScoreMin 30 
# Specify the features to be included in the output (Gene, GeneFull, Velocyto)
--soloFeatures Gene Velocyto 
# Specify output file names for solo analysis
--soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx 
# Enable handling of multiple mappers using the Expectation-Maximization (EM) algorithm
--soloMultiMappers EM 
# Specify how unmapped reads should be output (Fastx format)
--outReadsUnmapped Fastx
代码语言:shell
复制
STAR --runThreadN 10 --soloType CB_UMI_Simple --soloFeatures Gene Velocyto \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 12 \
--soloBarcodeReadLength 0 \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM CR CY UR UY \
--soloOutFileNames Solo.out/ genes.tsv barcodes.tsv matrix.mtx \
--outFileNamePrefix ./out \
--genomeDir index/mm10 \
--soloCBwhitelist /software/STAR/BC/3M-february-2018.txt \
--readFilesIn CRRXXX_r2.fastq.gz CRRXXX_f1.fastq.gz

得到结果文件

STAR 结果文件
STAR 结果文件

3.0基于STAR结果进行Velocity分析

基于Python scVelo进行拟时序分析

Velocity的信息存储在Velocyto/raw/matrix.mtx中需要提取整合

代码语言:python
代码运行次数:0
复制
# Load required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import sparse
import anndata
import scanpy as sc
import scvelo as scv
import diopy
scv.set_figure_params()
os.chdir('/path/to/work/dir')
def buildAnndataFromStar(path):
    """Generate an anndata object from the STAR aligner output folder"""
    path=path
    # Load Read Counts
    X = sc.read_mtx(path+'Gene/raw/matrix.mtx')
    
    # Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
    X = X.X.transpose()
    
    # This matrix is organized as a sparse matrix with Row, Cols and 3 values columns for 
    # Spliced, Unspliced and Ambigous reads
    mtx = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=3, delimiter=' ')
    # Extract sparse matrix shape informations from the third row
    shape = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)

    # Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
    # Subract -1 to rows and cols index because csr_matrix expects a 0 based index
    # Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
    spliced = sparse.csr_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    unspliced = sparse.csr_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    ambiguous = sparse.csr_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
    
    # Load Genes and Cells identifiers
    obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
                  header = None, index_col = 0)
    # Remove index column name to make it compliant with the anndata format
    obs.index.name = None
    
    var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
                  names = ('gene_ids', 'feature_types'), index_col = 1)
    
    # Build AnnData object to be used with ScanPy and ScVelo
    adata = anndata.AnnData(X = X, obs = obs, var = var,
                        layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
    adata.var_names_make_unique()
    
    # Subset Cells based on STAR filtering 根据自己需要调整该参数
    # selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
    # adata = adata[selected_barcodes[0]]
    
    return adata.copy()
代码语言:python
代码运行次数:0
复制
adata = buildAnndataFromStar('/sc-seq/CRRXXX/outSolo.out/')
# QC 步骤省略一万步
diopy.output.write_h5(adata,'data1.h5') #输出adata对象
scv.pl.proportions(adata)
spliced情况
spliced情况

就可以愉快的分析了 哈哈!

https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md(STAR-solo github) https://zhuanlan.zhihu.com/p/362727395(构建STATR-index)

https://github.com/alexdobin/STAR/issues/774(Velocity整合)

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 本文将记录从单细胞转录组Raw data(fastq.files)到用于下游分析的基因表达矩阵和RNA velocity的剪切信息
  • 1.0 构建Index & 下载whitelist
    • 构建基因检索是每个比对软件的第一步
      • 10X - Whitelist 获取
      • 2.0 STAR 比对
      • 3.0基于STAR结果进行Velocity分析
      领券
      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档