前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >tsRNA分类分析:seqac包

tsRNA分类分析:seqac包

作者头像
生信菜鸟团
发布2023-09-08 12:26:57
3611
发布2023-09-08 12:26:57
举报
文章被收录于专栏:生信菜鸟团

tRNA的二级结构如下:

image-20230903154931207

根据其二级结构,tsRNA分类如下:

早期研究报道,tiRNAs通过血管生成素(ANG)切割成熟tRNA产生,而tRFs来源于Dicer或ANG切割成熟tRNA或tRNA前体。根据成熟或前体tRNA转录本中的裂解位点,可以分为以下几类:

  • tRF-1:来源于前体tRNA,在3' trailer部由RNase Z or ELAC2酶裂解产生。
  • tRF-3:来自成熟体tRNA的3'端
  • tRF-5:来自成熟体tRNA的5'端
  • tRF-i:来自成熟体tRNA的中间区域
  • tiRNAs:are generated by specific cleavage by angiogenin in the anticodon loops of mature tRNAs

image-20230619222644080

今天主要给大家介绍一个软件,可以根据tsRNA的序列以及tRNA的二级结构文件来鉴定tsRNA的分类,软件为Seqpac。

Seqpac主要用于small RNA Sequence数据分析,教程最新版链接:

https://www.bioconductor.org/packages/release/workflows/vignettes/seqpac/inst/doc/seqpac_-_A_guide_to_sRNA_analysis_using_sequence-based_counts.html

软件分析输入为fq文件,seqpac workflow的核心对象为:

The PAC object:这个对象有三个部分,Phenotypic information (P), Annotations (A) and Counts (C)。

这个PAC对象有两种格式, S3 list或者S4,并且可以使用函数进行互转,as(pac-object, "list")将S4转为S3, as.PAC(pac-object)将S3转为S4。

The Targeting objects:用于访问PAC对象内的特定数据子集。

seqpac的分析步骤主要为四步:

  • Generate a PAC object from fastq-files
  • Preprocess the PAC object
  • Annotate the PAC object
  • Analyze the PAC object.

示例数据

seqpac包中自带了一个示例数据,6 sRNA fastq文件。

image-20230823222638876

样本组织类型为果蝇胚胎,测序平台为Illumina NextSeq 500,测序读长为75个碱基。这里抽取原始fq的部分read进行演示。

安装包:

代码语言:javascript
复制
## Installation
devtools::install_github("Danis102/seqpac", upgrade="never", 
 build_manual=TRUE, build_vignettes=TRUE)

首先快速一览整个包的分析步骤:

代码语言:javascript
复制
# Setup
 library(seqpac)
 
 sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
 fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
                 full.names = TRUE)
 ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
                         package = "seqpac", mustWork = TRUE)
 
 # Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
 count_list  <- make_counts(fastq, plot = FALSE, 
                             parse="default_neb", trimming="seqpac")
 pac <- make_PAC(count_list$counts)
 pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
 
 # Preprocess PAC-object and creat means
 pac <- PAC_norm(pac)                        # Default normalization is cpm
 pac <- PAC_filter(pac, size=c(15,60))       # Here, filter on sequence length
 pac <- PAC_summary(pac, norm = "cpm", type = "means", 
                    pheno_target=list("groups", unique(pheno(pac)$groups)))
 
 # Quickly align (annotate) and plot PAC-object
 map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
 plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
 plts[[13]]

详细步骤

1.首先加载fq数据

代码语言:javascript
复制
# Use the ShortRead-package to randomly sample a fastq file
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE, full.names = TRUE)
fq
[1] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq1.fastq.gz"
[2] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq2.fastq.gz"
[3] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq3.fastq.gz"
[4] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq4.fastq.gz"
[5] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq5.fastq.gz"
[6] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq6.fastq.gz"

2.生成count矩阵

使用make_counts函数

代码语言:javascript
复制
count_list <- make_counts(input=fq, trimming="seqpac", threads=16, parse="default_neb")

# 查看count_list的结构
str(count_list, max.level = 1)

List of 3
 $ counts         :'data.frame': 8300 obs. of  6 variables:
 $ progress_report:'data.frame': 6 obs. of  19 variables:
 $ evidence_plots :List of 2

3.构造样本表型矩阵

使用make_pheno函数

代码语言:javascript
复制
Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
                    Treatment=c(rep("heat", times=1), 
                                rep("control", times=2)),
                    Batch=rep(c("1", "2", "3"), times=1))

pheno <- make_pheno(pheno=pheno_fl, 
                    progress_report=count_list$progress_report, 
                    counts=count_list$counts)

每个样本的表型为一个数据框:

image-20230903161349444

4.构造PAC对象

代码语言:javascript
复制
###--------------------------------------------------------------------- 
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master

PAC object with: 
   6 samples
   8300 sequences
   mean total counts: 4410 (min:4346/max:4476)
   best sequence: 230 mean counts
   worst sequence: 0 mean counts

至此,数据的分析对象已经构建好了,PAC对象,后续的分析基本上都是对这个PAC进行操作。

tRNA class analysis

我们本次主要关注tRNA class analysis,此包的其他分析可以见分析教程。

首先加载包中内置的PAC对象

代码语言:javascript
复制
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))
                 
# 查看加载的对象
pac

比对:

这个包内置比对工具为bwa,此处mismatches为0,还讲序列上下增加了3bp,参考序列为tRNA序列,来自GtRNAdb数据库:http://gtrnadb.ucsc.edu/。

代码语言:javascript
复制
###---------------------------------------------------------------------
## tRNA analysis in Seqpac 

# First create an annotation blank PAC with group means
anno(pac) <- anno(pac)[,1, drop=FALSE]
pac_trna <- PAC_summary(pac, norm = "cpm", type = "means", 
                        pheno_target=list("stage"), merge_pac = TRUE)

# Then reannotate only tRNA using the PAC_mapper function
ref <- "/some/path/to/trna/fasta/recerence.fa"  
# or use the provided fasta
ref <- system.file("extdata/trna2", "tRNA2.fa", 
                          package = "seqpac", mustWork = TRUE)          

map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN", 
                         mismatches=0, threads=6, report_string=TRUE,
                         override=TRUE)

比对完后的map_object对象是一个list对象,如下:

image-20230903174938936

每一个tRNA上比对到的tsRNA如下,每一行为一个tsRNA的序列:

image-20230903175956563

先看看比对结果,PAC_covplot可以绘制每一个tRNA上的短片段tsRNA的比对情况:

代码语言:javascript
复制
# Hint: By adding N_up ad N_down you can make sure that modified fragments (like
# 3' -CAA in mature tRNA are included).

## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis:
# (OBS! Long reference will not show well.)                     
cov_tRNA <- PAC_covplot(pac_trna, map_object, 
                        summary_target = list("cpmMeans_stage"), 
                        xseq=TRUE, style="line", 
                        color=c("red", "black", "blue"))

选一个tRNA查看,这里对每类stage进行的绘制:

image-20230903175253710

也可以指定某个tRNA绘制查看:

代码语言:javascript
复制
# Targeting a single tRNA using a summary data.frame 
PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"), 
            map_target="tRNA-Lys-CTT-1-9")

结果如下:

image-20230903175431933

此外,还可以筛选出tRNA上比对read最多的进行展示:

代码语言:javascript
复制
# Find tRNAs with many fragments
# 1st extract number of rows from each alignment 
n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])}))
# The test data is highly down an filterd sampled, but still some with tRNA have
# a few alignment
table(n_tRFs)
names(map_object)[n_tRFs>2]
# Lets select a few of them and plot them
selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)]
cov_plt <- PAC_covplot(pac_trna, map=map_object, 
                       summary_target= list("cpmMeans_stage"), 
                       map_target=selct)
                       
cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2)

选取的四个tRNA绘图结果如下:

image-20230903180248037

接下来就是解析每个tsRNA的分类了:

解理tsRNA的分类,如5 '和3 'halves等,更复杂。这种分类涉及到tRNA环(A-loop、anticodon loop和T-loop)中可能发生切割的位置。有一些外部软件可以从二级结构模型中预测环路。预测tRNA二级结构最流行的工具之一是tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)。这个软件的输出是一个ss文件,它以以下格式存储循环信息:

image-20230903180717224

Ss-files for most common genomes are readily available for download at http://gtrnadb.ucsc.edu/。

根据ss文件,结合比对结果就可以使用tRNA_class函数对tsRNA进行分类注释了:

代码语言:javascript
复制
###---------------------------------------------------------------------
## Generate range types using a ss-file

# Download ss object from GtRNAdb 
# ("http://gtrnadb.ucsc.edu/")
ss_file <- "/some/path/to/GtRNAdb/trna.ss"
# or for testing use the provided ss-file in secpac
ss_file <- system.file("extdata/trna2", "tRNA2.ss", 
                          package = "seqpac", mustWork = TRUE)

# Classify fragments according to loop cleavage (small loops are omitted)       
map_object_ss <- map_rangetype(map_object, type="ss", 
                               ss=ss_file, min_loop_width=4)            

# Remove reference tRNAs with no hits
map_object_ss <-  map_object_ss[
  !unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))]

# Now we have quite comprehensive tRNA loop annotations
names(map_object_ss)
map_object_ss[[1]][[2]]

# Don't forget to check ?map_rangetype to obtain more options 


###---------------------------------------------------------------------
## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves

# Does all tRNAs have 3 loops?
table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)})))

# Set tolerance for classification as a terminal tRF
tolerance <- 5  # 2 nucleotides from start or end of full-length tRNA)

### Important:
# We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure
# that we have a tolerance that include the original tRNA sequence
# we set terminal= 2+3 (5).  
## tRNA classifying function 
# Apply the tRNA_class function and make a tRNA type column
pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance)
## 
## -- Anno target was specified, will retain: 29 of 9131 seqs.
anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor)
anno(pac_trna)[1:5, 1:4] 

得到的分类结果如下:

image-20230903181248338

查看每种tsRNA类别的占比:

代码语言:javascript
复制
###---------------------------------------------------------------------
## Plot tRNA types

# Now use PAC_trna to generate some graphs based on grand means
trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
  join = TRUE, top = 15, log2fc = TRUE,
  pheno_target = list("stage", c("Stage1", "Stage3")), 
  anno_target_1 = list("type"),
  anno_target_2 = list("class"))
  
cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means,
               trna_result$plots$Log2FC_Anno_1,
               trna_result$plots$Percent_bars$Grand_means,
               nrow=1, ncol=3)

结果如下:

image-20230903181537012

每个tsRNA的最终注释结果如下:

代码语言:javascript
复制
anno_result <- anno(pac_trna)

image-20230903182329699

常见文献中tsRNA分类结果展示可以如下:

image-20230903182439352

ref:https://doi.org/10.1186/s12958-022-00978-3 Reproductive Biology and Endocrinology (2022) 20:106

根据这个分类注释做个性化绘制吧~

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2023-09-03,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信菜鸟团 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • seqpac的分析步骤主要为四步:
  • 示例数据
  • 详细步骤
  • 1.首先加载fq数据
  • 2.生成count矩阵
  • 3.构造样本表型矩阵
  • 4.构造PAC对象
  • tRNA class analysis
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档