RNA-seq数据分析完全指北-05：去接头以及过滤

生信菜鸟团

发布于 2021-04-13 14:49:07

6.1K00

代码可运行

文章被收录于专栏：生信菜鸟团生信菜鸟团

运行总次数：0

代码可运行

1、接头产生原理

一般来说，测序结果如下如所示，包括barcode和部分insert。而barcode部分在demultiplexing会被去除，剩下的就只有测到的一部分insert序列。

但是，在实际操作中，由于各种原因，可能会出现“测通”的情况，也就是一部分adapter序列也被测序仪读取到，这时就要进行去接头操作

2、去接头

我们此处使用trim_galore。当然，可以完成这一步的软件非常多，读者可以自行探索

2.1、trim_galore常用命令

(rna) csw@taikang-bio:~/projects/rna-seq/1.sra$ trim_galore --help

 USAGE:

trim_galore [options] <filename(s)>

-h/--help               Print this help message and exits.

-v/--version            Print the version information and exits.

-q/--quality <INT>      Trim low-quality ends from reads in addition to adapter removal. Default Phred score: 20.
## 碱基质量小于多少的reads会被去除，一般选择Q20，严格的话也可以选用Q30
--phred33               Instructs Cutadapt to use ASCII+33 quality scores as Phred scores (Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.
## 默认fastq文件使用phred33
-a/--adapter <STRING>   Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will
                        try to auto-detect whether the Illumina universal, Nextera transposase or Illumina
                        small RNA adapter sequence was used. 
## 如果没有指定接头序列，trim_galore会自动检测接头序列并去除
--stringency <INT>      Overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1.
## 在3‘末端至少有几个连续碱基被认为是接头序列之后会被去除，默认是1个
-e <ERROR RATE>         Maximum allowed error rate (no. of errors divided by the length of the matching region) (default: 0.1)
## 允许的错误率是多少
--length <INT>          Discard reads that became shorter than length INT because of either
                        quality or adapter trimming. A value of '0' effectively disables
                        this behaviour. Default: 20 bp.

                        For paired-end files, both reads of a read-pair need to be longer than
                        <INT> bp to be printed out to validated paired-end files (see option --paired).
                        If only one read became too short there is the possibility of keeping such
                        unpaired single-end reads (see --retain_unpaired). Default pair-cutoff: 20 bp.
## 在去除接头之后，长度小于这个值的reads将被舍弃。对于PE测序，一个pair两个reads的长度都要大于这个值。当然也可以保留其中超过阈值的一条而舍弃另一条。
-o/--output_dir <DIR>   If specified all output will be written to this directory instead of the current
                        directory. 
## 输出文件夹，不指定会自动创建
-j/--cores INT          Number of cores to be used for trimming [default: 1]. 
## 多线程运行
--paired                This option performs length trimming of quality/adapter/RRBS trimmed reads for
                        paired-end files. To pass the validation test, both sequences of a sequence pair
                        are required to have a certain minimum length which is governed by the option
                        --length (see above). If only one read passes this length threshold the
                        other read can be rescued (see option --retain_unpaired). Using this option lets
                        you discard too short read pairs without disturbing the sequence-by-sequence order
                        of FastQ files which is required by many aligners.
## PE测序需要指定这个参数
--retain_unpaired       If only one of the two paired-end reads became too short, the longer
                        read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
                        output files. The length cutoff for unpaired single-end reads is
                        governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.
## 保留pair中通过length阈值的一条，舍弃另一条

2.2、我该怎样取舍？

PE测序可以保留pair中通过length阈值的一条，舍弃另一条，也可以同时舍弃两条。这时我们应该如何取舍？关于这个问题，主要有两种观点：

全部舍弃
保留pair中通过length阈值的一条（下称为SE），将SE结果和PE结果分别对比，最终进行count数的加和。

关于这个问题的细节，可以在黄树嘉大神的公众号《碱基矿工》查看他们的精彩论战。

私以为，可以保留SE之后看SE占总reads数的比例。如果非常小（自己界定，比如<1%？）就可以舍弃。如果比较大，就分别进行对比。

2.3、实际操作

mkdir 4.trimg
cd ./4.trimg/

ln -s ~/path/to/*rmrRNA.fq.gz ./

cat ../SRR_Acc_List.txt | while read id
do
echo "trim_galore --length 35 --paired --retain_unpaired --cores 16 -o ./ ${id}_rm_1.fq.gz ${id}_rm_2.fq.gz"
done > trim.sh

nohup bash trim.sh &

3、结果及报告解读

3.1、结果解读

每一个fastq文件会产生3个文件：val.fq.gz，unpaired.fq.gz以及去接头报告。

可以看到此处SE只有不到10M，相对于总量2G的测序数据不值一提，所以可以舍弃，后续分析会比较简便。

3.2、报告解读

自动检测接头序列

这部分会重新说明一下我们制定的处理参数

用来过滤的文件的一个整体状况。有33.9%的reads含有接头，因为质量问题被去掉的碱基有0.3%。

4、再次质控

fastqc -t 16 -o ./ ./*.fq.gz
multiqc ./*zip -o ./

可以看到主要就是接头序列没有了（图片右下角）

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-04-11，如有侵权请联系 cloudcommunity@tencent.com 删除

sql

php

本文分享自生信菜鸟团微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

sql

php

登录后参与评论

0 条评论

热度