前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >转录组上游分析错误(报错大赏)

转录组上游分析错误(报错大赏)

作者头像
生信技能树
发布2021-10-12 12:00:44
1.7K0
发布2021-10-12 12:00:44
举报
文章被收录于专栏:生信技能树
希望所有学员都可以站在生信技能树的舞台上发光发热!

下面是2020生信入门班学员的随机投稿

因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的

首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq

因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布,所以我就使用了 prefetch 下载。
代码语言:javascript
复制
#是一个小鼠的 rna-seq 数据
prefetch  SRR12108837
# 然后验证一下
vdb-validate SRR12108837.sra
#结果 显示 完整的没有问题!
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' metadata: md5 ok
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Table 'SEQUENCE' metadata: md5 ok
2021-09-19T15:07:17 vdb-validate.2.10.7 info: Column 'QUALITY': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'READ': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'SPOT_GROUP': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' contains only unaligned reads
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' is consistent

下一步就是把这个 sra 转变为 fastqc

代码如下:

代码语言:javascript
复制
fastq-dump --gzip --split-files  -O ${fqdir}  SRR12108837.sra
#因为是双端测序生成如下两个文件!
SRR12108837_1.fastq.gz SRR12108837_2.fastq.gz
#看一下输出日志
(base) t010204@bio1:~/SRR12108837$ cat FASTQ_DUMP.txt 
nohup: ignoring input
Read 23482567 spots for SRR12108837.sra
Written 23482567 spots for SRR12108837.sra
#没有问题。 进行下一步

数据质控!

代码语言:javascript
复制
# 用 fastqc + multiqc 质控
fastqc -t 10 -o $qcdir $fqdir/SRR*.fastq.gz
# 使用MultiQc整合FastQC结果
multiqc *.zip
质控显示这个数据还是有一定问题的!上图!fastqc+multiqc
可以看到这个测序数据有明显的的GC双峰!

接下来 trim 过滤

因为是双端测序,所以trim_galore命令需要加上--paired 的参数,其他参数可以--help搜一下~

代码语言:javascript
复制
trim_galore -j 20 --phred33 -q 30 --length 36 --stringency 3 --fastqc --paired --max_n 3 -o $cleandata $rawdata/SRR12108837_1.fastq.gz\ $rawdata/SRR12108837_2.fastq.gz

为了快点儿进行。。这里选择了20线程。。。。对不起。。

过滤之后进行数据比对!这里用hisat2
代码语言:javascript
复制
#因为我是小鼠的数据,所以选择 ensemble 数据库的
#选择了v100 版本的 fasta 和 100版本的gtf 文件

hisat2-build Mus_musculus.GRCm38.dna.primary_assembly.fa 
Mus_musculus.GRCm38.dna.genome 
#再构建一个 cdna 版本的 Mus_musculus.GRCm38.cdna.all.fa
hisat2-build Mus_musculus.GRCm38.cdna.all.fa 
Mus_musculus.GRCm38.cdna
看一下日志,有无报错。
primary_assembly.fa 索引文件 1小时20min
代码语言:javascript
复制
Returning block of 242335964 for bucket 8
Exited GFM loop
fchr[A]: 0
fchr[C]: 773280124
fchr[G]: 1325927941
fchr[T]: 1878618059
fchr[$]: 2652783500
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 888467914 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.1.ht2
Wrote 663195880 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1165549977 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.5.ht2
Wrote 675339278 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 2652783500
    gbwtLen: 2652783501
    nodes: 2652783501
    sz: 663195875
    gbwtSz: 663195876
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 165798969
    offsSz: 663195876
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 13816581
    numLines: 13816581
    gbwtTotLen: 884261184
    gbwtTotSz: 884261184
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 01:20:57
(base) t010204@bio1:~/cankaojiyinzu$ 

Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟
代码语言:javascript
复制
Returning block of 40310005 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 57236287
fchr[G]: 110196845
fchr[T]: 164136962
fchr[$]: 218769745
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 110015105 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.1.ht2
Wrote 54692444 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1053264143 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.5.ht2
Wrote 54828298 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 218769745
    gbwtLen: 218769746
    nodes: 218769746
    sz: 54692437
    gbwtSz: 54692437
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 13673110
    offsSz: 54692440
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 1139426
    numLines: 1139426
    gbwtTotLen: 72923264
    gbwtTotSz: 72923264
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 00:10:36
(base) t010204@bio1:~/cankaojiyinzu$ 

两种索引方法日志内容一样~~均无报错!接下来就进行比对啦!

写一个简单的 shell 脚本

代码语言:javascript
复制
# !/bin/bash
#index=~/cankaojiyinzu/Mus_musculus.GRCm38.dna.genome
index=~/cankaojiyinzu/Mus_musculus.GRCm38.cdna
inputdir=~/cleandata
outdir=~/cleandata/hisatt
# 单个样本比对
hisat2 -p 20  -x  ${index} -1 ${inputdir}/SRR12108837_1_val_1.fq.gz -2 ${inputdir}/SRR12108837_2_val_2.fq.gz -S ${outdir}/SRR12108837.Hisat_cdna_aln.sam

看一下比对结果情况:先看dna的 比对到 96.9%

代码语言:javascript
复制
(base) t010204@bio1:~/cleandata$ cat hi_wrong.txt 
nohup: ignoring input
23480224 reads; of these:
  23480224 (100.00%) were paired; of these:
    1295673 (5.52%) aligned concordantly 0 times
    18771102 (79.94%) aligned concordantly exactly 1 time
    3413449 (14.54%) aligned concordantly >1 times
    ----
    1295673 pairs aligned concordantly 0 times; of these:
      47378 (3.66%) aligned discordantly 1 time
    ----
    1248295 pairs aligned 0 times concordantly or discordantly; of these:
      2496590 mates make up the pairs; of these:
        1455247 (58.29%) aligned 0 times
        900163 (36.06%) aligned exactly 1 time
        141180 (5.65%) aligned >1 times
96.90% overall alignment rate
(base) t010204@bio1:~/cleandata$ 

再看一下 cdna 的 比对到 92.12%

代码语言:javascript
复制
(base) t010204@bio1:~/cleandata$ cat hi_cdna_wrong.txt 
nohup: ignoring input
23480224 reads; of these:
  23480224 (100.00%) were paired; of these:
    2727958 (11.62%) aligned concordantly 0 times
    9665614 (41.16%) aligned concordantly exactly 1 time
    11086652 (47.22%) aligned concordantly >1 times
    ----
    2727958 pairs aligned concordantly 0 times; of these:
      32392 (1.19%) aligned discordantly 1 time
    ----
    2695566 pairs aligned 0 times concordantly or discordantly; of these:
      5391132 mates make up the pairs; of these:
        3699423 (68.62%) aligned 0 times
        692537 (12.85%) aligned exactly 1 time
        999172 (18.53%) aligned >1 times
92.12% overall alignment rate

用 Samtools 把比对的 Sam文件 转为二进制 bam 文件

代码语言:javascript
复制
# sam转bam
samtools sort -@ 20 -o SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sam
# 对bam建索引
samtools index SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sorted.bam.bai
看一下sam 文件和 bam 文件的大小情况
代码语言:javascript
复制
(base) t010204@bio1:~/cleandata/hisatt$ ll -ah
total 68G
drwxrwxr-x 2 t010204 t010204 4.0K 9月  20 00:29 ./
drwxrwxr-x 5 t010204 t010204 4.0K 9月  20 00:01 ../
-rw-rw-r-- 1 t010204 t010204   44 9月  19 21:21 index.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  20 00:26 samtool_cdna2_sort.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  19 21:04 samtool_cdna_sort.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  19 21:00 samtool_sort.txt
-rw-rw-r-- 1 t010204 t010204  23G 9月  19 20:13 SRR12108837.Hisat_aln.sam
-rw-rw-r-- 1 t010204 t010204 1.9G 9月  19 21:02 SRR12108837.Hisat_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 2.1M 9月  19 21:20 SRR12108837.Hisat_aln.sorted.bam.bai
-rw-rw-r-- 1 t010204 t010204 3.1G 9月  20 00:29 SRR12108837.Hisat_cdna2_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204  37G 9月  19 20:30 SRR12108837.Hisat_cdna_aln.sam
-rw-rw-r-- 1 t010204 t010204 3.1G 9月  19 21:06 SRR12108837.Hisat_cdna_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 6.4M 9月  19 21:22 SRR12108837.Hisat_cdna_aln.sorted.bam.bai

可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件,是因为害怕这里出错,所以用samtools 运行了两遍,看看两遍大小是否一致!)

走featureCounts定量流程

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引,所以 这里用v100 gtf 文件作为注释文件,保证版本一致性。
代码语言:javascript
复制
# !/bin/bash
gtf=~/cankaojiyinzu/Mus_musculus.GRCm38.100.chr.gtf.gz
inputdir=~/cleandata/hisatt
outdir=~/cleandata/feature_counts

featureCounts -T 20 -p -t exon -g gene_id -a $gtf -o $outdir/all.cdna_id.txt  $inputdir/*.Hisat_cdna_aln.sorted.bam

看一下结果!dna 比对了 66.5% 0.39 分钟
代码语言:javascript
复制
nohup: ignoring input

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   v2.0.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o SRR12108837.Hisat_aln.sorted.bam               ||
||                                                                            ||
||             Output file : all.id.txt                                       ||
||                 Summary : all.id.txt.summary                               ||
||              Annotation : Mus_musculus.GRCm38.100.chr.gtf.gz (GTF)         ||
||      Dir for temp files : /home/data/t010204/cleandata/feature_counts      ||
||                                                                            ||
||                 Threads : 20                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Mus_musculus.GRCm38.100.chr.gtf.gz ...                ||
||    Features : 843402                                                       ||
||    Meta-features : 55401                                                   ||
||    Chromosomes/contigs : 22                                                ||
||                                                                            ||
|| Process BAM file SRR12108837.Hisat_aln.sorted.bam...                       ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 27509024                                             ||
||    Successfully assigned alignments : 18300209 (66.5%)                     ||
||    Running time : 0.39 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/data/t010204/clea  ||
|| ndata/feature_counts/all.id.txt.summary"                                   ||
||                                                                            ||
\\============================================================================//

再看一下cdna 的 成功比对0 ?????

代码语言:javascript
复制
nohup: ignoring input

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   v2.0.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o SRR12108837.Hisat_cdna2_aln.sorted.bam         ||
||                                                                            ||
||             Output file : all.cdna2_id.txt                                 ||
||                 Summary : all.cdna2_id.txt.summary                         ||
||              Annotation : Mus_musculus.GRCm38.100.gtf.gz (GTF)             ||
||      Dir for temp files : /home/data/t010204/cleandata/feature_counts      ||
||                                                                            ||
||                 Threads : 20                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Mus_musculus.GRCm38.100.gtf.gz ...                    ||
||    Features : 843712                                                       ||
||    Meta-features : 55487                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file SRR12108837.Hisat_cdna2_aln.sorted.bam...                 ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 45117185                                             ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.65 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/data/t010204/clea  ||
|| ndata/feature_counts/all.cdna2_id.txt.summary"                             ||
||                                                                            ||
\\============================================================================//

cdna的那个比对后,不能靠gtf定量, 因为gtf是描述基因在dna的坐标。

cdna是没有坐标的,染色体都没有!! 这才懂得为什么出现没有比对上的情况!!!

人傻了。。

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-09-21,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的
    • 首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq
      • 下一步就是把这个 sra 转变为 fastqc
        • 数据质控!
          • 接下来 trim 过滤
            • 用 Samtools 把比对的 Sam文件 转为二进制 bam 文件
            • 走featureCounts定量流程
            • 再看一下cdna 的 成功比对0 ?????
        领券
        问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档