转录组上游分析错误（报错大赏）

生信技能树

发布于 2021-10-12 12:00:44

1.7K0

发布于 2021-10-12 12:00:44

文章被收录于专栏：生信技能树

希望所有学员都可以站在生信技能树的舞台上发光发热！

下面是2020生信入门班学员的随机投稿

因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的

首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq

因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布，所以我就使用了 prefetch 下载。

#是一个小鼠的 rna-seq 数据
prefetch  SRR12108837
# 然后验证一下
vdb-validate SRR12108837.sra
#结果 显示 完整的没有问题！
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' metadata: md5 ok
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Table 'SEQUENCE' metadata: md5 ok
2021-09-19T15:07:17 vdb-validate.2.10.7 info: Column 'QUALITY': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'READ': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'SPOT_GROUP': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' contains only unaligned reads
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' is consistent

下一步就是把这个 sra 转变为 fastqc

代码如下:

fastq-dump --gzip --split-files  -O ${fqdir}  SRR12108837.sra
#因为是双端测序生成如下两个文件！
SRR12108837_1.fastq.gz SRR12108837_2.fastq.gz
#看一下输出日志
(base) t010204@bio1:~/SRR12108837$ cat FASTQ_DUMP.txt 
nohup: ignoring input
Read 23482567 spots for SRR12108837.sra
Written 23482567 spots for SRR12108837.sra
#没有问题。 进行下一步

数据质控！

# 用 fastqc + multiqc 质控
fastqc -t 10 -o $qcdir $fqdir/SRR*.fastq.gz
# 使用MultiQc整合FastQC结果
multiqc *.zip

质控显示这个数据还是有一定问题的！上图！fastqc+multiqc

可以看到这个测序数据有明显的的GC双峰!

接下来 trim 过滤

因为是双端测序，所以trim_galore命令需要加上--paired 的参数,其他参数可以--help搜一下~

trim_galore -j 20 --phred33 -q 30 --length 36 --stringency 3 --fastqc --paired --max_n 3 -o $cleandata $rawdata/SRR12108837_1.fastq.gz\ $rawdata/SRR12108837_2.fastq.gz

为了快点儿进行。。这里选择了20线程。。。。对不起。。

过滤之后进行数据比对！这里用hisat2

#因为我是小鼠的数据，所以选择 ensemble 数据库的
#选择了v100 版本的 fasta 和 100版本的gtf 文件

hisat2-build Mus_musculus.GRCm38.dna.primary_assembly.fa 
Mus_musculus.GRCm38.dna.genome 
#再构建一个 cdna 版本的 Mus_musculus.GRCm38.cdna.all.fa
hisat2-build Mus_musculus.GRCm38.cdna.all.fa 
Mus_musculus.GRCm38.cdna

看一下日志，有无报错。

primary_assembly.fa 索引文件 1小时20min

Returning block of 242335964 for bucket 8
Exited GFM loop
fchr[A]: 0
fchr[C]: 773280124
fchr[G]: 1325927941
fchr[T]: 1878618059
fchr[$]: 2652783500
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 888467914 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.1.ht2
Wrote 663195880 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1165549977 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.5.ht2
Wrote 675339278 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 2652783500
    gbwtLen: 2652783501
    nodes: 2652783501
    sz: 663195875
    gbwtSz: 663195876
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 165798969
    offsSz: 663195876
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 13816581
    numLines: 13816581
    gbwtTotLen: 884261184
    gbwtTotSz: 884261184
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 01:20:57
(base) t010204@bio1:~/cankaojiyinzu$

Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟

Returning block of 40310005 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 57236287
fchr[G]: 110196845
fchr[T]: 164136962
fchr[$]: 218769745
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 110015105 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.1.ht2
Wrote 54692444 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1053264143 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.5.ht2
Wrote 54828298 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 218769745
    gbwtLen: 218769746
    nodes: 218769746
    sz: 54692437
    gbwtSz: 54692437
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 13673110
    offsSz: 54692440
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 1139426
    numLines: 1139426
    gbwtTotLen: 72923264
    gbwtTotSz: 72923264
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 00:10:36
(base) t010204@bio1:~/cankaojiyinzu$

两种索引方法日志内容一样~~均无报错！接下来就进行比对啦！

写一个简单的 shell 脚本

# !/bin/bash
#index=~/cankaojiyinzu/Mus_musculus.GRCm38.dna.genome
index=~/cankaojiyinzu/Mus_musculus.GRCm38.cdna
inputdir=~/cleandata
outdir=~/cleandata/hisatt
# 单个样本比对
hisat2 -p 20  -x  ${index} -1 ${inputdir}/SRR12108837_1_val_1.fq.gz -2 ${inputdir}/SRR12108837_2_val_2.fq.gz -S ${outdir}/SRR12108837.Hisat_cdna_aln.sam

看一下比对结果情况:先看dna的比对到 96.9%

(base) t010204@bio1:~/cleandata$ cat hi_wrong.txt 
nohup: ignoring input
23480224 reads; of these:
  23480224 (100.00%) were paired; of these:
    1295673 (5.52%) aligned concordantly 0 times
    18771102 (79.94%) aligned concordantly exactly 1 time
    3413449 (14.54%) aligned concordantly >1 times
    ----
    1295673 pairs aligned concordantly 0 times; of these:
      47378 (3.66%) aligned discordantly 1 time
    ----
    1248295 pairs aligned 0 times concordantly or discordantly; of these:
      2496590 mates make up the pairs; of these:
        1455247 (58.29%) aligned 0 times
        900163 (36.06%) aligned exactly 1 time
        141180 (5.65%) aligned >1 times
96.90% overall alignment rate
(base) t010204@bio1:~/cleandata$

再看一下 cdna 的比对到 92.12%

(base) t010204@bio1:~/cleandata$ cat hi_cdna_wrong.txt 
nohup: ignoring input
23480224 reads; of these:
  23480224 (100.00%) were paired; of these:
    2727958 (11.62%) aligned concordantly 0 times
    9665614 (41.16%) aligned concordantly exactly 1 time
    11086652 (47.22%) aligned concordantly >1 times
    ----
    2727958 pairs aligned concordantly 0 times; of these:
      32392 (1.19%) aligned discordantly 1 time
    ----
    2695566 pairs aligned 0 times concordantly or discordantly; of these:
      5391132 mates make up the pairs; of these:
        3699423 (68.62%) aligned 0 times
        692537 (12.85%) aligned exactly 1 time
        999172 (18.53%) aligned >1 times
92.12% overall alignment rate

用 Samtools 把比对的 Sam文件转为二进制 bam 文件

# sam转bam
samtools sort -@ 20 -o SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sam
# 对bam建索引
samtools index SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sorted.bam.bai

看一下sam 文件和 bam 文件的大小情况

(base) t010204@bio1:~/cleandata/hisatt$ ll -ah
total 68G
drwxrwxr-x 2 t010204 t010204 4.0K 9月  20 00:29 ./
drwxrwxr-x 5 t010204 t010204 4.0K 9月  20 00:01 ../
-rw-rw-r-- 1 t010204 t010204   44 9月  19 21:21 index.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  20 00:26 samtool_cdna2_sort.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  19 21:04 samtool_cdna_sort.txt
-rw-rw-r-- 1 t010204 t010204   87 9月  19 21:00 samtool_sort.txt
-rw-rw-r-- 1 t010204 t010204  23G 9月  19 20:13 SRR12108837.Hisat_aln.sam
-rw-rw-r-- 1 t010204 t010204 1.9G 9月  19 21:02 SRR12108837.Hisat_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 2.1M 9月  19 21:20 SRR12108837.Hisat_aln.sorted.bam.bai
-rw-rw-r-- 1 t010204 t010204 3.1G 9月  20 00:29 SRR12108837.Hisat_cdna2_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204  37G 9月  19 20:30 SRR12108837.Hisat_cdna_aln.sam
-rw-rw-r-- 1 t010204 t010204 3.1G 9月  19 21:06 SRR12108837.Hisat_cdna_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 6.4M 9月  19 21:22 SRR12108837.Hisat_cdna_aln.sorted.bam.bai

可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件，是因为害怕这里出错，所以用samtools 运行了两遍，看看两遍大小是否一致！)

走featureCounts定量流程

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引，所以这里用v100 gtf 文件作为注释文件，保证版本一致性。

# !/bin/bash
gtf=~/cankaojiyinzu/Mus_musculus.GRCm38.100.chr.gtf.gz
inputdir=~/cleandata/hisatt
outdir=~/cleandata/feature_counts

featureCounts -T 20 -p -t exon -g gene_id -a $gtf -o $outdir/all.cdna_id.txt  $inputdir/*.Hisat_cdna_aln.sorted.bam

看一下结果！dna 比对了 66.5% 0.39 分钟

nohup: ignoring input

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   v2.0.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o SRR12108837.Hisat_aln.sorted.bam               ||
||                                                                            ||
||             Output file : all.id.txt                                       ||
||                 Summary : all.id.txt.summary                               ||
||              Annotation : Mus_musculus.GRCm38.100.chr.gtf.gz (GTF)         ||
||      Dir for temp files : /home/data/t010204/cleandata/feature_counts      ||
||                                                                            ||
||                 Threads : 20                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Mus_musculus.GRCm38.100.chr.gtf.gz ...                ||
||    Features : 843402                                                       ||
||    Meta-features : 55401                                                   ||
||    Chromosomes/contigs : 22                                                ||
||                                                                            ||
|| Process BAM file SRR12108837.Hisat_aln.sorted.bam...                       ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 27509024                                             ||
||    Successfully assigned alignments : 18300209 (66.5%)                     ||
||    Running time : 0.39 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/data/t010204/clea  ||
|| ndata/feature_counts/all.id.txt.summary"                                   ||
||                                                                            ||
\\============================================================================//

再看一下cdna 的成功比对0 ？？？？？

nohup: ignoring input

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   v2.0.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o SRR12108837.Hisat_cdna2_aln.sorted.bam         ||
||                                                                            ||
||             Output file : all.cdna2_id.txt                                 ||
||                 Summary : all.cdna2_id.txt.summary                         ||
||              Annotation : Mus_musculus.GRCm38.100.gtf.gz (GTF)             ||
||      Dir for temp files : /home/data/t010204/cleandata/feature_counts      ||
||                                                                            ||
||                 Threads : 20                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Mus_musculus.GRCm38.100.gtf.gz ...                    ||
||    Features : 843712                                                       ||
||    Meta-features : 55487                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file SRR12108837.Hisat_cdna2_aln.sorted.bam...                 ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 45117185                                             ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.65 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/data/t010204/clea  ||
|| ndata/feature_counts/all.cdna2_id.txt.summary"                             ||
||                                                                            ||
\\============================================================================//

cdna的那个比对后，不能靠gtf定量，因为gtf是描述基因在dna的坐标。

cdna是没有坐标的，染色体都没有!! 这才懂得为什么出现没有比对上的情况！！！

人傻了。。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-09-21，如有侵权请联系 cloudcommunity@tencent.com 删除

shell

php

本文分享自生信技能树微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

shell

php

登录后参与评论

0 条评论

热度

转录组上游分析错误（报错大赏）

转录组上游分析错误（报错大赏）

因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的

首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq

因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布，所以我就使用了 prefetch 下载。

下一步就是把这个 sra 转变为 fastqc

数据质控！

质控显示这个数据还是有一定问题的！上图！fastqc+multiqc

可以看到这个测序数据有明显的的GC双峰!

接下来 trim 过滤

过滤之后进行数据比对！这里用hisat2

看一下日志，有无报错。

primary_assembly.fa 索引文件 1小时20min

Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟

两种索引方法日志内容一样~~均无报错！接下来就进行比对啦！

用 Samtools 把比对的 Sam文件转为二进制 bam 文件

看一下sam 文件和 bam 文件的大小情况

可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件，是因为害怕这里出错，所以用samtools 运行了两遍，看看两遍大小是否一致！)

走featureCounts定量流程

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引，所以这里用v100 gtf 文件作为注释文件，保证版本一致性。

看一下结果！dna 比对了 66.5% 0.39 分钟

再看一下cdna 的成功比对0 ？？？？？

cdna的那个比对后，不能靠gtf定量，因为gtf是描述基因在dna的坐标。

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

转录组上游分析错误（报错大赏）

转录组上游分析错误（报错大赏）

因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的

首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq

因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布，所以我就使用了 prefetch 下载。

下一步就是把这个 sra 转变为 fastqc

数据质控！

质控显示这个数据还是有一定问题的！上图！fastqc+multiqc

可以看到这个测序数据有明显的的GC双峰!

接下来 trim 过滤

过滤之后进行数据比对！这里用hisat2

看一下日志，有无报错。

primary_assembly.fa 索引文件 1小时20min

Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟

两种索引方法日志内容一样~~均无报错！接下来就进行比对啦！

用 Samtools 把比对的 Sam文件 转为二进制 bam 文件

看一下sam 文件和 bam 文件的大小情况

可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件，是因为害怕这里出错，所以用samtools 运行了两遍，看看两遍大小是否一致！)

走featureCounts定量流程

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引，所以 这里用v100 gtf 文件作为注释文件，保证版本一致性。

看一下结果！dna 比对了 66.5% 0.39 分钟

再看一下cdna 的 成功比对0 ？？？？？

cdna的那个比对后，不能靠gtf定量， 因为gtf是描述基因在dna的坐标。

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

用 Samtools 把比对的 Sam文件转为二进制 bam 文件

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引，所以这里用v100 gtf 文件作为注释文件，保证版本一致性。

再看一下cdna 的成功比对0 ？？？？？

cdna的那个比对后，不能靠gtf定量，因为gtf是描述基因在dna的坐标。