通过简单数据熟悉Linux下生物信息学各种操作2

原地址

一共三部分 通过简单数据熟悉Linux下生物信息学各种操作1 通过简单数据熟悉Linux下生物信息学各种操作2 通过简单数据熟悉Linux下生物信息学各种操作3


11安装使用bwa(mac)

11.1安装编译并创建到bin目录的链接

cat src
curl -OL http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
tar jxvf bwa-0.7.17.tar.bz2
cd bwa-0.7.17/
make
ln -s ~/src/bwa-0.7.17/bwa ~/bin

11.2 制作index

创建文件夹放references 获取1999爆发ebola的参考基因组

mkdir -p ~/refs/852
efetch -db nucleotide -id NC_002549 -format fasta > ~/refs/852/ebola-1999.fa
bwa index ~/refs/852/ebola-1999.fa
ls ~/refs/852
ebola-1999.fa       ebola-1999.fa.ann   ebola-1999.fa.pac
ebola-1999.fa.amb   ebola-1999.fa.bwt   ebola-1999.fa.sa

通过blast也可以搜寻到

makeblastdb -in ~/refs/852/ebola-1999.fa -dbtype nucl -parse_seqids
ls ~/refs/852
ebola-1999.fa       ebola-1999.fa.nhr   ebola-1999.fa.nsi
ebola-1999.fa.amb   ebola-1999.fa.nin   ebola-1999.fa.nsq
ebola-1999.fa.ann   ebola-1999.fa.nog   ebola-1999.fa.pac
ebola-1999.fa.bwt   ebola-1999.fa.nsd   ebola-1999.fa.sa

共有16个文件,这也是为什么刚才创建单独文件夹的原因 获取ebolas 基因组的前1行作为query序列

head -2 ~/refs/852/ebola-1999.fa > query.fa

现在用bwa-mem把上面的query map到它自己的基因组上去

cat results.sam 
@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0   NC_002549.1 1   60  70M *   0   0   CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA  *   NM:i:0  MD:Z:70 AS:i:70 XS:i:0

比较用上面同样序列的blasting结果

blastn -db ~/refs/852/ebola-1999.fa -query query.fa > results.blastn
Query= NC_002549.1 Zaire ebolavirus isolate Ebola
virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome

Length=70
                                                                      Score        E
Sequences producing significant alignments:                          (Bits)     Value

NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD...  130        6e-34


>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, 
complete genome
Length=18959

 Score = 130 bits (70),  Expect = 6e-34
 Identities = 70/70 (100%), Gaps = 0/70 (0%)
 Strand=Plus/Plus

Query  1   CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA  60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1   CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA  60

Query  61  AGATTAATAA  70
           ||||||||||
Sbjct  61  AGATTAATAA  70



Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 1079922


  Database: /Users/ucco/refs/852/ebola-1999.fa

12理解SAM格式

现在vi query.fa,增加或删除几个bases,然后进行比对 比如在第6个碱基开始添加一串T

cat query.fa 
>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome
CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA

再次运行比对程序,结果如下

@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0   NC_002549.1 6   60  14S65M  *   0   CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAANM:i:0   MD:Z:65 AS:i:65 XS:i:0
#之前没有修改bases的序列比对结果
@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0   NC_002549.1 1   60  70M *   0   0   CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA  *   NM:i:0  MD:Z:70 AS:i:70 XS:i:0

12.1切片操作看特定列

为了查看特定列,可以进行cut操作,要深入理解每列的意义 query id,alignment flag and target id

cat results.sam |cut -f 1,2,3
@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa
NC_002549.1 0   NC_002549.1

比对的start,mapping quality,CIGAR string

cat results.sam |cut -f 4,5,6

VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
1   60  70M

paired end read information,name ,position and length of template

cat results.sam |cut -f 7,8,9


*   0   0

query sequence, query quality, other attributes

cat results.sam |cut -f 10-14


CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA  *   NM:i:0  MD:Z:70 AS:i:70

12.2 同时比对两个序列

在这里下载示例数据

curl -O http://www.personal.psu.edu/iua1/courses/files/2014/data-14.tar.gz
tar xzvf data-14.tar.gz

会产生两个名字为read1.fq和read2.fq的文件 用bwa比对

bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam

部分结果如下

@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa read1.fq read2.fq
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0  83  NC_002549.1 17679   60  70M =   17166   -583    AATTCAACGAAGCCCATACTGGCTAAGTCATTTAACTCAGTATGCTGACTGTGAGTTACATTTAAGTTAT  2222222222222222222222222222222222222222222222222222222222222222222222  NM:i:0  MD:Z:70 MC:Z:70M    AS:i:70 XS:i:0
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0  163 NC_002549.1 17166   60  70M =   17679   583 CAGTCTAGAGTCAGAAATAGTATCAGGAATGACTACTCCTAGGATGCTTCTACCCGTTATGTCAAAATTC  2222222222222222222222222222222222222222222222222222222222222222222222  NM:i:2  MD:Z:4A49T15    MC:Z:70M    AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1    99  NC_002549.1 4818    60  70M =   5265    517 GCCATCATGCTTGCTTCATACACTATCACCCAATTCGGCAAGGCAACCAATCCACTTGTCAGAGTCAATC  2222222222222222222222222222222222222222222222222222222222222222222222  NM:i:1  MD:Z:32T37  MC:Z:70M    AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1    147 NC_002549.1 5265    60  70M =   4818    -517    GTGCCAGAAACTCTGGTCCTCAAGCTGACCGGTAAGAAGGTGACTTCTAAAATTCGACAACCAATCATCC  2222222222222222222222222222222222222222222222222222222222222222222222  NM:i:3  MD:Z:19A32A1G15 MC:Z:70M    AS:i:5XS:i:0
gi|10313991|ref|NC_002549.1|_8927_9349_3:0:0_3:0:0_2    83  NC_002549.1 9280    60  70M =   8927    -423    GGTTCAGGGTTAAGAACATTGGTTCCTCAATCAGATAATGAGGAAGCTTCAACCAAACCGGGGAAATGCT  2222222222222222222222222222222222222222222222222222222222222222222222  NM:i:3  MD:Z:1T54C7C5   MC:Z:70M    AS:i:5XS:i:0

为了简单明了表示,把上述的qurey加T后一起运行

 cp query.fa query_addT.fa
vi query_addT.fa
 bwa mem ~/refs/852/ebola-1999.fa query.fa query_addT.fa> xresults.sam

结果如下

@SQ SN:NC_002549.1  LN:18959
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa query_addT.fa
NC_002549.1 65  NC_002549.1 1   60  70M =   6   6   CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA  *   NM:i:0  MD:Z:70 MC:Z:15S65M AS:i:70 XS:i:0
NC_002549.1 129 NC_002549.1 6   60  15S65M  =   1   -6  CGGACTTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA    *   NM:i:0  MD:Z:65 MC:Z:70M    AS:i:65 XS:i:0

13 BAM 文件和samtools

这部分暂时隔过

14 用ReadSeq转换数据

14.1安装ReadSeq

mkdir -p readseq
cd ~/src/readseq
curl -OL http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq.jar

我打不开这个网站,然后google下载的. 因为readseq调用的方式和前面安装的trimmomatic相似,所以cp

cp ~/bin/trimmomatic ~/bin/readseq
vi ~/bin/readseq

改成下图即可

vi_readseq

14.2获取1999 ebola基因组的full genbank 记录

http://www.ncbi.nlm.nih.gov/genome/4887 进入lec文件夹,随便你以前用其他名字命名的 获取complete data

efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb

格式为GFF(Generic Feature Format)文件

readseq -format=GFF NC.gb
# 也可以设定输出名字
readseq -format=GFF -o NC-all.gff NC.gb

会有以下几个文件

4files

提取到fasta 文件

readseq -format=FASTA -o NC.fa NC.gb

先查看一下gff文件

cat NC-all.gff |head
##gff-version 2
# seqname   source  feature start   end score   strand  frame   attributes
NC_002549   -   source  1   18959   .   +   .   organism "Zaire ebolavirus" ; mol_type "viral cRNA" ; isolate "Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga" ; db_xref "taxon:186538"
NC_002549   -   5'UTR   1   55  .   +   .   note "leader region" ; citation "[1]" ; function "regulation or initiation of RNA replication"
NC_002549   -   gene    56  3026    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549   -   mRNA    56  3026    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; product "nucleoprotein" ; db_xref "GeneID:911830"
NC_002549   -   regulatory  56  67  .   +   .   regulatory_class "other" ; gene "NP" ; locus_tag "ZEBOVgp1" ; note "putative; transcription start signal" ; citation "[1]"
NC_002549   -   CDS 470 2689    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; function "encapsidation of genomic RNA" ; codon_start 1 ; product "nucleoprotein" ; protein_id "NP_066243.1" ; db_xref "GeneID:911830" ; translation "MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAFEAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEELLPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQARFSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLLNLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAESRELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGHYDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAPDDLVLFDLDEDDEDTKPVPNRSTKGGQQKNSQKGQHIEGRQTQSRPIQNVPGPHRTIHHASAPLTDNDRRNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDGTSNRTPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQSEHSFEEMYRHILRSQGPFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQFYWPVMNHKNKFMAILQHHQ"
NC_002549   -   regulatory  3015    3026    .   +   .   regulatory_class "polyA_signal_sequence" ; gene "NP" ; locus_tag "ZEBOVgp1"
NC_002549   -   misc_feature    3027    3031    .   +   .   note "intergenic region"

只获取gff文件中的gene rows

cat NC-all.gff |egrep '\tgene\t'
NC_002549   -   gene    56  3026    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549   -   gene    3032    4407    .   +   .   gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549   -   gene    4390    5894    .   +   .   gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549   -   gene    5900    8305    .   +   .   gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549   -   gene    8288    9740    .   +   .   gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549   -   gene    9885    11518   .   +   .   gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549   -   gene    11501   18282   .   +   .   gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"

实际上我们还想要前两行的注释行

cat NC-all.gff |head -2 > NC-genes.gff
cat NC-all.gff | egrep '\tgene\t' >> NC-genes.gff
cat NC-genes.gff 
##gff-version 2
# seqname   source  feature start   end score   strand  frame   attributes
NC_002549   -   gene    56  3026    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549   -   gene    3032    4407    .   +   .   gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549   -   gene    4390    5894    .   +   .   gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549   -   gene    5900    8305    .   +   .   gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549   -   gene    8288    9740    .   +   .   gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549   -   gene    9885    11518   .   +   .   gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549   -   gene    11501   18282   .   +   .   gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
NC_002549   -   gene    56  3026    .   +   .   gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549   -   gene    3032    4407    .   +   .   gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549   -   gene    4390    5894    .   +   .   gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549   -   gene    5900    8305    .   +   .   gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549   -   gene    8288    9740    .   +   .   gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549   -   gene    9885    11518   .   +   .   gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549   -   gene    11501   18282   .   +   .   gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"

染色体坐标不匹配,需要修复 重新索引并且重新比对

cp NC.fa ~/refs/852/
bwa index ~/refs/852/NC.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.01 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /Users/ucco/refs/852/NC.fa
[main] Real time: 0.113 sec; CPU: 0.015 sec

再align 注意我的命名有点乱,最好去完全按原文中的命名 下面的align.sh我没找到

cp ../lec4/*.fq .
bash align.sh read1.fq read2.fq results.bam

载入IGV,看100-150bp区域的深度

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

扫码关注云+社区

领取腾讯云代金券