Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case (2018, BMC Genomics, Australian National University)
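Using Eucalyptus pauciflora as a test case, the paper explores the most effective way to assemble a chloroplast genome.
## Create a virtual environment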
conda create -n chloroAssembly python=3.6
conda activate chloroAssembly
conda install unicycler
### To delete the virtual environment: conda remove -n chloroAssembly --all
https://github.com/rrwick/Unicycler Download links for the sample data used below can be found on the software's home page.
unicycler -1 short_reads_1.fastq -2 short_reads_2.fastq -l long_reads_high_depth.fastq -o output_dir -t 16
The sample data are from Helicobacter pylori. I looked it up on NCBI and the genome size is 1,667,867 bp. The Unicycler assembly result:
grep ">" assembly.fasta
>1 length=1645796 depth=1.00x circular=true
The assembly is slightly shorter (by about 22 kb); my guess is that it comes from a different strain.
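To put a number on that difference, the header line printed by `grep` can be parsed directly. This is a minimal sketch, assuming headers of the form `>NAME length=N ...` as shown above; the function name `length_gap` is hypothetical and not part of Unicycler.

```shell
# Read FASTA header lines on stdin and compare each contig length
# with a reference genome size given as the first argument.
# Prints: contig name, contig length, difference in bp, difference in percent.
length_gap() {
    ref_size="$1"
    awk -v ref="$ref_size" '
        /^>/ {
            name = substr($1, 2)          # drop the leading ">"
            len = ""
            for (i = 2; i <= NF; i++) {
                # "length=" is 7 characters, so the number starts at position 8
                if ($i ~ /^length=/) { len = substr($i, 8) }
            }
            if (len == "") next           # header without a length field
            diff = ref - len
            absdiff = (diff < 0 ? -diff : diff)
            printf "%s %d %d %.2f\n", name, len, diff, absdiff * 100 / ref
        }'
}
```

For example, `grep ">" assembly.fasta | length_gap 1667867` reports the gap for each contig in the assembly.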
wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR715/SRR7153095/SRR7153095.sra
fasterq-dump SRR7153095.sra -p
Check the quality of the long-read (third-generation) sequencing data with FastQC. It turns out FastQC can also be used on long-read data.
mkdir qcResult
fastqc SRR7153095.sra.fastq -o qcResult -t 8
Remove adapters with Porechop: https://github.com/rrwick/Porechop
conda install porechop
porechop -i SRR7153095.sra.fastq -o longReadsRemoveAdapter.fastq -t 8
Filter the reads with NanoFilt, keeping those with a quality score above 9 and a minimum length of 5000.
conda install nanofilt
bgzip longReadsRemoveAdapter.fastq
zcat longReadsRemoveAdapter.fastq.gz | NanoFilt -q 9 -l 5000 > longReadsRemoveAdapterTrim.fastq
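A quick way to check that the length filter did what was intended is to summarize read lengths before and after filtering. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); `fastq_length_summary` is a hypothetical helper, not part of NanoFilt, and it does not check quality scores.

```shell
# Read FASTQ on stdin and report the number of reads, the total number of
# bases, and how many reads are shorter than the given minimum length.
fastq_length_summary() {
    min_len="${1:-5000}"
    awk -v min="$min_len" '
        NR % 4 == 2 {                     # line 2 of each record is the sequence
            n++
            total += length($0)
            if (length($0) < min) short++
        }
        END { printf "reads=%d bases=%d below_min=%d\n", n, total, short }'
}
```

Running `fastq_length_summary 5000 < longReadsRemoveAdapterTrim.fastq` should then report `below_min=0` if the filter was applied with `-l 5000`.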
Download the short-read data
wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR715/SRR7153063/SRR7153063.sra
fasterq-dump --split-files SRR7153063.sra -p
wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR715/SRR7153071/SRR7153071.sra
fasterq-dump --split-files SRR7153071.sra -p
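Before filtering paired-end reads it can be worth confirming that both mate files contain the same number of reads. This is a minimal sketch, assuming four-line FASTQ records; the helper names `count_fastq_reads` and `check_pair` are hypothetical.

```shell
# Count reads in a FASTQ file: one record is 4 lines, so reads = lines / 4.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Compare the read counts of the two mate files of a pair.
# Prints "OK N" when they agree, otherwise "MISMATCH N1 N2" and returns 1.
check_pair() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK $n1"
    else
        echo "MISMATCH $n1 $n2"
        return 1
    fi
}
```

For example: `check_pair SRR7153071.sra_1.fastq SRR7153071.sra_2.fastq`.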
Filter the short reads. Rather than following the scripts provided with the paper, I filtered directly with fastp. Software home page: https://github.com/OpenGene/fastp
fastp -i SRR7153071.sra_1.fastq -I SRR7153071.sra_2.fastq -o shortReads71_R1.fastq -O shortReads71_R2.fastq
fastp -i SRR7153063.sra_1.fastq -I SRR7153063.sra_2.fastq -o shortReads63_R1.fastq -O shortReads63_R2.fastq
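The two fastp calls follow the same pattern, so they could also be generated from the run accessions. The sketch below only prints the commands instead of running them; `fastp_command` is a hypothetical helper, and the `shortReadsNN` naming (last two digits of the accession) is taken from the file names above.

```shell
# Print (do not run) the fastp command for one run accession.
fastp_command() {
    acc="$1"
    suffix="${acc#"${acc%??}"}"   # last two characters of the accession, e.g. 71
    echo "fastp -i ${acc}.sra_1.fastq -I ${acc}.sra_2.fastq -o shortReads${suffix}_R1.fastq -O shortReads${suffix}_R2.fastq"
}

# Dry run over both short-read runs used in these notes.
for acc in SRR7153071 SRR7153063; do
    fastp_command "$acc"
done
```

Once the printed commands look right, piping the output to `sh` would execute them.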