参考snakemake 官方的教程。
这个snakemake workflow 主要包括:mapping, sort >> index >> call variants
我们依然先使用空文件来模拟过程。
mkdir -p data/samples
touch data/genome.fa data/samples/{A..D}.fastq
我们同样需要将规则写入Snakefile
文件中:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
通过bwa,将输入的fq 文件,和提供的参考基因组作为输入, 并直接通过管道符号通过samtools 转为bam。ps:这里如果直接使用samtools 的-o 参数呢?
直接使用snakemake即可:
snakemake -np mapped_reads/A.bam
同样,我们也可以在我们的规则中,使用通配符:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
snakemake -np mapped_reads/{A..C}.bam
增加新的功能,也就是增加新的规则:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
需要注意的是,shell 中的语法规则有所不同。我们在snakemake 中使用的{sample},实际上是创建的wildcards 对象的一个属性。因此在shell 中需要写为{wildcards.sample}
。
ps:这里-T 参数实际也是指定的临时文件的前缀。
尝试运行上述内容:
snakemake -np mapped_reads/B.bam
snakemake -np sorted_reads/B.bam
上面两行代码,只有第二行才会触发完整的规则,这也同样说明snakemake 是以输出为导向的。
接下来加入构建索引内容:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
最后使用bcftools 来查找变异。
这里有个关于expand 的使用技巧,可以参考:[[01-初探snakemake]] 中6-整合多个结果
的介绍。
将其整合入先前的流程:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
SAMPLES = ["A", "B"]
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
尝试运行命令:
snakemake -np calls/all.vcf
可以看到,bcftools 字段是将几个bam 命令整合到了一起:
bcftools mpileup -f data/genome.fa sorted_reads/A.bam sorted_reads/B.bam | bcftools call -mv - > calls/all.vcf
使用[[02-可视化展示流程]] 的方法,我们可以将最终的流程可视化出来:
snakemake --dag calls/all.vcf | dot -Tpng > output/variant.png
这里我们还可以增加一个规则,用于对质量结果绘制直方图:
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
py 脚本如下:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
其实这里无论是python,还是R,只要是能够接受对应的input 文件即可。
如果使用的是R,可以参考:Snakefiles and Rules — Snakemake 7.8.0 documentation[2]
也提供了R 及Rmarkdown的支持。
ps:以后直接从测序数据得到输出的Rmd 文档。想想都很爽啊!
默认情况下,snakemake 会将工作流中的第一个rule 作为target,也就是将该条rule 下的output 作为snakemake 的默认输出。
因此,我们最好专门的指定一个“总规则”,以确定最终默认的输出,即不指定output下,一般设置all 规则为:
rule all:
input:
"plots/quals.svg"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
SAMPLES = ["A", "B"]
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
有意思的是,这里指定的实际上是input,而非output,如果我们在all 规则中书写的是output,则all 规则将孤立,错误的输出结果:
$ snakemake -np
Building DAG of jobs...
Job stats:
job count min threads max threads
----- ------- ------------- -------------
all 1 1 1
total 1 1 1
[Fri May 27 22:08:04 2022]
localrule all:
output: plots/quals.svg
jobid: 0
reason: Missing output files: plots/quals.svg
resources: tmpdir=/tmp
Job stats:
job count min threads max threads
----- ------- ------------- -------------
all 1 1 1
total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
参见:Short tutorial — Snakemake 7.8.0 documentation[3]
提供了参考数据:
wget https://archive.fastgit.org/snakemake/snakemake-tutorial-data/archive/refs/tags/v7.4.2.tar.gz
tar --wildcards -xf v5.4.5.tar.gz --strip 1 "*/data"
创建项目目录:
mkdir -p results scripts
上述文件中包含如下内容:
$ tree
.
├── data
│ ├── genome.fa
│ ├── genome.fa.amb
│ ├── genome.fa.ann
│ ├── genome.fa.bwt
│ ├── genome.fa.fai
│ ├── genome.fa.pac
│ ├── genome.fa.sa
│ └── samples
│ ├── A.fastq
│ ├── B.fastq
│ └── C.fastq
下载:
mamba create -n snakemake_wes_simple -y pysam matplotlib bwa samtools bcftools snakemake graphviz
发现snakemake 也是可以直接在规则中整合使用的conda 环境的:
rule map_reads:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"results/mapped/{sample}.bam"
conda:
"envs/mapping.yaml"
shell:
"bwa mem {input} | samtools view -b - > {output}"
只要在yaml 文件中写明即可:
channels:
- bioconda
- conda-forge
dependencies:
- bwa =0.7.17
- samtools =1.9
其实conda 也可以生成相关的文件:
conda env export > py36.yaml
不过这里我还是在对应的环境里进行操作。这里额外补充一点,除了工作流外,环境配置,也是可重复任务重要的一环。这里我也将我的conda 环境进行打包,可以直接通过我的配置文件下载相关的软件,使用conda “复刻”我的环境。当然,我还是觉得如docker 之类的容器软件更加方便一些。
py 脚本:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
需要注意的是,一般来说,使用bwa,我们还需要提前为参考基因组构建索引。
创建Snakefile文件:
SAMPLES = ["A", "B", "C"]
rule all:
input:
"results/calls/all.vcf",
"results/plots/quals.svg"
rule map_reads:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"results/mapped/{sample}.bam"
shell:
"bwa mem {input} | samtools view -b - > {output}"
rule sort_alignments:
input:
"results/mapped/{sample}.bam"
output:
"results/mapped/{sample}.sorted.bam"
shell:
"samtools sort -o {output} {input}"
rule call_variants:
input:
fa="data/genome.fa",
bam=expand("results/mapped/{sample}.sorted.bam", sample=SAMPLES)
output:
"results/calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | bcftools call -mv - > {output}"
rule plot_quals:
input:
"results/calls/all.vcf"
output:
"results/plots/quals.svg"
script:
"scripts/plot-quals.py"
可视化看看:
snakemake --dag | dot -Tpng > dag.png
发现依然得显式的设置输出文件,并且要设定启动的最大核心数:
snakemake --cores 4 -p results/plots/quals.svg
执行snakemake后看看目录下内容:
$ tree
.
├── data
│ ├── genome.fa
│ ├── genome.fa.amb
│ ├── genome.fa.ann
│ ├── genome.fa.bwt
│ ├── genome.fa.fai
│ ├── genome.fa.pac
│ ├── genome.fa.sa
│ └── samples
│ ├── A.fastq
│ ├── B.fastq
│ └── C.fastq
├── results
│ ├── calls
│ │ └── all.vcf
│ ├── mapped
│ │ ├── A.bam
│ │ ├── A.sorted.bam
│ │ ├── B.bam
│ │ ├── B.sorted.bam
│ │ ├── C.bam
│ │ └── C.sorted.bam
│ └── plots
│ └── quals.svg
├── scripts
│ └── plot-quals.py
└── Snakefile
不过我这里尝试生成report却发生报错了:
snakemake --report report.html
很长的报错,其中内容包括:
snakemake report Failed to establish a new connection: [Errno 111] Connection refused'))
显示和github 需要建立某个联系。但从文档来看,report 作用仅仅是生成说明我的workflow 的流程记录,这里并不是很明白。
既然小的测试文件成功执行了。能不能推广到DIY 如转录组在内的流程呢?
[1]
Basics: An example workflow — Snakemake 7.8.0 documentation: https://snakemake.readthedocs.io/en/stable/tutorial/basics.html
[2]
Snakefiles and Rules — Snakemake 7.8.0 documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-external-scripts
[3]
Short tutorial — Snakemake 7.8.0 documentation: https://snakemake.readthedocs.io/en/stable/tutorial/short.html
扫码关注腾讯云开发者
领取腾讯云代金券
Copyright © 2013 - 2025 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有
深圳市腾讯计算机系统有限公司 ICP备案/许可证号:粤B2-20090059 深公网安备号 44030502008569
腾讯云计算(北京)有限责任公司 京ICP证150476号 | 京ICP备11018762号 | 京公网安备号11010802020287
Copyright © 2013 - 2025 Tencent Cloud.
All Rights Reserved. 腾讯云 版权所有