继续介绍一些snakemake的进阶操作。
如bwa 等软件,我们可以分配多线程以提高任务的执行速度的。同样,我们可以把线程的信息配置在规则中:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
在snakemake 执行的时候,我们需要制定--cores
参数,设置snakemake 全部任务执行时,不超过的最大线程数。
比如当bwa 规则调用了8个线程,snakemake 则会将剩下的线程分配给其他数据执行bwa 以外的线程消耗数目较少的任务。
比起一般的bash 脚本来说,snakemake 的一个优势在于,其会充分调度线程,以确保线程充分被调度。(真是优秀员工啊!)
我们可以在snakemake中,将使用的通配符或文件信息,写到config 文件中,并通过config访问:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
configfile:"config.yaml"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
详细用法,可以参考:Configuration — Snakemake 7.8.0 documentation[2]
单纯从这点上,我并没有体会到config 的便利。
但是,如果是给外部用户使用呢?或者是应对不同的场景需求,设置参数呢?比如:human_genomics_pipeline/config.yaml at master · ESR-NZ/human_genomics_pipeline (github.com)[3]
提供了一个模板:
##############################
###### Overall workflow ######
##############################
# Type of input data (either 'Single' or 'Cohort')
DATA: ""
# Should the pipeline be GPU accelerated where possible? (either 'Yes' or 'No')
GPU_ACCELERATED: ""
# File path to the reference genome (.fasta)
REFGENOME: ""
# File path to dbSNP database
dbSNP: ""
# Temporary file directory
TEMPDIR: ""
# Whole exome sequence settings (leave blank if analysing other data such as whole genome sequence data)
WES:
# File path to the exome capture regions over which to operate
INTERVALS: ""
# Padding (in bp) to add to each region
PADDING: ""
##############################
##### Pipeline resources #####
##############################
# Number of threads to use per rule/sample for multithreaded rules, multithreading will significantly speed up these rules (diminishing speed gains beyond 8 threads)
THREADS:
# Maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
MAXMEMORY: ""
# Maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU)
GPU:
##############################
########## Trimming ##########
##############################
# Whether or not to trim the raw fastq reads (either 'Yes' or 'No')
TRIM: ""
# If trimming, choose the adapter sequence to be trimmed (eg. `--illumina`, `--nextera` or `--small_rna`) or pass adapter sequences to the `-a` and `-a2` flags
TRIMMING:
ADAPTERS: ""
##############################
##### Base recalibration #####
##############################
# List of resources to used for base recalibration
RECALIBRATION:
RESOURCES:
-
-
-
就可以非常优雅的设置软件或者是资源配置的参数了。
比如我们的配置文件如上:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
我们就可以通过函数去访问它们:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
这里使用匿名函数:
lambda wildcards: config["samples"][wildcards.sample]
我们可以像字典一样去访问它,比如当我们传入A
时,即传给了通配符对应的{sample}
,并可以获得对应的值data/samples/A.fastq
。
在shell 工作流中,我们会通过重定向,以将输出保存到文件中。snakemake 同样提供了选项。
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_map/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
ps:snakemake 会自动创建子目录,看着流程运转,目录里的文件填满,真舒服啊。
比如在下面流程中:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_map/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
当执行samtools_sort 规则时,即会删除temp 下的文件。我们需要的是排序后的bam,那之前的bam 也确实可以删除节约空间。
而被protected 的文件,无论snakemake 流程如何执行(--forceall
),文件始终不会被删除或覆写。
rule 的shell 区块可以直接另起一段引号换行,相当于shell 脚本中\
的换行目的了:
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
[1]
Snakemake Tutorial: https://snakemake.bitbucket.io/snakemake-tutorial.html
[2]
Configuration — Snakemake 7.8.0 documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html
[3]
human_genomics_pipeline/config.yaml at master · ESR-NZ/human_genomics_pipeline (github.com): https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/config/config.yaml