workflow05-snakemake的进阶操作一

北野茶缸子

发布于 2022-07-07 14:51:53

1.1K00

代码可运行

文章被收录于专栏：北野茶缸子的专栏北野茶缸子的专栏

运行总次数：0

代码可运行

Date : [[2022-05-29_Sun]]
Tags : #工作流/snakemake
参考：
- Snakemake Tutorial[1]

前言

继续介绍一些snakemake的进阶操作。

1-指定软件使用的线程

如bwa 等软件，我们可以分配多线程以提高任务的执行速度的。同样，我们可以把线程的信息配置在规则中：

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

在snakemake 执行的时候，我们需要制定--cores 参数，设置snakemake 全部任务执行时，不超过的最大线程数。

比如当bwa 规则调用了8个线程，snakemake 则会将剩下的线程分配给其他数据执行bwa 以外的线程消耗数目较少的任务。

比起一般的bash 脚本来说，snakemake 的一个优势在于，其会充分调度线程，以确保线程充分被调度。（真是优秀员工啊！）

2-配置文件

我们可以在snakemake中，将使用的通配符或文件信息，写到config 文件中，并通过config访问：

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

configfile:"config.yaml"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

详细用法，可以参考：Configuration — Snakemake 7.8.0 documentation[2]

单纯从这点上，我并没有体会到config 的便利。

但是，如果是给外部用户使用呢？或者是应对不同的场景需求，设置参数呢？比如：human_genomics_pipeline/config.yaml at master · ESR-NZ/human_genomics_pipeline (github.com)[3]

提供了一个模板：

##############################
###### Overall workflow ######
##############################

# Type of input data (either 'Single' or 'Cohort')
DATA: ""

# Should the pipeline be GPU accelerated where possible? (either 'Yes' or 'No')
GPU_ACCELERATED: ""

# File path to the reference genome (.fasta)
REFGENOME: ""

# File path to dbSNP database
dbSNP: ""

# Temporary file directory
TEMPDIR: ""

# Whole exome sequence settings (leave blank if analysing other data such as whole genome sequence data)
WES:
  # File path to the exome capture regions over which to operate
  INTERVALS: ""
  # Padding (in bp) to add to each region
  PADDING: ""

##############################
##### Pipeline resources #####
##############################

# Number of threads to use per rule/sample for multithreaded rules, multithreading will significantly speed up these rules (diminishing speed gains beyond 8 threads)
THREADS:

# Maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
MAXMEMORY: ""

# Maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU)
GPU:

##############################
########## Trimming ##########
##############################

# Whether or not to trim the raw fastq reads (either 'Yes' or 'No')
TRIM: ""

# If trimming, choose the adapter sequence to be trimmed (eg. `--illumina`, `--nextera` or `--small_rna`) or pass adapter sequences to the `-a` and `-a2` flags
TRIMMING:
  ADAPTERS: ""

##############################
##### Base recalibration #####
##############################

# List of resources to used for base recalibration
RECALIBRATION:
  RESOURCES:
    -
    -
    -

就可以非常优雅的设置软件或者是资源配置的参数了。

3-输入区块引入函数

比如我们的配置文件如上：

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

我们就可以通过函数去访问它们：

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

这里使用匿名函数：

lambda wildcards: config["samples"][wildcards.sample]

我们可以像字典一样去访问它，比如当我们传入A 时，即传给了通配符对应的{sample}，并可以获得对应的值data/samples/A.fastq。

4-日志文件

在shell 工作流中，我们会通过重定向，以将输出保存到文件中。snakemake 同样提供了选项。

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_map/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

ps：snakemake 会自动创建子目录，看着流程运转，目录里的文件填满，真舒服啊。

5-临时与保护文件

比如在下面流程中：

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_map/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
  
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

当执行samtools_sort 规则时，即会删除temp 下的文件。我们需要的是排序后的bam，那之前的bam 也确实可以删除节约空间。

而被protected 的文件，无论snakemake 流程如何执行（--forceall），文件始终不会被删除或覆写。

其他内容

rule 的shell 区块可以直接另起一段引号换行，相当于shell 脚本中\ 的换行目的了：

    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

参考资料

[1]

Snakemake Tutorial: https://snakemake.bitbucket.io/snakemake-tutorial.html

[2]

Configuration — Snakemake 7.8.0 documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html

[3]

human_genomics_pipeline/config.yaml at master · ESR-NZ/human_genomics_pipeline (github.com): https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/config/config.yaml

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2022-06-20，如有侵权请联系 cloudcommunity@tencent.com 删除

shell