前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >流程管理工具snakemake学习笔记杂记

流程管理工具snakemake学习笔记杂记

作者头像
用户7010445
发布2022-05-23 15:53:03
9110
发布2022-05-23 15:53:03
举报
文章被收录于专栏:小明的数据分析笔记本

snakemake学习笔记001:使用fastp对原始数据过滤

参考

  • 1 https://www.jianshu.com/p/14b9eccc0c0e
  • 2 https://stackoverflow.com/questions/56271154/use-snakemake-pair-end-bwa-alignment
  • 3 https://bioinformaticsonline.com/snippets/view/43590/rules-to-run-fastp-snakemake
  • 4 https://xizhihui.github.io/2018/10/28/%E6%B5%81%E7%A8%8B%E6%9E%84%E5%BB%BA-Snakemake%E4%BD%BF%E7%94%A8%E5%88%9D%E6%AD%A5/
  • 5 https://www.youtube.com/watch?v=_wUGzqEjg6A&t=996s
  • 6 https://www.youtube.com/watch?v=Bdxh0fSBOSA
  • 7 https://www.youtube.com/watch?v=aZ0QlqtR2KM&t=279s
代码语言:javascript
复制
import os
import glob

input_folder = "/home/myan/scratch/private/pome_WGS/huoyan_14_variety/"
output_folder = "/home/myan/scratch/private/practice_data/snakemake/20220412/fastp.results/"

SRA,FRR = glob_wildcards(input_folder+"{sra}_{frr}.fq.gz")

rule all:
    input:
        expand(output_folder+"{sra}.html",sra=SRA),
        expand(output_folder+"{sra}.json",sra=SRA)


rule fastp:
    input:
        read01 = input_folder + "{sra}_1.fq.gz",
        read02 = input_folder + "{sra}_2.fq.gz"
    output:
        read01 = output_folder + "{sra}_1.clean.fq.gz",
        read02 = output_folder + "{sra}_2.clean.fq.gz",
        html = output_folder + "{sra}.html",
        json = output_folder + "{sra}.json"
    threads:
        8
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} --thread {threads} --html {output.html} --json {output.json}
        """

这里rule all的作用还是没有搞明白,看有的文档说是最终保留的文件 ,我这里rule all 只写了了最终的html和json,但是最终的结果里是有过滤后的fastq文件的

还有好多基础知识需要看

路径里的文件夹如果不存在会新建一个文件夹

snakemake学习笔记002:hisat2+samtools+stringtie流程转录组分析

今天的内容增加了config文件

代码语言:javascript
复制
input_folder:
    "/home/myan/scratch/private/practice_data/RNAseq/chrX_data/"
output_folder:
    "/home/myan/scratch/private/practice_data/RNAseq/snakemake.rnaseq/"

index_folder:
    "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/indexes/"

index_name:
    "chrX_tran"

samples_folder:
    "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"

gtf_folder:
    "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf"

config文件主要用来指定文件的存贮路径

snakemake文件的内容

代码语言:javascript
复制
configfile: "config.yaml"

import os
import glob
print(config)
print(config['input_folder'])
print(config['output_folder'])

SRR,FRR = glob_wildcards(config['samples_folder']+"{srr}_chrX_{frr}.fastq.gz")

rule all:
    input:
        expand(config["output_folder"] + "output.sam/{srr}.sam",srr=SRR),
        expand(config['output_folder'] + "output.bam/{srr}.bam",srr=SRR),
        expand(config['output_folder'] + "output.gtf/{srr}.gtf",srr=SRR)

rule hisat2:
    input:
        read01 = config['samples_folder'] + "{srr}_chrX_1.fastq.gz",
        read02 = config['samples_folder'] + "{srr}_chrX_2.fastq.gz",

    output:
        sam = config['output_folder'] + "output.sam/{srr}.sam"
    threads:
        8
    params:
        index_file = config['index_folder']+config['index_name']
    shell:
        """
        hisat2 -p {threads} --dta -x {params.index_file} -1 {input.read01} -2 {input.read02} -S {output.sam}
        """

rule samtools:
    input:
        sam = rules.hisat2.output.sam
    output:
        bam = config['output_folder'] + "output.bam/{srr}.bam"
    threads:
        8
    shell:
        """
        samtools sort -@ {threads} -o {output.bam} {input.sam}
        """

rule stringtie:
    input:
        gtf = config['gtf_folder'],
        bam = rules.samtools.output.bam
    output:
        gtf = config['output_folder'] + "output.gtf/{srr}.gtf"
    params:
        l = "{srr}"
    threads:
        8
    shell:
        """
        stringtie -p {threads} -G {input.gtf} -o {output.gtf} -l {params.l} {input.bam}
        """

后面转录本定量的步骤还没有写完,好像还可以把差异表达分析的脚本嵌入进来

未完待续

示例数据用到的是论文

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie, and Ballgown

中的数据

snakemake学习笔记003:stringtie合并转录本

代码语言:javascript
复制
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
#rule all:
#    input:
#        expand("output.gtf/"+"{srr}.gtf",srr=SRR),

rule gtflist:
    input:
        gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
    output:
        output_txt = "MergedList.txt"
    run:
        with open(output.output_txt,'w') as f:
            for gtf in input.gtffiles:
                print(gtf,file=f)

第一行SRR后面有一个逗号

最开始没写这个逗号,一直遇到报错

image.png

Building DAG of jobs... MissingInputException in line 3 of /mnt/shared/scratch/myan/private/practice_data/RNAseq/snakemake.rnaseq/gtf_list.py: Missing input files for rule all: output.gtf/['ERR188337', 'ERR188245', 'ERR188428', 'ERR188401', 'ERR204916', 'ERR188383', 'ERR188104', 'ERR188257', 'ERR188273', 'ERR188044', 'ERR188234', 'ERR188454'].gtf

snakemake学习笔记004:前后两个rule为啥连不起来呢?

代码语言:javascript
复制
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
rule all:
    input:
        #expand("output.gtf/"+"{srr}.gtf",srr=SRR),
        "MergedList.txt",

rule gtflist:
    input:
        gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
    output:
        output_txt = "MergedList.txt"
    run:
        with open(output.output_txt,'w') as f:
            for gtf in input.gtffiles:
                print(gtf,file=f)



rule stringtieMerge:
    input:
        refgtf = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf",
        gtflist = "MergedList.txt"
    output:
        gtf = "output.merged.gtf/StringtieMerged.gtf"
    log:
        "output.merged.gtf/StringtieMerge.logs"
    threads:
        1
    params:
        "-m 200 -c 3"
    shell:
        """
        stringtie --merge {params} -p {threads} -G {input.refgtf} -o {output.gtf} {input.gtflist}
        """

第二个rule就是不运行

原来是在rule all 代码里少写了 第二个rule的输出文件

正确写法是

代码语言:javascript
复制
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
rule all:
    input:
        #expand("output.gtf/"+"{srr}.gtf",srr=SRR),
        "MergedList.txt",
        "output.merged.gtf/StringtieMerged.gtf"

rule gtflist:
    input:
        gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
    output:
        output_txt = "MergedList.txt"
    run:
        with open(output.output_txt,'w') as f:
            for gtf in input.gtffiles:
                print(gtf,file=f)



rule stringtieMerge:
    input:
        refgtf = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf",
        gtflist = "MergedList.txt"
    output:
        gtf = "output.merged.gtf/StringtieMerged.gtf"
    log:
        "output.merged.gtf/StringtieMerge.logs"
    threads:
        1
    params:
        "-m 200 -c 3"
    shell:
        """
        stringtie --merge {params} -p {threads} -G {input.refgtf} -o {output.gtf} {input.gtflist}
        """

rule all 的作用是啥我还是不懂

snakemake学习笔记005:流程里使用R脚本

代码语言:javascript
复制
rule useRscriptinsnakemake:
    input:
        ballgown_folder = "ballgown",
        pheno = "geuvadis_phenodata.csv"
    output:
        rdat = "bg_chrX_1.Rdata"
    #conda:
        #"envs/ballgown.yaml"
    script:
        "scripts/ballgown_1.R"

尝试嵌入conda的时候遇到报错,暂时不知道是什么原因 我的ballgown.yaml文件

代码语言:javascript
复制
name: rnaseq_pra
channels:
  - conda-forge
  - biocondas
dependencies:
  - r=4.0=r40hd8ed1ab_1004
  - bioconductor-ballgown=2.22.0=r40hdfd78af_1

我的ballgown_1.R文件

代码语言:javascript
复制
library(ballgown)
library(genefilter)

pheno_data = read.csv(snakemake@input[["pheno"]])
bg_chrX = ballgown(dataDir = snakemake@input[["ballgown_folder"]],samplePattern = "ERR",pData = pheno_data)
bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) > 1",genomesubset=TRUE)
results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars=c("population"),getFC=TRUE,meas="FPKM")
save(bg_chrX,results_transcripts,file=snakemake@output[["rdat"]])

这里有一个问题是snakemake流程里怎么样使用已经存在的conda环境,看这个流程的时候 https://github.com/Alipe2021/NLncCirSmk/blob/master/NLncCirSmk.smk

他的写法是

代码语言:javascript
复制
rule Part06_CircRNA_Analysis_06_Bam2Anchors:
    input:
        bam = OUTPUTDIR + "./CircRNA/05.UnmappedBam/{sample}.unmapped.bam",
        bai = OUTPUTDIR + "./CircRNA/05.UnmappedBam/{sample}.unmapped.bai"
    output:
        fq = OUTPUTDIR + "./CircRNA/06.Bam2Anchors/{sample}/unmapped_anchors.fq.gz"
    params:
        DIR = OUTPUTDIR + "./CircRNA/06.Bam2Anchors/{sample}"
    run:
        import os
        import subprocess
        if not os.path.exists(params.DIR):
            os.makedirs(params.DIR)

        cmd = ("source activate find_circ_env && unmapped2anchors.py {bam} | "
        " gzip > {fq} && conda deactivate").format(bam=input.bam, fq=output.fq)

        print(cmd)
        subprocess.call(cmd, shell=True)
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2022-05-06,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 小明的数据分析笔记本 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • snakemake学习笔记001:使用fastp对原始数据过滤
    • 参考
    • snakemake学习笔记002:hisat2+samtools+stringtie流程转录组分析
    • snakemake学习笔记003:stringtie合并转录本
      • 第一行SRR后面有一个逗号
      • snakemake学习笔记004:前后两个rule为啥连不起来呢?
      • snakemake学习笔记005:流程里使用R脚本
      领券
      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档