前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >snakemake杂记:多个转录组比对到多个基因组得到多个bam文件然后合并

snakemake杂记:多个转录组比对到多个基因组得到多个bam文件然后合并

作者头像
用户7010445
发布2023-08-23 10:41:21
2000
发布2023-08-23 10:41:21
举报

我的需求是:

我有10个基因组,然后又12个转录组数据,然后将这个12个基因组数据分别比对到这个10个基因组,每个基因组得到12个bam文件,然后将每个基因组的12个bam文件合并 ,最终得到10个合并的bam文件

前面比对的步骤没有遇到问题

代码语言:javascript
复制
SAMPLES, = glob_wildcards("../../rnaseq/01.clean.reads/{sample}_clean_1.fq")
GENOMES, = glob_wildcards("{genome}.fa.masked")

GENOMES = [genome.split("/")[0] for genome in GENOMES if len(genome.split("/")) == 3]
GENOMES.remove("D1")

print(GENOMES)
print(SAMPLES)

print("Total genome size: ",len(GENOMES))
print("Total sample size: ",len(SAMPLES))




rule all:
    input:
        expand("{genome}/{genome}/ref_index.1.ht2",genome=GENOMES),
        expand("{genome}/{genome}/{sample}.sam",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/{sample}.sorted.bam.bai",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/merged.bam",genome=GENOMES),
        expand("{genome}/{genome}/merged.gtf",genome=GENOMES)


rule hisat2_build:
    input:
        ref = "{genome}/{genome}/{genome}.fa.masked"
    output:
        "{genome}/{genome}/ref_index.1.ht2"
    threads:
        8
    resources:
        mem_mb = 48000
    params:
        "{genome}/{genome}/ref_index"
    shell:
        """
        hisat2-build {input.ref} {params} -p {threads}
        """

rule hisat2_align:
    input:
        index = rules.hisat2_build.output,
        r1 = "../../rnaseq/01.clean.reads/{sample}_clean_1.fq",
        r2 = "../../rnaseq/01.clean.reads/{sample}_clean_2.fq"
    output:
        sam = "{genome}/{genome}/{sample}.sam"
    threads:
        8
    resources:
        mem_mb = 64000
    params:
        ref_index = rules.hisat2_build.params
    shell:
        """
        hisat2 -p {threads} --dta -x {params.ref_index} \
        -1 {input.r1} -2 {input.r2} -S {output.sam}
        """

rule samtools_sort:
    input:
        sam = rules.hisat2_align.output.sam
    output:
        bam = "{genome}/{genome}/{sample}.sorted.bam"
    threads:
        8
    resources:
        mem_mb = 64000
    shell:
        """
        samtools sort -@ {threads} -O bam -o {output.bam} {input.sam}
        """

rule samtools_index:
    input:
        rules.samtools_sort.output.bam
    output:
        "{genome}/{genome}/{sample}.sorted.bam.bai"
    threads:
        8
    resources:
        mem_mb = 24000
    shell:
        """
        samtools index {input}
        """

到合并的步骤最开始的写法是

代码语言:javascript
复制
rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=GENOMES,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=GENOMES,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这样写的问题是合并的时候每个基因组对应的是120个bam文件

然后换成下面的写法

代码语言:javascript
复制
rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个报错提示

name 'wildcards' is not defined

然后又换成下面的写法

代码语言:javascript
复制
def getGenomes(wildcards):
  return wildcards.genome

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=getGenome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=getGenome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个还是报错,报错内容忘记截图了,而且报错很诡异

然后以关键词 snakemake lambda wildcards expand 搜索,找到了一个链接

https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command

这里面提到

'wildcards' is not "directly" defined in the input of a rule. You need to use a [function of 'wildcards'](http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files) instead. I'm not sure I understand exactly what you want to do, but you may try something like that.

这里提供了一个写法

代码语言:javascript
复制
def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,    
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...

安装这个思路我写我自己的

代码语言:javascript
复制
def getGenome01(wildcards):
    return expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES)

def getGenome02(wildcards):
    return expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)

rule samtools_merge:
    input:
        bams = getGenome01,
        bai = getGenome02
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个是可以正常运行的

推文记录的是自己的学习笔记,内容可能会存在错误,请大家批判着看,欢迎大家指出其中的错误

欢迎大家关注我的公众号

小明的数据分析笔记本

小明的数据分析笔记本 公众号 主要分享:1、R语言和python做数据分析和数据可视化的简单小例子;2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记;3、生物信息学入门学习资料及自己的学习笔记!

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2023-05-18,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 小明的数据分析笔记本 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档