Question

How to automate bwa, samtools and bcftools up to the variant calling step without saving intermediate files

1


2.7 years ago

Ganaile ▴ 10

I am new to NGS and my task is to align paired-end (PE) reads from WGRS data to a reference genome and call variants. I know how to run bwa, samtools and bcftools separately, using the output files of each as input to the next. I would like to avoid saving the big intermediate output files, such as SAM files, and instead use pipes (|) to pass the bwa output on to the next step, and so on, through to the final variant calling step. I'd appreciate any help with this.

bcftools sam bwa samtools • 1.7k views

updated 2.7 years ago by jkbonfield ★ 1.3k • written 2.7 years ago by Ganaile ▴ 10

score 4 · Answer 1 · 2022-03-17


2.7 years ago

Pierre Lindenbaum 164k

for example:

bwa mem ref.fa R1.fq.gz R2.fq.gz | samtools sort -T tmp -O BAM -o sorted.bam
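
One thing worth adding if a one-liner like this goes into a script: by default the shell only reports the exit status of the last command in a pipe, so a crash in bwa can go unnoticed while samtools still writes an incomplete BAM. A minimal sketch, assuming bash; align_sorted is a made-up helper name and the file names are placeholders:

```shell
#!/usr/bin/env bash
# With pipefail, a pipeline fails if ANY stage fails, not only the last one.
set -o pipefail

# Hypothetical helper (not part of bwa or samtools):
# align one paired-end sample and write a coordinate-sorted BAM.
# Args: reference FASTA, R1 FASTQ, R2 FASTQ, output BAM.
align_sorted() {
    local ref=$1 r1=$2 r2=$3 out=$4
    bwa mem "$ref" "$r1" "$r2" |
        samtools sort -T "${out%.bam}.tmp" -O BAM -o "$out" -
}

# Usage (assumes `bwa index ref.fa` has been run beforehand):
#   align_sorted ref.fa R1.fq.gz R2.fq.gz sorted.bam || echo "alignment failed" >&2
```

With this in place, a non-zero exit from bwa makes the whole function call fail, so a wrapper script can stop instead of carrying on with partial data.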

Regarding "I would like to avoid saving the intermediate big output files such as sam": you can also use a workflow manager and mark those files as temporary: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#protected-and-temporary-files

Further, an output file marked as temp is deleted after all rules that use it as an input are completed:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"


3


In addition to the above, you may wish to consider a few additional tweaks. When piping multiple samtools or bcftools commands together, it makes no sense to be bgzipping the output between the pipes, so make use of -u in samtools and -Ou in bcftools. Note that the sort step is something of a crunch point, so that's a logical step to save the output (which obviously avoids having to restart everything if you need to tweak the parameters for a later stage).

Eg for bam something like (untested!):

bwa mem ref.fa R1.fq.gz R2.fq.gz | \
    samtools fixmate -u -m - - | \
    samtools sort -@8 -u - | \
    samtools markdup -@8 - dat.bam

Running fixmate -m and markdup on either side of sort avoids needing an additional sorting or collation stage for duplicate marking. The first stage gathers data about multiple alignments from the same template and adds it to aux tags, sort then does the sorting, and markdup uses the aux tags written earlier for its duplicate marking algorithm. This is therefore substantially quicker than most implementations and avoids needless temporary files and re-sorting stages. Note that, because sort is a crunch point, it's unlikely the 8 threads I specified for markdup (used for writing the BAM) will run concurrently with the 8 threads given to sort.

Then in bcftools land, something like (also untested):

bcftools mpileup -Ou -f $HREF38 dat.bam | \
    bcftools call -Ou -vm - | \
    bcftools norm -f $HREF38 -Oz -o dat.vcf.gz -

I may have got some options wrong there, like what needs "-" for stdin / stdout, but that's the basic gist of it. You can tweak it further, e.g. adding more threads to the things that will benefit (mainly sort and the final BAM output) or giving more memory to sort, but remember that, somewhat counter-intuitively, sort's memory setting (-m) is per thread.
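
To make the per-thread point concrete: with -@8 and -m 2G, sort may use roughly 8 x 2G = 16G in total, not 2G. A small sketch that goes the other way and derives a -m value from an overall RAM budget; sort_mem_per_thread is a hypothetical helper and the numbers are only examples:

```shell
# Hypothetical helper: split a total RAM budget (in MB) across sort threads
# and print a value suitable for `samtools sort -m` (which is per thread).
sort_mem_per_thread() {
    local budget_mb=$1 threads=$2
    echo "$(( budget_mb / threads ))M"
}

threads=8
mem=$(sort_mem_per_thread 16384 "$threads")   # 16 GB budget, 8 threads
# Prints the sort stage with the derived setting:
#   samtools sort -@8 -m 2048M -u -
echo "samtools sort -@${threads} -m ${mem} -u -"
```

The figure is approximate, so it is probably wise to leave some headroom below the machine's physical memory.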

Edit: also in older versions of samtools not all commands had a -u option for uncompressed output. If you hit that problem, you can do the laborious -O bam,level=0 alternative.
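
To automate the two stages above over several samples, one option is to wrap them in a shell function and loop. This is an untested sketch in the same spirit: process_sample and the ${sample}_R1.fq.gz / ${sample}_R2.fq.gz naming are assumptions for illustration, and the sorted, duplicate-marked BAM is deliberately kept as the one saved intermediate.

```shell
#!/usr/bin/env bash
set -o pipefail   # a failure in any pipe stage fails the whole pipeline

# Hypothetical helper: align, mark duplicates and call variants for one sample.
# Args: reference FASTA, sample name (reads assumed to be
# ${sample}_R1.fq.gz and ${sample}_R2.fq.gz in the current directory).
process_sample() {
    local ref=$1 sample=$2

    # Stage 1: stream uncompressed BAM between samtools steps, save at the end.
    bwa mem "$ref" "${sample}_R1.fq.gz" "${sample}_R2.fq.gz" |
        samtools fixmate -u -m - - |
        samtools sort -@8 -u -T "${sample}.tmp" - |
        samtools markdup -@8 - "${sample}.bam" || return 1

    # Stage 2: stream uncompressed BCF between bcftools steps, save the VCF.
    bcftools mpileup -Ou -f "$ref" "${sample}.bam" |
        bcftools call -Ou -vm - |
        bcftools norm -f "$ref" -Oz -o "${sample}.vcf.gz" -
}

# Usage (sample names are placeholders):
#   for s in sampleA sampleB; do process_sample ref.fa "$s" || break; done
```

Apart from sort's own temporary files, only the BAM and the final VCF are written to disk, and a failed sample stops the loop rather than producing a partial result.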

2.7 years ago by jkbonfield ★ 1.3k

score 0 · Answer 2 · 2022-03-17

What you want to do is to create a pipeline. There are various tools for creating pipelines, like snakemake or nextflow. Shell command pipelines are unwieldy in my experience and since all the tools you mentioned are accessed via shell commands (bash, or what have you), I can recommend snakemake.

You can check how to install it here. Snakemake is smart: you can wrap a rule's output files in temp() so that they are deleted once every rule that needs them as input has finished; you can see that here under Temporary files.

EDIT: Just adding on to what Pierre said, snakemake executes rules according to the connections between the input and output files specified in the rules. For example, if the input of rule 1 is the output of rule 2, rule 2 will be executed before rule 1. But if you have a rule 3 somewhere in your Snakefile that doesn't contribute to the workflow in any way, it won't be executed. By default, snakemake takes the first rule you defined as the target, looks at that rule's input, and works out which other rules need to be executed to produce it. So you put the thing you get last at the very top of your Snakefile.

If you need help with snakemake, you can DM me anytime.