How to automate bwa, samtools, bcftools upto variant calling steps without saving intermediate file
2
1
Entering edit mode
2.7 years ago
Ganaile ▴ 10

I am new to NGS and my task is to assemble PE files from WGRS data to a reference genome and call for variants. I know how to run bwa, samtools and bcftools separately and use their corresponding output files to use as input . I would like to avoid saving the intermediate big output files such as sam files and instead use pipe (|) to proceed for example to next step bwa and the likes until the final variant call steps. Appreciate any help with this.

bcftools sam bwa samtools • 1.7k views
ADD COMMENT
4
Entering edit mode
2.7 years ago

for example:

bwa mem ref.fa R1.fq.gz R2.fq.gz | samtools sort -T tmp -O BAM -o sorted.bam

I would like to avoid saving the intermediate big output files such as sam

you can also use a workflow manager and specify the temporary files: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#protected-and-temporary-files

Further, an output file marked as temp is deleted after all rules that use it as an input are completed:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"
ADD COMMENT
3
Entering edit mode

In addition to the above, you may wish to consider a few additional tweaks. When piping multiple samtools or bcftools commands together, it makes no sense to be bgzipping the output between the pipes, so make use of -u in samtools and -Ou in bcftools. Note that the sort step is something of a crunch point, so that's a logical step to save the output (which obviously avoids having to restart everything if you need to tweak the parameters for a later stage).

Eg for bam something like (untested!):

bwa mem ref.fa R1.fq.gz R2.fq.gz |  \
    samtools fixmate -u -m - - | \
    samtools sort -@8 -u - - | \
    samtools markdup -@8 - dat.bam

The fixmate -m and markdup either side of sort avoids needing an additional sorting or collation stage in duplicate marking. It gathers data about multiple alignments from the same template in the first stage, adds them to aux tags, does the sorting, and then uses those aux tags written earlier for the duplicate marking algorithm. This is therefore substantially quicker than most implementations and avoids needless temporary files and re-sorting stages. Note due to the crunch point of sort, it's unlikely the 8 threads I specified for markdup writing out of BAM will run concurrently with the 8 threads given to sort.

Then in bcftools land, something like (also untested):

bcftools mpileup -Ou -f $HREF38 dat.bam | \
    bcftools call -Ou -vm - | \
    bcftools norm -f $HREF38 -Oz -o dat.vcf.gz -

I may have got some options wrong there, like what needs "-" for stdin / stdout, but that's the basic gist of it. You can tweak it further, eg adding more threads to things that will benefit (sort mainly and the final BAM output) or more memory to sort, but remember it's bizarre and is the memory per-thread.

Edit: also in older versions of samtools not all commands had a -u option for uncompressed output. If you hit that problem, you can do the laborious -O bam,level=0 alternative.

ADD REPLY
0
Entering edit mode
2.7 years ago
Ivan ▴ 60

What you want to do is to create a pipeline. There are various tools for creating pipelines, like snakemake or nextflow. Shell command pipelines are unwieldy in my experience and since all the tools you mentioned are accessed via shell commands (bash, or what have you), I can recommend snakemake.

You can check how to install it here. Snakemake is smart, and you can include temp brackets around the output or your input files - you can see that here under Temporary files.

EDIT: Just adding on to what Pierre said, snakemake executes rules according to connections of input and output files specified in rules. For example, if input of rule 1 is the output of rule 2, rule 2 will be executed before rule 1. But if you have a rule 3 somewhere in your snakefile that doesn't contribute to workflow in any way, it won't be executed. Snakemake usually takes the first rule you defined, and sees what input of that rule is, and what rules are all needed to be executed to get the first rule you defined. So you put the thing you get the last at the very top of your snakefile.

If you need help with snakemake, you can DM anytime.

ADD COMMENT
0
Entering edit mode

None of those are true "pipelines" in the UNIX sense though. They run a series of jobs, saving the output from each to use as the input to the next. This is the exact opposite of what the poster is asking for, which is a genuine UNIX pipe.

ADD REPLY

Login before adding your answer.

Traffic: 1892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6