Question

Using GNU parallel to speed up merging VCFs with bcftools

4.2 years ago

mfshiller ▴ 20

I have a bunch of VCF files to merge. bcftools can't handle them all at once, so I have to merge in batches. I would like to use GNU parallel to do this because I'm working on an Amazon EC2 instance through PuTTY, which sometimes crashes and leaves the process unfinished. How could I do this?

Edit: This is what I ended up doing, in case this is useful to anyone down the line:

ls *vcf.gz > vcf.list
parallel --max-args 30 bcftools merge {} -Oz -o batch_merge{#}.vcf.gz :::: vcf.list

This merges batches of 30 files in parallel. I had almost 900 VCFs to merge, so it went pretty quickly.

vcf parallel gwas bcftools bash • 3.2k views

You might be interested in some of the comments here: https://shicheng-guo.github.io/bioinformatics/1923/02/28/bcftools-merge

ADD REPLY • link 4.2 years ago by steve ★ 3.5k

score 2 · Answer 1 · 2021-02-08

4.2 years ago

ole.tange ★ 4.5k

First: Learn tmux. This way you can let PuTTY crash, and you can reconnect and continue where you left off. It is also excellent for starting a job at work, and then reconnecting when you get home to see how it is doing.

I use: tmux, CTRL-b CTRL-c, CTRL-b CTRL-n, CTRL-b CTRL-p, CTRL-b CTRL-d, and tmux at.

See more at: https://www.hostinger.com/tutorials/tmux-beginners-guide-and-cheat-sheet/

Then look at GNU Parallel. Read chapters 1 and 2 of https://zenodo.org/record/1146014 (it should take you no more than 15 minutes).

Then write out the complete commands you want to run. When you have written out 3, you can see there is a pattern. Try to replicate that pattern with GNU Parallel - use --dryrun to see if you made it do the right thing.
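As a minimal sketch of that dry-run step, assuming the batch-merge command from the question: the helper name preview_merge and its two arguments (a file listing one VCF per line, and a batch size) are made up for illustration. It only prints the commands GNU Parallel would run and does not execute them.

```shell
# preview_merge is a hypothetical helper, not part of GNU Parallel or bcftools.
# $1 = file with one VCF path per line, $2 = number of files per batch.
preview_merge() {
    local list=$1 size=$2
    # --dryrun prints each command instead of running it;
    # --keep-order prints them in input order.
    parallel --dryrun --keep-order --max-args "$size" \
        bcftools merge {} -Oz -o 'batch_merge{#}.vcf.gz' :::: "$list"
}
```

Calling `preview_merge vcf.list 30` should print one bcftools merge command per batch. Once the printed commands match what you would have typed by hand, drop --dryrun to run them for real.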

GNU Parallel is not magic: If you do not know how to run the commands by hand in serial, it is unlikely you will be able to make GNU Parallel do it for you.

I recently had to do the same thing, and also accomplished it with GNU parallel. Code here. Basically:

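# Pass 1: --xargs packs as many file names into each command as the command line allows
find inputs_dir -type f | parallel --jobs 1 --xargs bedops -m {} ">" merged.{#}.bed
# Pass 2: --xargs again, so all intermediate files go to one bedops call instead of one job per file
find . -maxdepth 1 ! -path 'inputs_dir*' -type f -name "merged.*.bed" | parallel --jobs 1 --xargs bedops -m {} ">" merged.bed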

I am using BEDOPS with .bed files; replace that with your bcftools commands and .vcf files. I took a naive "two pass" approach, assuming that the number of output files from the first command will be small enough to combine in a single command with the second. But if you really do have massive numbers of files, you might need to wrap this in a for or while loop that keeps merging until there are no files left unmerged. The important part is that parallel --xargs batches the input files into groups that are small enough to fit on the command line, leaving you with multiple intermediate merge products that you can clean up with one more merge command.
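A rough bcftools version of the same two-pass idea might look like the sketch below. The function name two_pass_merge and its arguments are hypothetical. It assumes GNU parallel and bcftools are on PATH, that the inputs are bgzip-compressed and indexed, and that the temporary directory path contains no spaces. It uses a fixed batch size via --max-args rather than --xargs.

```shell
# two_pass_merge is a hypothetical helper for illustration.
# $1 = directory with *.vcf.gz inputs, $2 = files per batch, $3 = final output file.
two_pass_merge() {
    local indir=$1 size=$2 out=$3
    local tmpdir
    tmpdir=$(mktemp -d) || return 1

    # Pass 1: merge batches in parallel, then index each intermediate file,
    # because bcftools merge expects indexed inputs.
    # Caveat: a final batch holding a single file may be rejected by bcftools merge;
    # check whether your bcftools version has --force-single for that case.
    find "$indir" -maxdepth 1 -type f -name '*.vcf.gz' | sort |
        parallel --max-args "$size" \
            "bcftools merge {} -Oz -o $tmpdir/batch_{#}.vcf.gz && bcftools index $tmpdir/batch_{#}.vcf.gz" ||
        return 1

    # Pass 2: merge the intermediate files into the final output, then clean up.
    bcftools merge "$tmpdir"/batch_*.vcf.gz -Oz -o "$out" && rm -r "$tmpdir"
}
```

For example, `two_pass_merge vcfs 30 all_merged.vcf.gz` would merge everything under a directory named vcfs in batches of 30. Writing the intermediate files to a separate directory keeps them from being picked up as inputs if the command is run again.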

ADD REPLY • link 4.2 years ago by steve ★ 3.5k

Thanks for your suggestion! I ended up doing something pretty similar (see edited OP), using --max-args instead of --xargs.

ADD REPLY • link 4.2 years ago by mfshiller ▴ 20

Thanks. I ended up finding a solution with Parallel that worked very well for me (see edited OP).

ADD REPLY • link 4.2 years ago by mfshiller ▴ 20

score 0 · Answer 2 · 2021-02-08

4.2 years ago

ATpoint 87k

A: How to parallelize bcftools mpileup with GNU parallel?

A: Samtools mpileup taking a long time to finish

The idea can be generalized though. You could also define the batches up front and then use SLURM (or whatever scheduler you use) arrays to submit a job for each batch.
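A sketch of defining the batches up front, under some assumptions: the function names make_batches and merge_batch and the batch_NNN.list naming are invented for illustration, GNU split is assumed (for --additional-suffix), and the input VCFs are assumed to be bgzip-compressed and indexed.

```shell
# Split a list of VCF paths into fixed-size lists: batch_000.list, batch_001.list, ...
# $1 = file with one VCF path per line, $2 = files per batch.
make_batches() {
    local list=$1 size=$2
    split -l "$size" -d -a 3 --additional-suffix=.list "$list" batch_
}

# Merge one batch, selected by a zero-based index such as $SLURM_ARRAY_TASK_ID.
merge_batch() {
    local id
    id=$(printf '%03d' "$1")
    # --file-list makes bcftools read the input file names from the batch list
    bcftools merge --file-list "batch_${id}.list" -Oz -o "batch_merge_${id}.vcf.gz"
}
```

With roughly 900 files and batches of 30, `make_batches vcf.list 30` gives 30 lists. A SLURM array job submitted with `#SBATCH --array=0-29` could then call `merge_batch "$SLURM_ARRAY_TASK_ID"` in each task, followed by one final merge of the batch outputs.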
