speed up bcftools annotate command

6.9 years ago

fenrir.sivar • 0

I'm annotating whole-genome VCFs with dbSNP identifiers using the following command:

bcftools annotate -a "indexed VCF.gz" -c ID "target indexed VCF.gz" > "output.vcf"

This works perfectly but takes about 15 minutes on our server (all data are on SSD-based partitions). For small target VCFs, for example ones containing only a target region on chrY, it helps to add the region to the command with -R regions.txt, but this does not help for larger genome-wide VCFs.

Is there any way to speed up the annotation process?

next-gen snp bcftools • 7.6k views

ADD COMMENT • link updated 5.5 years ago by robby.concha-eloko • 0 • written 6.9 years ago by fenrir.sivar • 0

Hi, did you try GNU parallel? Here are some examples. I have had good experience speeding up my processing tasks with parallel.

ADD REPLY • link 6.9 years ago by Paul ★ 1.5k

Split the analysis per chromosome.
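
A minimal sketch of what splitting per chromosome could look like, assuming both files are bgzipped and indexed and that contig names match between them. The file names, the output naming scheme, and the helper function names are hypothetical. The functions only print commands, one per chromosome, so the list can be inspected first or handed to a job runner such as GNU parallel.

```shell
#!/bin/sh
# Print one `bcftools annotate` command per chromosome (nothing is executed here).
# Usage: print_annotate_jobs ANNOTATION.vcf.gz TARGET.vcf.gz CHROM...
# File names without spaces are assumed, since the commands are printed unquoted.
print_annotate_jobs() {
    ann=$1
    target=$2
    shift 2
    for chrom in "$@"; do
        # -r restricts processing to one region; -Oz writes bgzipped VCF
        echo "bcftools annotate -a $ann -c ID -r $chrom -Oz -o annotated.$chrom.vcf.gz $target"
    done
}

# Print the command that stitches the per-chromosome outputs back together,
# in the order the chromosomes were given.
print_concat_job() {
    parts=""
    for chrom in "$@"; do
        parts="$parts annotated.$chrom.vcf.gz"
    done
    echo "bcftools concat -Oz -o annotated.vcf.gz$parts"
}
```

For example, `print_annotate_jobs dbsnp.vcf.gz target.vcf.gz chr1 chr2 chrY | parallel -j 3` would run three jobs at once, and `print_concat_job chr1 chr2 chrY | sh` would then join the results. Whether this beats a single run depends on the data and the machine, so it is worth timing both.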

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 166k

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLY • link 6.9 years ago by Ram 45k

Hi, using the BCF format is quicker for this kind of command.
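
One possible reading of this suggestion, as an untested sketch: convert the target to BCF once, then annotate with BCF as both input and output. The file names and the function name are hypothetical, and the thread does not establish whether the conversion time is recovered, so the steps are only printed here for inspection and timing.

```shell
#!/bin/sh
# Print the steps for one possible BCF-based workflow (nothing is executed here).
# Usage: print_bcf_workflow ANNOTATION.vcf.gz TARGET.vcf.gz
# File names without spaces are assumed, since the commands are printed unquoted.
print_bcf_workflow() {
    ann=$1
    target=$2
    # 1. Convert the target to compressed BCF (a one-off cost)
    echo "bcftools view -Ob -o target.bcf $target"
    # 2. Index the BCF so it can be accessed by region
    echo "bcftools index target.bcf"
    # 3. Annotate, reading and writing BCF
    echo "bcftools annotate -a $ann -c ID -Ob -o annotated.bcf target.bcf"
}
```

Running `print_bcf_workflow dbsnp.vcf.gz target.vcf.gz | sh` would execute the three steps in order; timing step 1 separately from step 3 shows whether the conversion pays for itself.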

ADD REPLY • link 5.5 years ago by robby.concha-eloko • 0

Where do you recommend using the BCF format? As the output format, or as the annotation file or as the input file? How does that compare against using a VCF in those places, and what about the VCF <-> BCF (or vice versa) conversion times? Please give us more details on your suggestions.

ADD REPLY • link 5.5 years ago by Ram 45k

annotate has an option for multithreading: --threads

http://samtools.github.io/bcftools/bcftools.html#annotate

ADD REPLY • link 5.2 years ago by curious ▴ 890

--threads INT Use multithreading with INT worker threads. The option is currently used only for the compression of the output stream, only when --output-type is b or z. Default: 0.

As the OP generates uncompressed VCF, this option would have no effect.
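
Going by the documentation quoted above, the option should only matter once the output is compressed. A sketch of the original command rewritten that way, with hypothetical file names, a hypothetical function name, and an arbitrary default thread count. Setting DRY_RUN=1 prints the command instead of running it.

```shell
#!/bin/sh
# Annotate IDs and write bgzipped output so that --threads has something to do.
# Usage: annotate_compressed ANNOTATION.vcf.gz TARGET.vcf.gz OUTPUT.vcf.gz [THREADS]
annotate_compressed() {
    ann=$1
    target=$2
    out=$3
    threads=${4:-4}   # arbitrary default, tune to the machine
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run without touching any files
        echo "bcftools annotate -a $ann -c ID -Oz --threads $threads -o $out $target"
    else
        bcftools annotate -a "$ann" -c ID -Oz --threads "$threads" -o "$out" "$target"
    fi
}
```

Since the threads are used only for output compression, any gain is likely to be limited to the time spent writing the file.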

ADD REPLY • link 5.2 years ago by Pierre Lindenbaum 166k

OK, thanks for the clarification. Just to make sure I understand: say I have two BCF files, a.bcf and b.bcf, and I run:

bcftools merge --threads 64 a.bcf b.bcf -Ob > merged.bcf

Am I right in understanding that a.bcf and b.bcf are decompressed with one thread, then merged, and that the 64 threads are used only to compress the merged stream into merged.bcf?

ADD REPLY • link 5.2 years ago by curious ▴ 890

Good remark. I don't know.

ADD REPLY • link 5.2 years ago by Pierre Lindenbaum 166k