speed up bcftools annotate command
6.2 years ago

I'm annotating whole genome VCFs with dbSNP identifiers with the following command:

bcftools annotate -a "indexed VCF.gz" -c ID "target indexed VCF.gz" > "output.vcf"

This works perfectly, but takes about 15 minutes on our server (all data on SSD-based partitions). For small target VCFs, for example ones covering only a target region on chrY, it helps to restrict the command with -R regions.txt, but this does not help for larger genome-wide VCFs.

Is there any way to speed up the annotation process?

next-gen snp bcftools • 6.8k views

Hi, did you try GNU parallel? Here are some examples. I have had good experiences speeding up all my processing tasks with parallel.


Split the analysis per chromosome.
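To make the per-chromosome idea concrete, here is a rough sketch that combines it with the GNU parallel suggestion above. The file names (dbsnp.vcf.gz, target.vcf.gz, annot.*.vcf.gz), the helper name per_chrom_cmds and the chromosome names are placeholders I made up; use whatever contig names your VCF header actually contains. The function only prints the commands, so nothing is run until you pipe them into parallel.

```shell
#!/usr/bin/env bash
# Sketch only: file names and chromosome names are placeholders, not from the original post.
dbsnp="dbsnp.vcf.gz"      # bgzipped, indexed annotation VCF (assumed name)
target="target.vcf.gz"    # bgzipped, indexed target VCF (assumed name)

# Print one annotate command per chromosome; -r restricts each run to that contig.
per_chrom_cmds() {
  local chrom
  for chrom in "$@"; do
    printf 'bcftools annotate -a %s -c ID -r %s -Oz -o annot.%s.vcf.gz %s\n' \
      "$dbsnp" "$chrom" "$chrom" "$target"
  done
}

# Preview the commands for three example chromosomes.
per_chrom_cmds chr1 chr2 chrY

# To actually run them (four jobs at a time) and stitch the pieces back together:
#   per_chrom_cmds chr1 chr2 chrY | parallel -j 4
#   bcftools concat -Oz -o annotated.vcf.gz annot.chr1.vcf.gz annot.chr2.vcf.gz annot.chrY.vcf.gz
```

Whether this beats the single 15 minute run depends on how many cores and how much I/O bandwidth the server has, so it is worth timing on one or two chromosomes first.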


Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Hi, using the BCF format is quicker for this kind of command.
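One way to read this suggestion, as a sketch: convert both the annotation file and the target to indexed BCF once, then annotate BCF to BCF. The comment does not say which files it means, so this is my interpretation. The file names and the helpers run and to_bcf are made up for illustration, and I am assuming your bcftools version accepts an indexed BCF for -a. DRY_RUN defaults to 1 here, so the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Convert a bgzipped VCF to BCF and index it; x.vcf.gz becomes x.bcf.
to_bcf() {
  local src="$1"
  local dst="${src%.vcf.gz}.bcf"
  run bcftools view -Ob -o "$dst" "$src"
  run bcftools index "$dst"
}

# Placeholder file names, not from the original post.
to_bcf dbsnp.vcf.gz
to_bcf target.vcf.gz

# Annotate BCF against BCF and write BCF output.
run bcftools annotate -a dbsnp.bcf -c ID -Ob -o annotated.bcf target.bcf
```

The conversion itself costs time, so this probably only pays off when the same dbSNP file is reused across many target VCFs.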


Where do you recommend using the BCF format: as the output format, as the annotation file, or as the input file? How does that compare against using a VCF in those places, and what about the VCF <-> BCF conversion times? Please give us more details on your suggestion.


annotate has an option for multithreading: --threads

http://samtools.github.io/bcftools/bcftools.html#annotate


--threads INT Use multithreading with INT worker threads. The option is currently used only for the compression of the output stream, only when --output-type is b or z. Default: 0.

As the OP generates uncompressed VCF, this option would have no effect.
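Following the documentation quoted above, --threads should only start to matter once the output is compressed. A minimal sketch of that variant, with placeholder file names and a made-up helper annotate_compressed; the thread count of 4 is arbitrary. As written it only prints the commands (DRY_RUN defaults to 1); set DRY_RUN=0 to run them.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Annotate and write bgzipped VCF (-Oz) so the worker threads have compression to do.
# $1 = number of compression threads (defaults to 4, an arbitrary choice).
annotate_compressed() {
  local threads="${1:-4}"
  run bcftools annotate -a dbsnp.vcf.gz -c ID --threads "$threads" \
    -Oz -o output.vcf.gz target.vcf.gz
  run bcftools index -t output.vcf.gz   # tabix index for the compressed output
}

annotate_compressed 4
```

Since the threads only help with output compression, this may not shorten the annotation step itself, but it does leave you with a compressed, indexed file.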


OK, thanks for the clarification. Just to make sure I understand, say I have two BCFs:

a.bcf and b.bcf

and I run:

bcftools merge --threads 64 a.bcf b.bcf -Ob > merged.bcf

Am I right in understanding that a.bcf and b.bcf are decompressed with one thread, then merged, and that the 64 threads are used only to compress the merged stream into merged.bcf?


Good remark. I don't know.
