6.1 years ago
fenrir.sivar
I'm annotating whole genome VCFs with dbSNP identifiers with the following command:
bcftools annotate -a "indexed VCF.gz" -c ID "target indexed VCF.gz" > "output.vcf"
This works perfectly but takes about 15 minutes on our server (all data are on SSD-based partitions). For small target VCFs, for example one containing only a target region on chrY, it helps to restrict the command to that region with -R regions.txt, but this does not help for larger genome-wide VCFs.
Is there any way to speed up the annotation process?
Hi, did you try GNU parallel? Here are some examples. I have had good experiences speeding up all my processing tasks with parallel.
Split the analysis per chromosome.
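A minimal sketch of the per-chromosome approach, assuming both VCFs are bgzip-compressed and indexed. The function name annotate_by_chrom and the part.*.vcf.gz file names are made up for illustration. It uses plain background jobs rather than GNU parallel, so it starts one job per chromosome at once; on a machine with few cores you would want to cap that, for instance by feeding the chromosome list to parallel instead.

```shell
# Annotate each chromosome in its own background job, then concatenate.
# Usage: annotate_by_chrom ANNOTATION TARGET OUTPUT CHROM...
# POSIX sh; the file names used for the parts are placeholders.
annotate_by_chrom() {
    ann=$1; target=$2; out=$3
    shift 3
    for chrom in "$@"; do
        part="part.${chrom}.vcf.gz"
        # -r restricts this run to one chromosome (needs indexed inputs).
        bcftools annotate -a "$ann" -c ID -r "$chrom" -Oz \
            -o "$part" "$target" &
        # Rotate the argument list: drop the chromosome name at the front
        # and append its part file, so "$@" ends up holding the part files.
        set -- "$@" "$part"; shift
    done
    wait   # block until every background job has finished
    # Concatenate the parts in the order the chromosomes were given.
    bcftools concat -Oz -o "$out" "$@"
}
```

It could be called as `annotate_by_chrom dbsnp.vcf.gz target.vcf.gz annotated.vcf.gz $(tabix -l target.vcf.gz)`, where `tabix -l` lists the chromosome names in the index. Note that `wait` does not report failed jobs, so a real script should check the exit status of each part.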
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Hi, using the BCF format is quicker for that kind of command.
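For anyone who wants to try this, a short sketch of converting the target file to BCF first. The function name to_bcf and the file names are placeholders, and whether the conversion time pays off for a single annotate run is something to measure on your own data.

```shell
# Convert a VCF (plain or bgzip-compressed) to indexed BCF.
# Usage: to_bcf INPUT.vcf.gz OUTPUT.bcf
to_bcf() {
    # -Ob writes compressed BCF; index only if the conversion succeeded.
    bcftools view -Ob -o "$2" "$1" && bcftools index "$2"
}
```

After `to_bcf target.vcf.gz target.bcf`, the annotate command from the question can be pointed at target.bcf, with -Ob added if the output should be BCF as well.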
Where do you recommend using the BCF format? As the output format, or as the annotation file or as the input file? How does that compare against using a VCF in those places, and what about the VCF <-> BCF (or vice versa) conversion times? Please give us more details on your suggestions.
Annotate has an option for multithreading, --threads:
http://samtools.github.io/bcftools/bcftools.html#annotate
As OP generates uncompressed VCF, this option would be useless.
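The bcftools manual describes --threads as applying only to compression of the output stream, and only when the output type is b or z. So for the extra threads to do anything, the command has to write compressed output. A sketch, with annotate_compressed as a made-up wrapper name and a default of 4 threads chosen arbitrarily:

```shell
# Write bgzip-compressed VCF so that --threads has compression work to do.
# Usage: annotate_compressed ANNOTATION TARGET OUTPUT [THREADS]
annotate_compressed() {
    # -Oz selects compressed VCF output; -o names the output file.
    bcftools annotate -a "$1" -c ID -Oz --threads "${4:-4}" -o "$3" "$2"
}
```

For example, `annotate_compressed dbsnp.vcf.gz target.vcf.gz output.vcf.gz 8`. This only speeds up the compression step, so the gain depends on how much of the 15 minutes is spent writing output.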
Ok, thanks for the clarification. So just to make sure I understand: if I have, say, 2 BCFs, a.bcf and b.bcf, and I run:
bcftools merge --threads 64 a.bcf b.bcf -Ob > merged.bcf
Am I right in understanding that a.bcf and b.bcf are decompressed with one thread, then merged, and the 64 threads are used to convert the stream of that merged BCF into merged.bcf?
Good remark. I don't know.