When I need to merge a largish number (>1000) of VCF files, I find it useful to follow a hierarchical strategy: split the initial set of VCF files into many smaller sets, merge each of these smaller sets in parallel (using bcftools merge), and then merge the resulting intermediate merged VCFs.
(Of course, the above description is the simplest ("one-level") version of this idea, but one can carry it out at multiple levels, recursively, if the initial number of files is very large.)
What makes this strategy effective is that the many smaller merge jobs can run in parallel on a compute cluster. (A similar idea underlies the merge-sort algorithm.)
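For concreteness, the one-level version looks roughly like this (a simplified sketch; the batch size and file names are placeholders, and in practice each batch merge would be submitted as a separate cluster job):

    # split the master list of (indexed) VCFs into batches of, say, 100
    split -l 100 all_vcfs.list batch_

    # merge each batch; on a cluster these iterations run as parallel jobs
    for b in batch_*; do
        bcftools merge -l "$b" -Oz -o merged_"$b".vcf.gz
        bcftools index -t merged_"$b".vcf.gz
    done

    # merge the intermediate merged VCFs into the final file
    ls merged_batch_*.vcf.gz > intermediates.list
    bcftools merge -l intermediates.list -Oz -o final_merged.vcf.gz

    # indexing final_merged.vcf.gz is then the expensive step this question is about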
As I said, I've found this general strategy to be pretty useful when it comes to merging many VCF files, but then I run into the problem of generating the index file for the (typically rather huge) resulting merged VCF. This can be a pretty time-consuming affair in its own right.
Therefore, it would be great if one could apply the same "hierarchical merge" idea to the problem of indexing the resulting VCF. This would require a procedure that generates the index file for a merged VCF from the index files of the VCF files that went into the merge. Furthermore (and this is key!), such a procedure would have to be significantly faster than generating the index directly from the merged VCF.
Does anyone know if the above "index merge" strategy is possible using readily-available software?
Alternatively, is there some other way to accelerate the creation of an index file for a VCF file that somehow takes advantage of previously computed index files?
When merging bgzipped files, you'll alter the compression efficiency (same data in the same region = better compression), hence you'll change the virtual bgzip offsets, and I don't think there is any way to 'predict' the offsets of the merged index.
Ah! And don't use VCF but BCF, and uncompressed BCF for the intermediate files.
That's a very interesting idea, and I'll definitely try it out, but it looks to me more like an optimization for the hierarchical merge than for the indexing. Please correct me if I'm wrong.
Using BCF speeds up the parsing, hence building the index will be faster.
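e.g. something like this (rough sketch, names are placeholders; I'm writing compressed BCF with -Ob here so each intermediate can still be indexed before the next round of merging):

    # intermediate merges: write BCF instead of VCF.gz; BCF is much faster
    # to parse, so both the indexing and the next merge round are quicker
    bcftools merge -l batch_01.list -Ob -o merged_01.bcf
    bcftools index merged_01.bcf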
OK, now I'm confused. As far as I can tell, the above does not include any way to use the indices of the intermediate files to speed up the indexing of the final merged VCF. If this is correct, then there would be no point in building the indices for the intermediate files. (Hence, the fact that intermediate files in BCF format would be faster to index would not play a role.)
I can see that if the final merge produces a BCF file, rather than a (compressed) VCF file, then, as you just wrote, the indexing of this BCF file will be faster. But, if we want our final merged data to be in the form of a compressed VCF file, will the index generated from the BCF file remain valid for the compressed VCF?
(Also, even if the answer to my last question is "yes", one would also need to verify that the time needed to generate a compressed VCF from the final BCF is substantially less than the time one saves by indexing the uncompressed BCF instead of the compressed VCF.)
If your downstream tools can handle BCF (e.g. bcftools, but not GATK), and if you need SPEED and something compact (when there are many samples), then BCF should be your final format of choice. Otherwise, of course, you have to switch to VCF.gz, and the final indexing will be slower than with BCF.
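i.e. something like (sketch, file names are placeholders):

    # final merge as BCF: fast to write and fast to index
    bcftools merge -l intermediates.list -Ob -o final.bcf
    bcftools index final.bcf

    # if the deliverable must be VCF.gz, convert and index again;
    # the BCF index does not carry over, the VCF.gz needs its own index
    bcftools view -Oz -o final.vcf.gz final.bcf
    bcftools index -t final.vcf.gz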
OK, thanks, that is good to know.
As it happens, in this case, for institutional, rather than technical, reasons, we have to stick with VCF.gz for the final merged file.