I want to get the first and last positions in a VCF for chunking purposes.
The first position is easy to get:
bcftools query -f '%CHROM\t%POS\n' my_vcf | head -n 1
I realize I could stream the whole VCF to find the last position, but I was hoping there was a faster way to get it.
The only thing I can think of is some kind of binary-search-like approach, but there has to be a better way.
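For reference, the streaming fallback mentioned above might look like the sketch below. It assumes a gzip- or bgzip-compressed VCF and reads every line once, so it is the slow baseline rather than the faster method asked for. The function name `first_and_last_position` is made up for illustration.

```python
import gzip


def first_and_last_position(path):
    """Stream a (b)gzipped VCF once and return the first and last (chrom, pos)."""
    first = last = None
    # gzip.open also reads bgzip output, since BGZF is a series of gzip members.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos = line.split("\t", 2)[:2]
            record = (chrom, int(pos))
            if first is None:
                first = record
            last = record
    return first, last
```

Calling it on a path such as `my.vcf.gz` (a placeholder) returns two `(chrom, pos)` tuples, or `(None, None)` if the file has no records.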
Maybe I didn't understand the question, but if the VCF is sorted, why not just use tail?
Getting to the tail of a 10 GB compressed file will take quite a while and means unpacking the entire file.
As for the OP's question, perhaps reading the index file with a custom program could tell you the last coordinate, which you could then access directly.
Do you have any idea how to do this? I looked into this approach even before posting. I can't cat or zcat the index and get anything readable, and I can't find any helpful info in the 'bcftools index' or 'tabix' options.
The descriptions for most formats can be found here:
https://github.com/samtools/hts-specs
For example, the default index format for VCF is CSI, I believe (at least with bcftools index; tabix writes TBI by default).
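As a rough illustration of what reading the index with a custom program could look like, here is a sketch that parses a tabix `.tbi` file following the layout in the hts-specs tabix document. It assumes a TBI index (for example from `tabix -p vcf` or `bcftools index -t`), not CSI, which has a different layout and no linear index. The function name `tbi_extents` is hypothetical. As I understand the spec, each contig's linear index has one entry per 16384 bp window up to the last window touched by any record, so the entry count gives an upper bound on the highest coordinate covered, not the exact last position.

```python
import gzip
import struct


def tbi_extents(tbi_path):
    """Return {contig: upper bound on the highest indexed coordinate} from a .tbi file."""
    # The index itself is BGZF-compressed, which the gzip module can read.
    with gzip.open(tbi_path, "rb") as fh:
        data = fh.read()
    if data[:4] != b"TBI\x01":
        raise ValueError("not a tabix (.tbi) index")
    # Header fields: n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm
    n_ref, _fmt, _seq, _beg, _end, _meta, _skip, l_nm = struct.unpack_from("<8i", data, 4)
    pos = 36
    # Contig names are stored as concatenated NUL-terminated strings.
    names = data[pos:pos + l_nm].split(b"\x00")[:n_ref]
    pos += l_nm
    extents = {}
    for name in names:
        (n_bin,) = struct.unpack_from("<i", data, pos)
        pos += 4
        for _ in range(n_bin):
            _bin, n_chunk = struct.unpack_from("<Ii", data, pos)
            pos += 8 + 16 * n_chunk  # skip the (begin, end) virtual offsets
        (n_intv,) = struct.unpack_from("<i", data, pos)
        pos += 4 + 8 * n_intv  # skip the linear index offsets
        # Each linear-index window spans 16384 bp (2**14).
        extents[name.decode()] = n_intv * 16384
    return extents
```

The last record for a contig then has to lie at or below the returned bound, so a region query starting a window or two before it (and piped to tail) should be cheap, unless an earlier long record stretched the index past the final record.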
While poking around this way I found this, which already does my chunking to boot:
https://pystatgen.github.io/sgkit/latest/vcf.html#partitioning
I have been playing with this problem a bit in Python, and I concluded that you would need a bgzip-aware solution for it to work. It does not seem like sgkit is such a tool, though.
When using the built-in gzip module, Python will decompress the entire stream; thus, even if you figured out the correct offset, it can't jump to that location without decompressing everything before it.
Alas, I was unable to locate a bgzip library in Python that also allows seeking a file to an offset.
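One possible way around that, as a rough standard-library sketch: a bgzip file is a series of independent gzip blocks of at most 64 KiB, each recording its own size in the gzip extra field (the BSIZE value described in the BGZF section of the SAM spec in hts-specs). So you can read only the end of the file, find where whole blocks begin, and decompress just those. The names `last_vcf_record` and `tail_bytes` are made up for illustration. The sketch assumes the file was written by bgzip and that the last record fits inside the window that is read.

```python
import os
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set


def _block_size(buf, start):
    """Total size of the BGZF block starting at `start`, or None if it is not one."""
    if buf[start:start + 4] != BGZF_MAGIC or start + 12 > len(buf):
        return None
    (xlen,) = struct.unpack_from("<H", buf, start + 10)
    pos, end = start + 12, start + 12 + xlen
    if end > len(buf):
        return None
    while pos + 6 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if (si1, si2, slen) == (66, 67, 2):  # the 'B','C' subfield holds BSIZE
            return struct.unpack_from("<H", buf, pos + 4)[0] + 1
        pos += 4 + slen
    return None


def _blocks_to_end(buf, start):
    """List of (offset, size) if whole blocks tile buf[start:] exactly, else None."""
    blocks, pos = [], start
    while pos < len(buf):
        size = _block_size(buf, pos)
        if size is None or pos + size > len(buf):
            return None
        blocks.append((pos, size))
        pos += size
    return blocks


def last_vcf_record(path, tail_bytes=1 << 20):
    """Return (chrom, pos) of the last record by decompressing only the file's tail."""
    with open(path, "rb") as fh:
        file_size = fh.seek(0, os.SEEK_END)
        window_start = max(0, file_size - tail_bytes)
        fh.seek(window_start)
        buf = fh.read()
    start = buf.find(BGZF_MAGIC)
    while start != -1:
        blocks = _blocks_to_end(buf, start)
        if blocks:
            try:
                # wbits=31 makes zlib parse the gzip wrapper and verify the CRC.
                text = b"".join(zlib.decompress(buf[o:o + n], wbits=31) for o, n in blocks)
            except zlib.error:
                text = None  # magic bytes matched inside compressed data by chance
            if text is not None:
                lines = text.split(b"\n")
                if window_start + start > 0:
                    lines = lines[1:]  # the first line may be cut off mid-record
                records = [ln for ln in lines if ln and not ln.startswith(b"#")]
                if not records:
                    raise ValueError("no complete record in tail; try a larger tail_bytes")
                chrom, pos = records[-1].split(b"\t", 2)[:2]
                return chrom.decode(), int(pos)
        start = buf.find(BGZF_MAGIC, start + 1)
    raise ValueError("no BGZF blocks found; is the file bgzip-compressed?")
```

With the default window this reads at most 1 MiB from disk however large the file is. For VCFs with very long lines (many samples), `tail_bytes` would need to be raised.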
That's an interesting question, but I'm not sure a tabix index is relevant to what position happens to be at the end of the file. Unless it's position-sorted, the VCF could end with chr1:123, for instance, instead of chrM:16569. The tabix index doesn't really care.
For tabix to work, the file has to be position-sorted and grouped by chromosome (usually we sort by both), as described in:
http://www.htslib.org/doc/tabix.html