Question

fast way to get last position in a large indexed VCF?

2

Entering edit mode

2.2 years ago

curious ▴ 810

I want to get first and last position in a VCF for chunking purposes

First position is easy to get:

bcftools query -f '%CHROM\t%POS\n' my_vcf  | head -n 1

I realize I could stream the whole vcf to find last position, but I was hoping there was some faster way to get it.

The only thing I can think of is some kind of binary search like approach, but there has to be a better way

bcftools • 2.9k views

ADD COMMENT • link updated 2.2 years ago by Istvan Albert 101k • written 2.2 years ago by curious ▴ 810

0

Entering edit mode

Maybe I didn't understand the question, but if the VCF is sorted, why not just use tail?

ADD REPLY • link 2.2 years ago by liorglic ★ 1.4k

1

Entering edit mode

getting to the tail of a 10GB compressed file will take quite a bit and means unpacking the entire file.

As for the OP, perhaps reading the index file with a custom program might be able to tell you the last coordinate then you access that directly.

ADD REPLY • link 2.2 years ago by Istvan Albert 101k

0

Entering edit mode

Do you any idea how to do this, I looked for this approach even before positing this. I cant cat or zcat the index and get anything readable and I cant find any helpful info with 'bcftools index' or 'tabix' options

ADD REPLY • link 2.2 years ago by curious ▴ 810

0

Entering edit mode

The descriptions for most formats can be found here:

https://github.com/samtools/hts-specs

For example, the default indexing for VCF is CSI, I believe.

ADD REPLY • link 2.2 years ago by Istvan Albert 101k

0

Entering edit mode

when poking around this way I found this, which is already doing my chunking to boot:

from sgkit.io.vcf import partition_into_regions
partition_into_regions("CEUTrio.20.21.gatk3.4.g.vcf.bgz", num_parts=10)

['20:1-10108928', '20:10108929-10207232', '20:10207233-', '21:1-10027008', '21:10027009-10043392', '21:10043393-10108928', '21:10108929-10141696', '21:10141697-10174464', '21:10174465-10190848', '21:10190849-10207232', '21:10207233-']

https://pystatgen.github.io/sgkit/latest/vcf.html#partitioning

ADD REPLY • link 2.2 years ago by curious ▴ 810

0

Entering edit mode

I have been playing with this problem a bit in Python, and I concluded that you would need a bgzip aware solution for it to work. It does not seem like sgkit is such a tool though.

When using the builtin gzip stream Python will proceed to uncompress the entire stream thus, even if you figured out the correct offset it won't be able to jump to that location without unzipping everything to reach the end.

Alas, I was unable to locate a bgzip library in Python that also allows seeking a file to an offset.

ADD REPLY • link 2.2 years ago by Istvan Albert 101k

0

Entering edit mode

That's an interesting question but I'm not sure a tabix index is relevant to what position happens to be at the end of the file. Unless it's position sorted, the vcf could end with chr1:123, for instance, instead of chrM:16569. The tabix index doesn't really care.

ADD REPLY • link 2.2 years ago by Jeremy Leipzig 22k

1

Entering edit mode

for tabix to work, the file has to be position sorted, and grouped by chromosome (usually we sort by both):

(grep "^#" in.gff; grep -v "^#" in.gff | sort -t"`printf '\t'`" -k1,1 -k4,4n) | bgzip > sorted.gff.gz;

tabix -p gff sorted.gff.gz;

tabix sorted.gff.gz chr1:10,000,000-20,000,000;

as seen in: http://www.htslib.org/doc/tabix.html

ADD REPLY • link 2.2 years ago by Istvan Albert 101k

score 1 · Answer 1 · 2022-09-11

1

Entering edit mode

2.2 years ago

Istvan Albert 101k

If the file is uncompressed (or if you find a bgzip compatible stream.seek for your programming language) you can jump near the end of the file quite efficiently:

import os

fname = "subset.vcf"
stream = open(fname)
size = os.path.getsize(fname)

# Move back a few kb
shift = 10 * 1024
stream.seek(size-shift)

for line in stream:
    pass

print(line.split())

it will be pretty fast:

 time python wind.py
['NC_000001.10', '25202832', 'rs1640276932', 'TGGGGTGGGGCGGGGCGGGGAGGGGC', 'T', '.', '.', 'RS=1640276932;SSR=0;VC=INDEL;GNO;FREQ=dbGaP_PopFreq:1,0']

real    0m0.043s
user    0m0.016s
sys     0m0.016s

versus streaming the whole file:

time cat subset.vcf| tail -1
NC_000001.10    25202832        rs1640276932    TGGGGTGGGGCGGGGCGGGGAGGGGC      T       .       .       RS=1640276932;SSR=0;VC=INDEL;GNO;FREQ=dbGaP_PopFreq:1,0

real    0m1.492s
user    0m0.313s
sys     0m1.578s

ADD COMMENT • link 2.2 years ago by Istvan Albert 101k

0

Entering edit mode

How does running bctools query -f'%CHROM %POS\n' | tail -n1 compare?

ADD REPLY • link 2.2 years ago by 4galaxy77 2.9k

0

Entering edit mode

time bcftools query -f'%CHROM %POS\n' subset.vcf | tail -n1
NC_000001.10 25202832

real    0m29.221s
user    0m6.281s
sys     0m23.922s

for the record, the file has 9 million variants

cat subset.vcf | wc -l
9199371

ADD REPLY • link 2.2 years ago by Istvan Albert 101k