Biopython byte positions are not compatible with bgzip
3.8 years ago
b10hazard • 0

I have a blocked gzip (BGZF) file where the data I want lies between two byte offsets, which I determined using the fh.tell() function of Biopython's BgzfReader. I can easily access this data using this code:

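from Bio.bgzf import BgzfReader

bgzip_path = "/path/to/bgzip"  # path to the BGZF-compressed file

# Offsets as reported by fh.tell() on a BgzfReader
start = 75191629497
stop = 75191634445

with BgzfReader(bgzip_path) as fh_reads:
    fh_reads.seek(start)
    for line in fh_reads:
        # Stop once the reader has moved past the end offset
        if fh_reads.tell() > stop:
            break
        print(line, end="")  # each line already ends with a newline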

The code above works perfectly and prints out the expected data.

My problem is that these offsets do not work with other htslib utilities. For example, the bgzip command-line utility has a -b option for the start byte offset and a -s option for the size of the data you want to decompress. Using the example above, the size would be 75191634445 - 75191629497, or 4948 bytes. So I tried the following:

bgzip -c -b 75191629497 -s 4948 /path/to/bgzip

This command doesn't work. I get a "Segmentation fault (core dumped)" error. My question is: can the byte positions generated and used by Biopython's BgzfReader be used with other htslib-based applications? If so, how would I do this? Thanks.

htslib biopython bgzip
3.8 years ago
Ahill ★ 2.0k

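From the bgzip man page, it looks like the -b option of the bgzip command-line tool expects a zero-based uncompressed offset. Bio.bgzf, however, uses virtual offsets, which are not the same thing: a BGZF virtual offset packs the position of a block's start in the compressed file into its upper 48 bits and the position within that block's decompressed data into its lower 16 bits. The Bio.bgzf documentation looks like it may tell you how to convert between virtual and uncompressed offsets.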
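To make the difference concrete, here is a minimal sketch of the conversion, not a tested recipe for your file. It assumes the block tuples have the shape that Bio.bgzf.BgzfBlocks yields (raw start, raw length, data start, data length); the helper names virtual_to_uncompressed and uncompressed_offset_from_file are made up for illustration. Bio.bgzf also provides its own split_virtual_offset, which does the same bit arithmetic as the first function here.

```python
def split_virtual_offset(virtual_offset):
    """Split a BGZF virtual offset into (block start, offset within block).

    The block start is a position in the compressed file; the within-block
    offset is a position in that block's decompressed data.
    """
    return virtual_offset >> 16, virtual_offset & 0xFFFF


def virtual_to_uncompressed(virtual_offset, blocks):
    """Convert a virtual offset to a zero-based uncompressed offset.

    `blocks` is an iterable of (raw_start, raw_length, data_start,
    data_length) tuples, the shape assumed for Bio.bgzf.BgzfBlocks output.
    """
    block_start, within_block = split_virtual_offset(virtual_offset)
    for raw_start, _raw_length, data_start, _data_length in blocks:
        if raw_start == block_start:
            # data_start is the uncompressed offset where this block begins
            return data_start + within_block
    raise ValueError("no BGZF block starts at byte %d" % block_start)


def uncompressed_offset_from_file(bgzf_path, virtual_offset):
    """Hypothetical wrapper: scan the file's blocks with Biopython."""
    from Bio.bgzf import BgzfBlocks  # imported lazily; requires Biopython

    with open(bgzf_path, "rb") as handle:
        return virtual_to_uncompressed(virtual_offset, BgzfBlocks(handle))
```

For the offsets in the question, split_virtual_offset(75191629497) gives (1147333, 14009) and split_virtual_offset(75191634445) gives (1147333, 18957), so both positions sit in the same block, which begins about 1.1 MB into the compressed file. The number to try with bgzip -b would then be the value returned by uncompressed_offset_from_file for the start offset. Note that this sketch scans blocks from the beginning of the file each time, which will be slow if the target block is far in.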
