Let's say I have a BAI index, and a BAM file. I want to extract all the reads that intersect a given region on a chromosome, while reading as little data from the BAM file as possible. How can I do this ?
At first, I naively thought that I could use the linear index to simply find all the linear bins that intersect my region. But this does not appear to be enough, since there could be long reads that start before the start of my region, and end after it. Such reads intersect my region, but I won't find them using the naive approach.
I thought of using the R-Tree index as well, but it looks like I'd need to essentially read the entire BAM file in order to find all the intersecting reads -- since there's no guarantee that bin 0 (the one that spans the entire BAM file) doesn't contain the one long read that intersects my region.
Is there a better approach, or am I missing something ? I apologize for asking such a primitive question, but I'm new to BAM files, and not yet familiar with the accepted practices.
Not an answer as I haven't implemented this myself so don't know all the ins and outs.
However basically the BAI index maps ranges (produced as "bins") to file offsets. Given the R-Tree isn't binary, a single bin may have multiple start/stop points for data within it, hence the linear index too.
There is code in htslib's hts.c to do this, but it's nightmarishly complicated due to a myriad of closely related index formats. Perhaps something like IGV's code may be easier to understand. It's certainly much shorter so may be a good learning code.
https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js