I have had this problem for quite sometime now: I want to extract reads if they or their mates are mapped to a region. The default behavior of samtools when given a region is to extract reads based only on their own RefID and Position, disregarding the mate's RefID and Position. Since the mapping info of each read's mate is included on the same line in the SAM format, it's easy to write a script to do that, but samtools remains exponentially faster, I think mainly because of random-access facilitated by the BAM index rather than line-by-line processing of the SAM file. So I have decided to take some time to try and optimize read-extraction (according to my criteria) because I plan to do that a lot.
After spending some time trying to understand samtools code, I realize the issue is that the BAM index doesn't contain the mate information (which makes sense because it is not used by samtools). However, I must admit that I'm still intimidated by the code. I thought maybe someone who's worked on samtools development before could point me in the right direction. I would like to add two fields to the BAM index: mate_tid, and mate_position. Then the next step would be to extend the cases for including a read when given a region to account for the mate info, but I think that part would be easy.
PS: please also let me know if I'm completely off in my understanding of the BAM index!
Before you reengineer half of htslib, have you tried instead:
That will run faster than actually doing what you propose and be much much much easier to implement (the hashing stuff is already available in the htslib headers).
Are you saying that I do the hashing and so on for each region query? If so, I don't see how that's faster. I want to hash/index once then do gazillion region queries.
That's the trick, you don't have to do the gazillion region queries that you would have otherwise done. Since the mates will mostly cluster together, you'll do a couple region queries. This will be vastly faster than what you're proposing in your original question.