I have a set of regions in the genome (about 4 million regions, parametrized by start and end genomic cooridnates) that I want to align BAM reads into. I want to get all reads that overlap that set of coordinates.
The naive way to do this is very slow and I'd like to use interval trees. However, to use interval trees, I'd have to first index all the BAM reads into a tree, and then, for each coordinate in my set of regions, look for all the reads within the tree.
My concern is that this requires loading all of the BAM reads into memory since I have to index them into the interval tree data structure.
Is there a way around this, or a Python library that does this already? Any help would be appreciated.
Thanks.
p.s. I'd prefer to do this within Python and not use BEDTools unless there's an easy way to integrate Python with BEDTools that someone can recommend?
Can you explain why you need an interval tree? If you have an aligned BAM file, you can sort and index it, then retrieve reads in a region using 'fetch' in pysam: http://code.google.com/p/pysam/
fetch will be horribly inefficient. If you iterate through a list of coordinates and fetch the reads from each one, it will take forever.