I'm using pysam's bam fetch() method to pull reads out of a few random regions of the genome. Since fetch returns an iterator, I assumed it would be approximately as quick as the file seek operation, but it appears to be performing a substantial amount of work to pull out the first read in regions of high coverage:
In [45]: %timeit bam.fetch("chr1", 121485288, 121485289).next() 1 loops, best of 3: 3.4 s per loop In [46]: bam.count("chr1", 121485288, 121485289) Out[46]: 849385
In cases like this with high coverage, I'd like to only pull out the first thousand reads or so in order to speed up the operation. Is there any way to run this query so that it doesn't spend almost all the time (3.5 out of 5 seconds) accessing the first read? Is this a limitation of the bam format?
After a bit more exploration, this appears to be a problem with getting from the closest indexed position to the region of interest. Would be nice if someone with more knowledge could confirm that this is in fact true.
It'd be worth looking into how a pysam.IteratorRow is actually implemented. If it's done by first reading the alignments into memory and then constructing an iterator then that's likely the problem. In general, it's a better idea to stick to the htslib API where a callback function is given to
bam_fetch()
. It'll still iterate over things, but it may well be faster.This makes sense. The resolution of BAI index is 16Kbp, which means that in order to fetch reads overlapping a single position, on average one has to throw away about
8000 * coverage / (read length)
records before getting to the first overlapping one.