pysam fetch speed
0
0
Entering edit mode
10.1 years ago
noah ▴ 10

I'm using pysam's bam fetch() method to pull reads out of a few random regions of the genome. Since fetch returns an iterator, I assumed it would be approximately as quick as the file seek operation, but it appears to be performing a substantial amount of work to pull out the first read in regions of high coverage:

In [45]: %timeit bam.fetch("chr1", 121485288, 121485289).next()
1 loops, best of 3: 3.4 s per loop

In [46]: bam.count("chr1", 121485288, 121485289)
Out[46]: 849385

In cases like this with high coverage, I'd like to only pull out the first thousand reads or so in order to speed up the operation. Is there any way to run this query so that it doesn't spend almost all the time (3.5 out of 5 seconds) accessing the first read? Is this a limitation of the bam format?

pysam bam python • 4.6k views
ADD COMMENT
0
Entering edit mode

After a bit more exploration, this appears to be a problem with getting from the closest indexed position to the region of interest. Would be nice if someone with more knowledge could confirm that this is in fact true.

ADD REPLY
0
Entering edit mode

It'd be worth looking into how a pysam.IteratorRow is actually implemented. If it's done by first reading the alignments into memory and then constructing an iterator then that's likely the problem. In general, it's a better idea to stick to the htslib API where a callback function is given to bam_fetch(). It'll still iterate over things, but it may well be faster.

ADD REPLY
0
Entering edit mode

This makes sense. The resolution of BAI index is 16Kbp, which means that in order to fetch reads overlapping a single position, on average one has to throw away about 8000 * coverage / (read length) records before getting to the first overlapping one.

ADD REPLY

Login before adding your answer.

Traffic: 2215 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6