I have written a function that builds a 3D matrix from a BAM file, piling up reads over regions centred on midpoints taken from a BED file, with x being genomic position and y being the size of the sequenced template.
However, it feels heavy and I can't work out a way to trim it down. At the moment it takes each region from the BED file, fetches the relevant reads from the BAM, and builds up the matrix. I've considered reading the BAM just once and checking each read against the BED regions, but that seems like even more processing...?
Is there any way to improve this, or is it just as good as it's going to get because it's such a heavy comparison?
import pysam

def multiRegionSimpleExtract(globalX, globalY, samfile):
    # args.b is a file of "chrom:start-end" regions, one per line
    with open(args.b, 'r') as f:
        for line in f:
            b1 = line.strip().split(':')
            chrom = b1[0]
            b2 = b1[1].split('-')
            midpoint = (int(b2[0]) + int(b2[1])) // 2
            # args.e is a range integer, here saying go 500bp up and downstream
            regionRange = [midpoint - args.e, midpoint + args.e]
            for read in samfile.fetch(chrom, regionRange[0], regionRange[1]):
                pos = read.reference_start
                size = abs(read.template_length)
                # For each bp from start to end, add one to the matrix
                while pos <= read.reference_end:
                    globalX.append(pos - midpoint)
                    globalY.append(size)
                    pos += 1
At the moment I can pile up ~280 BED regions against two ~100 million read (indexed) BAM files with:
Runtime: 0:00:13.022417
But if I use the ~1900 regions that I actually want, it gets to around 2 minutes and starts becoming unwieldy:
Runtime: 0:01:55.956396
Ideally, I'd want to be able to feed it even more than that. Any suggestions appreciated.
At a certain point of random access, it's probably better to read the whole BAM file from top to bottom and look for reads that overlap regions in your BED; that has been my experience, anyway. This is particularly true if your BED regions are not sorted and you are 'bypassing' the need for a sort with random read access, but it's also true if you just end up reading more than X% of the BAM file, where X depends on how smart pysam etc. are at caching. Unsorted access could easily result in reading the whole BAM file more than once, even if none of your BED regions overlapped. If you rewrite your code to be BAM-iterated rather than BED-iterated and it turns out to be roughly the same speed, then I'd recommend using pybam and PyPy; your code should run quite a bit faster.
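A minimal sketch of that BAM-iterated approach, assuming the region list fits in memory and reusing the "chrom:start-end" format from the question (the names load_regions and bamIteratedExtract are illustrative, not from the original code):

import pysam
from bisect import bisect_left
from collections import defaultdict

def load_regions(bed_path, extend):
    # Parse "chrom:start-end" lines into sorted per-chromosome windows.
    regions = defaultdict(list)
    with open(bed_path, 'r') as f:
        for line in f:
            chrom, coords = line.strip().split(':')
            start, end = coords.split('-')
            midpoint = (int(start) + int(end)) // 2
            regions[chrom].append((midpoint - extend, midpoint + extend, midpoint))
    for windows in regions.values():
        windows.sort()
    return dict(regions)

def bamIteratedExtract(globalX, globalY, samfile, regions, extend):
    # One linear pass over the BAM instead of one fetch() per region.
    for read in samfile.fetch(until_eof=True):
        if read.reference_end is None:  # unmapped or no CIGAR
            continue
        windows = regions.get(read.reference_name)
        if not windows:
            continue
        size = abs(read.template_length)
        # Every window is 2*extend wide, so the leftmost window that can
        # overlap this read starts at reference_start - 2*extend.
        i = bisect_left(windows, (read.reference_start - 2 * extend,))
        while i < len(windows) and windows[i][0] <= read.reference_end:
            lo, hi, midpoint = windows[i]
            if hi >= read.reference_start:
                for pos in range(read.reference_start, read.reference_end + 1):
                    globalX.append(pos - midpoint)
                    globalY.append(size)
            i += 1

Whether this wins depends on how much of the BAM your regions actually touch; with only ~1900 windows, random access may still be competitive, so it's worth timing both.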
As John mentioned, at some point it's faster to go through the BAM files linearly (if you give a BED file to samtools view, you'll find that that's what it's doing). The only other way to speed this up is to use multithreading; we do that with BAM files in deepTools to speed things up. BTW, I expect you can also speed things up by getting rid of the while loop near the end.
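On the while loop: the inner loop only appends a run of consecutive positions, so it can be replaced with bulk extend() calls that run at C speed instead of one Python iteration per base. A sketch of the replacement for the body of the fetch loop:

size = abs(read.template_length)
span = read.reference_end - read.reference_start + 1
# Bulk-append the whole read span in two C-level calls.
globalX.extend(range(read.reference_start - midpoint,
                     read.reference_end - midpoint + 1))
globalY.extend([size] * span)

And a hedged sketch of the parallel idea, farming one fetch out per region with multiprocessing; pileup_region is a hypothetical helper, each worker opens its own AlignmentFile because pysam handles can't be shared across processes, and bam_path, midpoints, and extend are assumed to be defined elsewhere:

from multiprocessing import Pool
import pysam

def pileup_region(task):
    bam_path, chrom, lo, hi, midpoint = task
    xs, ys = [], []
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam.fetch(chrom, lo, hi):
            if read.reference_end is None:
                continue
            size = abs(read.template_length)
            xs.extend(range(read.reference_start - midpoint,
                            read.reference_end - midpoint + 1))
            ys.extend([size] * (read.reference_end - read.reference_start + 1))
    return xs, ys

# On platforms that spawn rather than fork, wrap this in
# if __name__ == '__main__': before creating the Pool.
tasks = [(bam_path, chrom, m - extend, m + extend, m) for chrom, m in midpoints]
with Pool(processes=4) as pool:
    for xs, ys in pool.map(pileup_region, tasks):
        globalX.extend(xs)
        globalY.extend(ys)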