I have written a function that builds a 3D matrix from a BAM file, piling up reads over regions centred on midpoints taken from a BED file, with x being genomic position and y being the size of the sequenced template.
However, it feels heavy and I can't work out a way to trim it down. At the moment it takes each region from the BED file, fetches the relevant reads from the BAM, and builds up the matrix. I've considered reading the BAM just once and checking each read against the BED regions, but that seems like even more processing...?
Is there any way to improve this, or is it just as good as it's going to get because it's such a heavy comparison?
import pysam

def multiRegionSimpleExtract(globalX, globalY, samfile):
    # args.b is a file of "chrom:start-end" regions, one per line
    with open(args.b, 'r') as f:
        for line in f:
            b1 = line.strip().split(':')
            chrom = b1[0]
            b2 = b1[1].split('-')
            midpoint = (int(b2[0]) + int(b2[1])) // 2
            # args.e is a range integer, here saying go 500bp up and downstream
            regionRange = [midpoint - args.e, midpoint + args.e]
            for read in samfile.fetch(chrom, regionRange[0], regionRange[1]):
                pos = read.reference_start
                size = abs(read.template_length)
                # For each bp from start to end, add one to the matrix
                while pos <= read.reference_end:
                    globalX.append(pos - midpoint)
                    globalY.append(size)
                    pos += 1
At the moment I can pile up ~280 BED regions against two ~100 million read (indexed) BAM files with:
Runtime: 0:00:13.022417
But if I use the ~1900 regions that I actually want, it gets to around 2 minutes and starts becoming unwieldy:
Runtime: 0:01:55.956396
Ideally, I'd want to be able to feed it even more than that. Any suggestions appreciated.
At a certain point of random access, it's probably better to read the whole BAM file from top to bottom and look for reads that overlap regions in your BED; that has been my experience, anyway. This is particularly true if your BED regions are not sorted and you are 'bypassing' the need for a sort with random read access, but it's also true if you just end up reading more than X% of the BAM file, where X depends on how smart pysam etc. are at caching. Unsorted access could easily result in reading the whole BAM file more than once, even if none of your BED regions overlapped. If you rewrite your code to be BAM-iterated rather than BED-iterated and it turns out to be roughly the same speed, then I'd recommend using pybam and PyPy; your code should run quite a bit faster.
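A minimal sketch of that BAM-iterated approach, assuming the region list fits in memory and reusing the "chrom:start-end" format from the question (the names load_regions and bamIteratedExtract are illustrative, not from the original code):

import pysam
from bisect import bisect_left
from collections import defaultdict

def load_regions(bed_path, extend):
    # Parse "chrom:start-end" lines into sorted per-chromosome windows.
    regions = defaultdict(list)
    with open(bed_path, 'r') as f:
        for line in f:
            chrom, coords = line.strip().split(':')
            start, end = coords.split('-')
            midpoint = (int(start) + int(end)) // 2
            regions[chrom].append((midpoint - extend, midpoint + extend, midpoint))
    for windows in regions.values():
        windows.sort()
    return dict(regions)

def bamIteratedExtract(globalX, globalY, samfile, regions, extend):
    # One linear pass over the BAM instead of one fetch() per region.
    for read in samfile.fetch(until_eof=True):
        if read.reference_end is None:  # unmapped or no CIGAR
            continue
        windows = regions.get(read.reference_name)
        if not windows:
            continue
        size = abs(read.template_length)
        # Every window is 2*extend wide, so the leftmost window that can
        # overlap this read starts at reference_start - 2*extend.
        i = bisect_left(windows, (read.reference_start - 2 * extend,))
        while i < len(windows) and windows[i][0] <= read.reference_end:
            lo, hi, midpoint = windows[i]
            if hi >= read.reference_start:
                for pos in range(read.reference_start, read.reference_end + 1):
                    globalX.append(pos - midpoint)
                    globalY.append(size)
            i += 1

Whether this wins depends on how much of the BAM your regions actually touch; with only ~1900 windows, random access may still be competitive, so it's worth timing both.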
As John mentioned, at some point it's faster to go through the BAM files linearly (if you give a BED file to samtools view, you'll find that that's what it's doing). The only other way to speed this up is to use multithreading; we do that with BAM files in deepTools to speed things up. BTW, I expect you can also speed things up by getting rid of the while loop near the end.
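On the while loop: the inner loop only appends a run of consecutive positions, so it can be replaced with bulk extend() calls that run at C speed instead of one Python iteration per base. A sketch of the replacement for the body of the fetch loop:

size = abs(read.template_length)
span = read.reference_end - read.reference_start + 1
# Bulk-append the whole read span in two C-level calls.
globalX.extend(range(read.reference_start - midpoint,
                     read.reference_end - midpoint + 1))
globalY.extend([size] * span)

And a hedged sketch of the parallel idea, farming one fetch out per region with multiprocessing; pileup_region is a hypothetical helper, each worker opens its own AlignmentFile because pysam handles can't be shared across processes, and bam_path, midpoints, and extend are assumed to be defined elsewhere:

from multiprocessing import Pool
import pysam

def pileup_region(task):
    bam_path, chrom, lo, hi, midpoint = task
    xs, ys = [], []
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam.fetch(chrom, lo, hi):
            if read.reference_end is None:
                continue
            size = abs(read.template_length)
            xs.extend(range(read.reference_start - midpoint,
                            read.reference_end - midpoint + 1))
            ys.extend([size] * (read.reference_end - read.reference_start + 1))
    return xs, ys

# On platforms that spawn rather than fork, wrap this in
# if __name__ == '__main__': before creating the Pool.
tasks = [(bam_path, chrom, m - extend, m + extend, m) for chrom, m in midpoints]
with Pool(processes=4) as pool:
    for xs, ys in pool.map(pileup_region, tasks):
        globalX.extend(xs)
        globalY.extend(ys)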