pysam + multiple_iterators = TypeError during multiprocessing
7.2 years ago

I would like to parse a BAM file in parallel using pysam and multiple_iterators.

Here is my code:

import pysam
import sys
from multiprocessing import Pool
import time
def countReads(chrom,Bam):
    count=0
    #Itr = Bam.fetch(str(chrom),multiple_iterators=False)
    Itr = Bam.fetch(str(chrom),multiple_iterators=True)
    for Aln in Itr: count+=1

if __name__ == '__main__':
    start = time.time()
    chroms=[x+1 for x in range(22)]
    cpu=6
    BAM = sys.argv[1]
    bamfh = pysam.AlignmentFile(BAM)
    pool = Pool(processes=cpu)
    for x in range(len(chroms)):
        pool.apply_async(countReads,(chroms[x],bamfh,))
        #countReads(chroms[x],bamfh)
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)

I get this error when I run it:

TypeError: _open() takes at least 1 positional argument (0 given)

It also prints a long series of other errors. Can anyone help me use multiprocessing to read a BAM file in parallel with pysam?

Thanks

7.1 years ago

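Fixed it. I was following a blog post that was wrong. The fix is to pass the BAM path to each worker and open the file inside the worker, instead of passing an already-open AlignmentFile to the pool.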

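import sys
import time
from multiprocessing import Pool

import pysam


def countReads(chrom, BAM):
    """Count the alignments on one chromosome of an indexed BAM file."""
    count = 0
    # Here's the fix: open the file inside the worker, so each process
    # gets its own handle instead of receiving one from the parent.
    bam = pysam.AlignmentFile(BAM, 'rb')

    Itr = bam.fetch(str(chrom), multiple_iterators=True)
    for Aln in Itr:
        count += 1
    bam.close()
    return count


if __name__ == '__main__':
    start = time.time()
    chroms = [x + 1 for x in range(22)]
    cpu = 6
    BAM = sys.argv[1]
    pool = Pool(processes=cpu)
    # Pass the path (a plain string), not an open AlignmentFile.
    results = [pool.apply_async(countReads, (chrom, BAM)) for chrom in chroms]
    pool.close()
    pool.join()
    # get() returns each count and re-raises any error from the worker.
    for chrom, result in zip(chroms, results):
        print(chrom, result.get())
    end = time.time()
    print(end - start)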
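A side note on apply_async: it returns an AsyncResult, and an exception raised in a worker only surfaces in the parent when get() is called on that result. If the results are discarded, a failing worker can go unnoticed. Here is a minimal sketch of collecting successes and failures separately. The worker may_fail and the helper run are hypothetical stand-ins for illustration, not part of pysam or of the code above.

```python
from multiprocessing import Pool


def may_fail(chrom):
    """Hypothetical worker: fails for one input to stand in for a real error."""
    if chrom == "2":
        raise ValueError("could not process chromosome " + chrom)
    return chrom, 0


def run(chroms, processes=2):
    """Submit one task per chromosome; return (successes, failures)."""
    done, failed = [], []
    with Pool(processes=processes) as pool:
        pending = [(c, pool.apply_async(may_fail, (c,))) for c in chroms]
        for chrom, result in pending:
            try:
                # get() blocks until the task finishes and re-raises
                # any exception the worker raised.
                done.append(result.get())
            except ValueError as exc:
                failed.append((chrom, str(exc)))
    return done, failed
```

In a real script, call run() from under `if __name__ == '__main__':` so that platforms which start workers by re-importing the main module do not recurse.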

This makes sense to me - except, do you really need the multiple_iterators = True here, then?

From pysam documentation for fetch():

multiple_iterators (bool) – If multiple_iterators is True, multiple iterators on the same file can be used at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.

It is my understanding that in the code above, you moved the opening of the file (bam = pysam.AlignmentFile(BAM, 'rb')) into the worker function, so each separate process creates its own filehandle. Do you then also need to include multiple_iterators = True? That sounds like doing the same thing twice.

I am asking because I'd like to use something very similar, but the countReads() function would look something like this instead:

def countReads(regions, chrom, BAM):
    count = 0
    bam = pysam.AlignmentFile(BAM, 'rb')

    for start, stop in regions:
        Itr = bam.fetch(str(chrom), start, stop, multiple_iterators = True)
        for Aln in Itr:
            count += 1
    return count

Including multiple_iterators = True here would reopen the file for every region of the chromosome, which would make this a much slower process.

EDIT: I believe that this issue thread on pysam's Git repo confirms the claim above: multiple_iterators = True is only needed when using multiple iterators in the same process; when opening a separate file handle in each process, multiple_iterators = True should not be necessary.
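For what it's worth, here is a rough sketch of how the per-region version might look with one handle per worker task and the default multiple_iterators=False. It assumes an indexed BAM whose contig names match the keys you pass in (for example "1" rather than "chr1"), and regions given as 0-based, half-open (start, stop) pairs. The names count_reads_in_regions, build_tasks and count_all are made up for this example. A read overlapping two regions is counted once for each.

```python
from multiprocessing import Pool


def count_reads_in_regions(bam_path, chrom, regions):
    """Count alignments overlapping each (start, stop) region of one contig."""
    # Local import: only the worker processes need pysam in this sketch.
    import pysam

    count = 0
    # One handle per task, reused for every region on this contig.
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for start, stop in regions:
            # Each iterator is exhausted before the next fetch() call, so the
            # default multiple_iterators=False is used and the file is not
            # re-opened per region.
            for _ in bam.fetch(chrom, start, stop):
                count += 1
    return chrom, count


def build_tasks(bam_path, regions_by_chrom):
    """Turn {chrom: [(start, stop), ...]} into one argument tuple per contig."""
    return [(bam_path, chrom, regions)
            for chrom, regions in sorted(regions_by_chrom.items())]


def count_all(bam_path, regions_by_chrom, processes=4):
    """Run one task per contig and return {chrom: count}."""
    tasks = build_tasks(bam_path, regions_by_chrom)
    with Pool(processes=processes) as pool:
        return dict(pool.starmap(count_reads_in_regions, tasks))
```

Usage would be something like count_all(sys.argv[1], {"1": [(0, 100000)], "2": [(5000, 20000)]}), called from under `if __name__ == '__main__':`. Only the path string and plain tuples cross the process boundary, so nothing pysam-specific has to be pickled.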
