QVINTVS_FABIVS_MAXIMVS · 7.1 years ago
I would like to parse a BAM file in parallel using pysam and multiple_iterators. Here is my code:
import pysam
import sys
from multiprocessing import Pool
import time

def countReads(chrom, Bam):
    count = 0
    #Itr = Bam.fetch(str(chrom), multiple_iterators=False)
    Itr = Bam.fetch(str(chrom), multiple_iterators=True)
    for Aln in Itr:
        count += 1
    return count

if __name__ == '__main__':
    start = time.time()
    chroms = [x + 1 for x in range(22)]
    cpu = 6
    BAM = sys.argv[1]
    bamfh = pysam.AlignmentFile(BAM)
    pool = Pool(processes=cpu)
    for x in range(len(chroms)):
        pool.apply_async(countReads, (chroms[x], bamfh,))
        #countReads(chroms[x], bamfh)
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)
I get this error when I run it:
TypeError: _open() takes at least 1 positional argument (0 given)
Beyond this it spits out a whole bunch of other errors. Can anyone help me to use multiprocessing to read a BAM file in parallel using pysam?
Thanks
This makes sense to me - except, do you really need multiple_iterators=True here, then? As I read the pysam documentation for fetch(), multiple_iterators=True re-opens the file so that several iterators can be used on the same file at the same time.

It is my understanding that in the code above you moved the opening of the file (bam = pysam.AlignmentFile(BAM, 'rb')), and therefore the creation of a separate filehandle, into each separate worker process. Do you then also need to include multiple_iterators=True? That sounds like doing the same thing twice.
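For concreteness, here is a minimal sketch of the per-process pattern I mean (the names are mine, not from the post above), assuming each worker receives only the BAM path and opens its own AlignmentFile:

import sys
from multiprocessing import Pool

import pysam

def countReads(chrom, bam_path):
    # Open a separate file handle inside this worker process,
    # so no AlignmentFile object ever has to be pickled.
    count = 0
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for aln in bam.fetch(str(chrom)):
            count += 1
    return count

if __name__ == '__main__':
    bam_path = sys.argv[1]
    chroms = [str(x + 1) for x in range(22)]
    with Pool(processes=6) as pool:
        counts = pool.starmap(countReads, [(c, bam_path) for c in chroms])
    print(sum(counts))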
I am asking because I'd like to use something very similar, but the countReads() function would look something like this instead:
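Something along these lines, where the per-chromosome list of regions is entirely hypothetical and just stands in for whatever intervals I end up using:

import pysam

def countReads(chrom, regions, bam_path):
    # regions is a hypothetical list of (start, end) tuples for this chromosome.
    counts = {}
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for start, end in regions:
            n = 0
            # One file handle per process, one plain fetch() per region.
            for aln in bam.fetch(str(chrom), start, end):
                n += 1
            counts[(start, end)] = n
    return counts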
Including multiple_iterators=True here would reopen the file for every region of the chromosome, which would make this a much slower process.

EDIT: I believe that this issue thread on pysam's Git repo confirms the claim above: multiple_iterators=True is only needed when using multiple iterators in the same process; when opening a separate file handle in each process, multiple_iterators=True should not be necessary.
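For contrast, this is my understanding of the single-process situation where multiple_iterators=True does matter: two iterators alive at the same time on one handle (the file name, contig and coordinates below are placeholders):

import pysam

# Single process, single AlignmentFile handle.
bam = pysam.AlignmentFile('example.bam', 'rb')

# Two iterators used at the same time on the same handle: without
# multiple_iterators=True they would share one underlying file pointer
# and interfere with each other, so the re-open is needed here.
outer = bam.fetch('1', 0, 1000000, multiple_iterators=True)
inner = bam.fetch('1', 1000000, 2000000, multiple_iterators=True)

for a, b in zip(outer, inner):
    # Process reads from both regions in lockstep; the two iterators
    # advance independently because each has its own re-opened file.
    pass

bam.close()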