Hello,
Is it possible to parse the read entries in an (unaligned) BAM file one by one, without requiring huge amounts of memory?
I've tried samtools view <bamfile> | <python script>, as well as using the pysam.AlignmentFile() parser from inside the script, but both solutions end up taking over 60 GB of RAM on our cluster for a 6 GB BAM. I do believe we have nodes that can handle a lot more RAM than that, but I'm still annoyed by memory requirements that wouldn't fit on a laptop if needed.
I've briefly tried to look around, but nobody seems to be asking this question with regard to simply parsing a BAM. Most memory-related topics for samtools seem to revolve around sorting.
So, is there a more resource-efficient way to parse BAMs progressively, or does the whole thing need to be decompressed into memory first (presumably that's what's happening) before the entries can be accessed sequentially?
Thanks!
Parsing a BAM file read by read should only require the memory needed for the read currently open. Neither of the approaches you've outlined should necessarily mean decompressing the whole file into memory. Two possibilities:
1. At some point the input iterator in your code is being converted to a list, for example by using .readlines() or list().
2. Something in your code means that more information is being stored than is necessary for each read. Just storing every read name from a big file is going to take a lot of memory, so it's imperative that only the absolute minimum information is retained about each read.
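For reference, a minimal streaming loop with pysam might look roughly like the sketch below (the file name is just a placeholder); iterating the AlignmentFile directly yields one read at a time and shouldn't hold more than that in memory:

    import pysam

    # Minimal streaming sketch; "example.bam" is a placeholder path.
    # check_sq=False allows opening an unaligned BAM that has no @SQ header lines.
    with pysam.AlignmentFile("example.bam", "rb", check_sq=False) as bam:
        count = 0
        for read in bam:               # yields one AlignedSegment at a time
            seq = read.query_sequence  # work on the current read only
            count += 1                 # nothing else is retained
    print(count, "reads processed")

Memory use in a loop like this should stay roughly flat regardless of the size of the BAM, as long as nothing is accumulated per read.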
Perhaps you'd be able to share some of your code with us?
It would be difficult to post the whole working code, as it is part of a bigger thing, but I am not converting the iterator to a list.
What I'm saving is lists of match positions and short substrings. At the end I make collections.Counter() objects out of those and print them out. In retrospect, the collective text representation of the counters is ~130 MB, so I did a rough calculation: if I were saving just the 50 bp read sequence from each entry, that would easily add up to some tens of GB. I'm not saving the whole reads or read names, but the amount of smaller stuff I do save probably adds up to a similar size in the end.
So it is actually plausible that it's my script that's being the memory hog. Thanks for pointing that out!
The reason I thought it would be samtools is that I've gotten used to even pure samtools operations like merge being memory hogs.
I guess now I have to figure out a way to do my stats on the fly instead of keeping it all in lists the whole time. Maybe dictionaries of counters, instead of lists of repeated values.
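Something like this, roughly (the pattern names and positions are made up):

    from collections import Counter, defaultdict

    # One Counter per pattern, updated as matches are found,
    # instead of appending every position to a list and counting at the end.
    position_counts = defaultdict(Counter)    # pattern -> Counter of match positions

    def record_match(pattern, position):
        position_counts[pattern][position] += 1

    record_match("ACGT", 12)
    record_match("ACGT", 12)
    record_match("TTAA", 3)
    print(position_counts["ACGT"][12])        # 2

That way memory only grows with the number of distinct (pattern, position) pairs, not with the number of reads.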
What do you want to do with the BAM file?
I want to get my hands on the read sequences to collect statistics on the presence and position of certain sequence patterns.
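Roughly the kind of pass I now have in mind, combining the streaming iteration with on-the-fly counters (the patterns and file name are placeholders, not my actual code):

    import pysam
    from collections import Counter, defaultdict

    patterns = ["ACGT", "TTAA"]            # placeholder patterns
    presence = Counter()                   # pattern -> number of reads containing it
    positions = defaultdict(Counter)       # pattern -> Counter of match start positions

    with pysam.AlignmentFile("example.bam", "rb", check_sq=False) as bam:
        for read in bam:
            seq = read.query_sequence
            if seq is None:
                continue
            for pat in patterns:
                hit = seq.find(pat)
                if hit != -1:
                    presence[pat] += 1
                while hit != -1:           # record every occurrence in this read
                    positions[pat][hit] += 1
                    hit = seq.find(pat, hit + 1)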