Hey everyone :)
Here's a 100% python BAM file reader, and it has some very good performance characteristics:
https://github.com/JohnLonginotto/pybam
This little proof-of-concept script aims to do just 1 job - if you want to iterate over all reads in a BAM file, it will do that for you as quickly as possible, giving you the data back as either raw BAM or SAM.
>>> import pybam
>>>
>>> pure_bam_data = pybam.bgunzip('./ENCFF001LCU.bam.gz')
>>>
>>> parser = pybam.compile_parser(['pos','mapq','qname'])
>>>
>>> for read in parser(pure_bam_data):
.... print read
.... break
....
(3000742, 0, 'SOLEXA1_0001:4:49:11382:21230#0')
The script has two parts - a class for decompressing the BAM file and parsing out it's header, and a generator function that gets defined at run-time to decode the packed binary BAM data into SAM data that you can use in python.
Originally I was skeptical that Python could do this well, because unpacking binary data is not something python is known for being particularly good at - however, after playing around a bit I see that on modern Python implementations it is actually fairly reasonable - and on PyPy it is as good as any C. So on an uncompressed BAM file, doing it all in python is at least 2-3x faster than using Pysam - which has to make a call out to C land and then bring your value back into python every time you want something. If you want more than 1 thing from a read, this really adds up. On a 23Gb Encode file (ENCFF001IMQ.bam), I ran the two code snippets:
pysam - 20min 53s
import pysam
pure_bam = pysam.AlignmentFile('./ENCFF001IMQ.bam')
for read in pure_bam:
read.pos
read.flag
read.rname
read.mapq
read.rnext
read.pnext
read.tlen
pybam - 12min 53s
import pybam
pure_bam = pybam.bgunzip('./ENCFF001IMQ.bam')
parser = pybam.compile_parser(['pos','flag','rname','mapq','rnext','pnext','tlen'])
for read in parser(pure_bam):
read[0]
read[1]
read[2]
read[3]
read[4]
read[5]
read[6]
However, the above comparison is a little unfair, and I'll explain why.
Decoding the BAM data into SAM is actually fairly irrelevant to the overall BAM-reading time. This is because 80-90% of the work that the computer has to do is decompressing the BAM data, not decoding it, which I found surprising. In short, even though I spent most of my time in this project optimizing the decoder, speeding up the decompression has a much bigger effect than any time-savings gained from having a fast decoder.
Unfortunately, decompressing gzip files (which is what a BAM essentially is) is notoriously broken in python. Despite being two modules for it (zlib and gzip), they are both 2-3x slower than the gzip program you'll find already installed on your machine (http://aripollak.com/pythongzipbenchmarks/), or the C decompression code used by pysam/htspython. The take-home message here is that pure-python decompression/decoding is only faster than pysam when the compression-to-data ratio is high -- i.e. there are many reads packed into as little compressed space as possible, so we spend more time on decoding than decompression.
Unless, of course, you use your local copy of unix gzip (or even better, a parallel decompressor like pigz) and pipe the uncompressed data to python. In the comparison above, thats exactly what its doing - detecting I have pigz installed, subprocessing it out (with 3 threads, because I don't see much improvement above 3 on my 4-core machine), and then reading from its stdout. Using gzip (1 thread) its just a bit faster than pysam, by only a few minutes. Also it was run in pypy, which I recommend, but again decoding is not as significant as decompression so you wont see a huge change running it in standard python.
Really, honestly, no one is going to use my code -- I'm only putting it up here to demonstrate two things:
1) Future next-gen file format creators should aim to make their formats quick to decompress by design. The actual encoding of the data before compression is fairly irrelevant. A typical L2 cache these days is at minimum 256K, with most personal computers having 512K+, and compute servers having much much more - so our compression blocks should probably be increased to double or quadruple what it is for BAM (65KB hard-limit). Formats should also be designed for parallel decompression. As good as pigz is, parallel gzip was an afterthought. There are compression/decompression algorithms designed specifically for multi-core decompression from a stream, and we should take advantage of them :) If you have written a BAM parser, or you read/write BAM data - you can dramatically improve performance by making use of multiple cores.
2) Python doesn't have to be slow any more. samtools view -c
on that 23Gb file takes 11min18s , while on python with 3 pigz decompression threads it takes 11min58s. The -c
flag for samtools uses code optimized just for counting, and skips the unpacking of all the other data. The python code however is not optimized for counting, so I'm sure you could write a counter in python that's just as fast as the C (particularly if there was a better gzip module). This, however, only holds true if you're using pypy, and that's the big gotcha. There is a growing split in all interpreted languages between the vanilla run time environments, and the JIT'd/optimized environments (like pypy for python). The JITs just keep getting faster and faster as new CS ideas about how to do it are developed. I could have, for example, used some of the special pypy string concatenation functions to really speed up the decoder, particularly for the seq/qual, but then it wouldn't work on regular python. This is troubling, and is touched on here: https://vimeo.com/61044810 - no one wants there to be two kinds of python. But no one wants to wait around for hours for their python/perl/ruby programs to execute either. I don't know what will win out, community inertia or performance, but one thing is certainly true its up to the developers of new python software. If you write code being mindful of performance optimizations that JITs can do, and try to stay within the standard library or only use big external libraries like numpy - the more likely your users will be able to run your code at C-speeds using a JIT, with none of the issues usually associated with write/compiling C in Bioinformatics.
When you get a chance, could you maybe add a bit more information to your README.md page on github? Ideally some variant of this post would be a start.
Yes definitely - I was thinking of cutting the blurb in the body of the code out and putting it in the readme. And writing some tests. And putting it on pypi. But man, I have lab meeting in 5 days and I cant present any of this, so i'm going to have to put it on hold until a free weekend or something. The README though, I should do that now :)
I'll have Steffen harass you about this in your lab meeting then :)
1) Decoding speed and parallelization have long been considered in BAM and CRAM. BAM prefers zlib over bzlib mainly because it is much faster on decompression. In the early days, I rejected the request to switch to bzlib precisely for the sake of performance. BAM is easily parallelizable as BGZF blocks are independent. Samtools does not take the full advantage of possible parallelization because it is a bit old and doing that would need to rewrite the BGZF module. Sambamba is designed for parallelization from ground up. It is better at multi-threading. CRAM can also be parallelized. The parallelization part is more efficient than samtools.
This looks like a cool experiment. I see that you reference Peter Cock's bgzf module. Can I ask if you tried implementing virtual offsets using the BAM index? I've wanted to implement something like this in my simplesam module for a while, but am currently using samtools for bam reading.
Thanks Matt!! I must admit, I saw your simplesam module a few days back on another thread and I thought it was awesome, which is why I thought it might be useful to lift out the part of my code that does the BAM reading from another project and make it it's own thing. Originally I was going to call it simplebam and then post it under your post saying "maybe you can use something here to remove your samtools dependency?", but then I thought that might be a bit pretentious since my code is very very, er, basic. No unit tests. No integration tests. No documentation. Not even sure SAM data is reported perfectly (I think the pos values are off-by-one and I dont know how the qual values are turned to phred scores). So.. yeah its not exactly done, just a proof of concept that hopefully would get someone else interested in taking it on.
As for making use of the indexes... well, uh, its a real pain. New versions of bgzip make sure compression blocks start and end at the beginning/end of a read, and the header gets its own block too. But older versions (and the data im using to test) doesnt do any of that, so you cant jump to a random block using the virtual offsets and start reading the BAM data because you might be in the middle of a read, and much like a reading-frame issue for DNA, you'll unpack garbage. I mean, you could try and find some tell-tale signs like the QNAME being a string of ASCII followed by a \x00, but its not proper. The only proper way is to use the BAM index (if it exists), which i presume tells you which byte of the uncompressed data a indexed read starts, which you use in conjunction with the virtual offset. That would probably double the amount of code needed, introduce some new bugs, and so if you're going that far things stop being simple. I'm not saying its a bad idea, but it does become a proper project then. Like what if the user doesnt have a .bai? What if the .bai is wrong? What if the data is sorted weird? So because of that, I decided if I wanted to keep things simple I should just dump the BAM data to SQL instead, where there's just 1 proper index, you get better compression, on-the-fly header generation, its writable in-place, etc etc etc etc. That's the bam+ format alluded to in a pybam comment somewhere.
So yeah, for about 100 lines-of-code you could make a "BAM bisect" to find quick-ish the read you need from a BAM file - but it would involve a lot of guess work. I doubt it would ever fail though. For about 1000 lines you could do full virtual offset and bai decoding -- and im not sure how much better that would be :P
On size, I guess it depends on the SQL implementation. I am not that certain you can get better compression than BAM, let alone CRAM. Again depending on the SQL implementation, scanning through a table is likely to be slower than reading a BAM. In addition, with SQL, you will lose streamability, one seek per query and direct http/ftp access. Even before BAM, people have been talking about alignment in SQL, but till now I have not seen practical implementations that match the performance and the necessary feature set of BAM/CRAM.
On indexing, are you thinking of a new indexing algorithm? To implement the standard BAM indexing, you don't need to care if an alignment is bridging two BGZF blocks or not.
What I believe is happening with these modules is the decompression and interpreter are running in the same process, whereas opening a new shell to run zcat effectively spreads the work out to multiple cores (and more so with pigz). The same is true in Perl, where doing the work with the core module is about 2X slower than opening zcat for reading. There is also a gzip I/O layer for Perl that is as fast (or faster in some tests) as opening zcat, but it is not a core module. I only mention that because I think another reason these modules (Perl/Python/whatever) are slower is all the magic associated with objects and tied filehandles, as opposed to just reading from a stream. So, I'm not sure I'd say it is broken in Python but rather it is a "feature" of this type of runtime/environment or whatever you want to call it. All these new compilers and runtimes are exciting but it is a little overwhelming to see the number of different directions progress is taking in terms of separate projects. Cool post, it gave me a lot to think about!
I was trying your code to read the bam file but found that there is not 'pip install pybam' but it's 'pip install pybamm'. Can you please tell me how did you install the pybam? because, 'pip install pybam' and 'import pybam' are not working.