I recently made a naive benchmark of a few popular languages for doing really simple bioinformatics processing of FASTA files. I chose to count the GC frequency, as a problem simple enough that I could figure out how to implement it in a number of different languages.
After publishing the initial benchmarks (between Python and the D language), I got a lot of feedback and suggestions for improvements from experts in the different languages, and the comparison grew quite a bit. In the end I put together a comparison of no less than the following languages, sorted approximately from slowest to fastest (when parsing line by line):
The final results are available in this blog post:
My personal take-home lesson from this comparison is that by using a modern compiled and garbage-collected language like Go or D, you can get rather close to C/C++-like performance, while not having to fear introducing too many catastrophic memory leaks (thanks to the garbage collector) and, foremost, while retaining much easier-to-read code. As an example, have a look at these code examples:
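A minimal GC-counting sketch in Go, showing the line-by-line approach (a simplified sketch, not the optimized benchmark code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Open the FASTA file given as the first command-line argument
	// (hypothetical usage: gccount input.fasta).
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer file.Close()

	var gc, at int
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		// Skip empty lines and FASTA header lines (starting with '>').
		if len(line) == 0 || line[0] == '>' {
			continue
		}
		for _, c := range line {
			switch c {
			case 'G', 'C', 'g', 'c':
				gc++
			case 'A', 'T', 'a', 't':
				at++
			}
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("GC fraction: %.4f\n", float64(gc)/float64(gc+at))
}
```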
Personally, I have to say I was highly impressed by the D version, due to its high readability even in the face of the serious optimizations. On the other hand, although the Go version is a bit harder to read, Go is an increasingly popular language with a decently sized community, and it has very easy-to-use features for writing threaded programs (although I could not get any speedups from threading for this simple example).
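For a flavor of those features, here is a generic goroutine-and-channel sketch (an illustration of the primitives only, not the threaded variant I tried in the benchmark):

```go
package main

import (
	"fmt"
	"strings"
)

// countGC counts G/C bases in one chunk and sends the result on a channel.
func countGC(chunk string, results chan<- int) {
	n := 0
	for _, c := range chunk {
		if c == 'G' || c == 'C' || c == 'g' || c == 'c' {
			n++
		}
	}
	results <- n
}

func main() {
	// Hypothetical in-memory example: fan sequence chunks out to
	// goroutines and sum the partial counts coming back on the channel.
	chunks := strings.Split("ATGCGCGT AATTGCGC GGCCATAT", " ")
	results := make(chan int)
	for _, chunk := range chunks {
		go countGC(chunk, results)
	}
	total := 0
	for range chunks {
		total += <-results
	}
	fmt.Println("GC count:", total)
}
```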
In terms of ready-to-use libraries, there is an up-to-date bioinformatics package in Go, called BioGo, that seems promising and high-quality. (For D, there is some very rudimentary support for working with bioinformatics-related data in the DScience library, although it does not seem to be actively maintained.)
In the end, I would put my bet on Go for the time being, due to its much bigger community and the better library support (BioGo), although I like D a lot too.
If you would really rather not learn a compiled language, trying to run your Python code with PyPy might be an option too, of course (although I have heard about some incompatibilities with the NumPy/SciPy packages etc., which sounds worrying to me).
Here's converting a 10G fastq file to fasta:
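A minimal sketch of one way to do this in Go, assuming the standard four-line FASTQ record layout (@header, sequence, +, qualities):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Read FASTQ on stdin, write FASTA on stdout, assuming well-formed
	// four-line records.
	in := bufio.NewScanner(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for lineNo := 0; in.Scan(); lineNo++ {
		switch lineNo % 4 {
		case 0:
			// Turn the @header line into a >header line.
			fmt.Fprintf(out, ">%s\n", in.Text()[1:])
		case 1:
			// Copy the sequence line through unchanged; the separator
			// and quality lines of each record are dropped.
			fmt.Fprintln(out, in.Text())
		}
	}
	if err := in.Err(); err != nil {
		panic(err)
	}
}
```

Run it as, say, go run fastq2fasta.go < reads.fastq > reads.fasta (hypothetical file names).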
On my system, a 10G file took 2m20s using 67% CPU, so parsing isn't terribly CPU intensive (disk IO is the bottleneck here, not CPU), and using a lower-level language will make it even less of a problem.
Hi. You forgot to mention something important: your CPU. Without this info you cannot tell if 2 minutes is slow or fast! Also, since your program takes 67%, I conclude the program is multi-threaded. But we don't know how many cores your CPU has.
If the program takes 67% of one CPU core, then it's more reasonable to assume that it's IO bound, and if any assumption is to be made about the number of threads from this, we would say it's a single thread that is waiting 33% of the time.
Have you checked out FastQC to make sure you aren't reinventing the wheel? http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Make sure that you do everything with compression (gzip, etc.) when possible. This reduces both network bandwidth and disk IO. If you are IO bound, doing so can actually make the task faster in some cases.
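In Go, for instance, reading gzipped input just means wrapping the file in a decompressing reader; a minimal sketch using the standard compress/gzip package (file name hypothetical):

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
)

func main() {
	// Stream a gzip-compressed file without ever uncompressing it on disk.
	file, err := os.Open("reads.fastq.gz")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	gz, err := gzip.NewReader(file)
	if err != nil {
		panic(err)
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	lines := 0
	for scanner.Scan() {
		lines++ // process each decompressed line here
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Println("lines:", lines)
}
```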
I think it's time for another programming challenge