Hello all :)
Background: So I am just at the end of writing a program that runs through a BAM file gathering the statistics the user requested via command-line parameters. Since the adoption of pypy in this community will directly affect how I write this program, it would be great to get some feedback on how many people have pypy installed :)
Thank you so much!
I will delete the two answers below once the poll ends to reset my rep counter, because that's a little unfair, and I will post the results below this text.
The results are in! As of 12th March 2016, it appears:
People with PyPy: 5
People without PyPy: 11
I'm sure the 'without' is significantly under-represented since, as Ram very correctly pointed out, some people may not know if they have it or not, and don't want to spend time trying to find out. I would probably guesstimate that access to pypy is around 30-40%, while usage of pypy is probably even lower still. This isn't helped by the fact that there aren't so many tools which state "works with pypy!", and perhaps going forward python developers should consider helping this along by using pypy-compatible imports like htspython over pysam, or functions in Numpy that are known to work in Numpypy (a significantly faster Numpy re-written for pypy).
Thank you all for participating - perhaps we can take another snap-shot in a year or two and see if these numbers have changed :)
BUMP for 2017 :) If you answered previously it would be good to know if you changed to/away from PyPy too :)
Maybe change the title of the post to match the question being asked. Some people may have PyPy installed but never use it.
Good point - changed :)
Some thoughts --
How much time does the exec() trick save? Is the function call overhead really significant versus the pure-Python computation being done inside the stats function(s)?
The PyPy 5.0 release notes suggest you can embed PyPy within a C program, if you like. In any case it looks like there are options for effectively bundling PyPy with the code you distribute.
Alternatively, it's possible Numba would let you JIT the stats-collecting function that needs to be fast, without much disruption to the rest of your code.
There might be something in pytoolz or cytoolz that will let you use partial evaluation more elegantly here. I'm not sure exactly how to do that besides your exec() trick, though.
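For anyone wanting to measure the first point, here is a minimal sketch of the two approaches. It assumes reads are plain dicts and that the requested stats are simple field lookups. The names STAT_EXPRS, make_dispatch_collector and make_exec_collector are made up for illustration and are not taken from the actual program.

```python
import timeit

# Hypothetical whitelist of stat name -> source expression. Only these
# fixed strings are ever passed to exec(), never raw user input.
STAT_EXPRS = {
    "flag": "read['flag']",
    "rname": "read['rname']",
    "tlen": "read['tlen']",
}


def make_dispatch_collector(stats):
    """Collect stats by calling one small function per stat, per read."""
    getters = [lambda read, key=name: read[key] for name in stats]

    def collect(read):
        return tuple(getter(read) for getter in getters)

    return collect


def make_exec_collector(stats):
    """Generate the source of one flat function and compile it with exec()."""
    body = ", ".join(STAT_EXPRS[name] for name in stats)
    source = "def collect(read):\n    return (" + body + ",)\n"
    namespace = {}
    exec(source, namespace)
    return namespace["collect"]


read = {"flag": 99, "rname": "chr1", "tlen": 250}
stats = ["flag", "rname", "tlen"]
by_dispatch = make_dispatch_collector(stats)
by_exec = make_exec_collector(stats)

if __name__ == "__main__":
    # Both collectors return the same tuple; only the call overhead differs.
    print(by_dispatch(read), by_exec(read))
    print("dispatch:", timeit.timeit(lambda: by_dispatch(read), number=200000))
    print("exec:    ", timeit.timeit(lambda: by_exec(read), number=200000))
```

Timing both on the same interpreter (CPython and then PyPy) should show whether the generated function is worth the loss in readability for a given set of stats.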
So I ran this stats program on a 1Gb BAM file, calculating FLAG, RNAME, TLEN and GC% for every read in the file, and sticking it in a dictionary of counts.
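As a rough idea of what that per-read work looks like, here is a sketch that uses plain tuples as stand-ins for BAM records, so it runs without pysam or htspython. The helper names gc_percent and count_stats are hypothetical, and rounding GC% to a whole number is an assumption made to keep the number of distinct keys small.

```python
from collections import Counter


def gc_percent(seq):
    """Return GC content as a whole-number percentage (0 for an empty read)."""
    if not seq:
        return 0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return int(round(100.0 * gc / len(seq)))


def count_stats(reads):
    """Tally how often each (FLAG, RNAME, TLEN, GC%) combination occurs."""
    counts = Counter()
    for flag, rname, tlen, seq in reads:
        counts[(flag, rname, tlen, gc_percent(seq))] += 1
    return counts


# Stand-in records: (FLAG, RNAME, TLEN, sequence). In a real run these
# fields would come from whichever BAM reader is in use.
reads = [
    (99, "chr1", 250, "GGCCAATT"),
    (99, "chr1", 250, "CCGGTTAA"),
    (147, "chr1", -250, "AAAA"),
]
counts = count_stats(reads)
```

Apart from the GC% loop, this is mostly dictionary updates.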
The results below are the times with/without pypy, with/without exec(), and with pysam/htspython. Because htspython is written for CFFI (which is a project closely affiliated with pypy), it works in pypy while pysam does not (and probably never will). Both pysam and htspython work in regular CPython though.
Even though pypy is clearly faster, this is actually a pretty bad use-case for pypy, because apart from calculating GC%, we're not doing a whole bunch of math that pypy can optimise. Most of these speed benefits are just from interfacing with the C library that reads the BAM file (hts) in a more efficient way. It also looks like htspython - which is still an extremely young project by brentp - could prove to be even faster going forwards, based on the CPython times. If I added more stats to gather for this test, or increased the file size, the pypy times would have dropped like a stone relative to the CPython version. However - this should actually be very interesting to anyone who uses python for reading BAM files, because whatever your code is right now under pysam, you can get at least a 2x speedup in reading the file alone using pypy & htspython. And to show I'm not pulling numbers out of the air, here's a screen shot of whatever.
To answer your other questions:
I can't write in C, although I think the future of efficiency in python will be putting your main loop in C. 99% of my code is checking user input, calculating stat dependencies, ordering the stats so they are calculated in the most efficient manner, etc etc. I would hate to do all that fluff in C. But it seems like in the future we'll be able to have our cake and eat it.
I've tried Numba in a variety of different projects and it never works for me. Honestly, I just don't think it works. If pypy is particular about what it can work with, Numba is just impossible. I've tried other similar 'decorate and forget' python things, but they never come close to what they promise. Cython did perhaps, but everything else comes nowhere near Java speeds.
I've never heard of either pytoolz or cytoolz, so I will definitely check them out right now! Thanks :D
Tomorrow morning I will delete the three poll answers below and collect the final scores from this snap-shot survey. Although we had a few pypy users at the end of the day, <50% adoption means that right now, if I write my code for htspython without exec (which is the best bang for your buck because functions as strings are horrible), it will severely discriminate against the users without pypy. So I will probably go with exec(). -sigh-
Thank you everyone that voted so far! It really really helped :)
If htspython needs cffi to work, be aware that getting cffi itself installed and working can prove problematic for the average user. So don't assume that regular users will be able to pip install your_package and have it work (this is the sole reason we're not using brentp's bw-python package in deepTools). If your target user base is a bit more advanced then this isn't an issue, but keep it in mind.
Yeah, I kind of glossed over the fact that installing hts from inside the pypy virtualenv is the technological equivalent of washing a cat :/ However, if the resultant program takes 1 hour instead of 3 hours, then the user has 2 hours to figure it out before it becomes a waste of time :P But for deepTools, where most things are crazy fast to begin with (like compute matrix is just a few seconds), then you're right - unnecessary complexity.
Is washing a cat difficult?
Almost my whole day today has involved washing my cat -_-; http://imgur.com/a/QOcif
We tried speeding it up with a blow dryer, but he hid on my partner's head... cats are a pain in the *.
You need to post options for people to vote :)
Why would I need to have PyPy installed?
You don't need it installed, but for a Python tool developer it is very useful to know how many people have access to a PyPy environment. There are a few times where I have thought "oh, I could write this function to be 1 minute in CPython, and 10 seconds in PyPy, or I could write it to be 2 minutes in CPython, and 1 second in PyPy. Which should I optimise for?"
This doesn't happen often, but when it does these sorts of stats can influence one's decision.
In that case, can you not say that PyPy is needed if you want to use my software and leave it at that?
Well, the typical alternative is that I just make an if/else branch that checks the interpreter type. I did this in a previous tool that can use array.array (CPython), Numpy (both, but need numpy) or CFFI (PyPy) to store the array, whatever the user has installed - but the result was a real mess to maintain. I would develop the code for one branch, say the numpy version, and then I'd have to figure out post-hoc how to replicate those changes with array.array and CFFI, and so it wasn't fun anymore. I mean the result is you get the fastest code whatever you have installed - but it's a pain for me and so, eh, if I don't have to worry about supporting CPython because people are moving over to PyPy then that would be great, but I think more likely they won't install PyPy unless the tool doesn't work without it, or they already have it installed.
TL;DR I'll continue to make software with 1 pathway but it's 66% optimized for regular Python :P
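For reference, the interpreter-check branching mentioned above can be sketched roughly like this. It is only an illustration of the idea, not the code from the earlier tool: pick_backend and make_counts are made-up names, and the preference order (CFFI on PyPy, then Numpy, then array.array) is an assumption.

```python
import platform
from array import array


def pick_backend():
    """Choose an array backend from the interpreter and installed packages."""
    if platform.python_implementation() == "PyPy":
        try:
            import cffi  # noqa: F401 (only checking that it is importable)
            return "cffi"
        except ImportError:
            pass
    try:
        import numpy  # noqa: F401 (only checking that it is importable)
        return "numpy"
    except ImportError:
        return "array"


def make_counts(n, backend):
    """Return a zero-filled integer array of length n for the given backend."""
    if backend == "cffi":
        import cffi
        return cffi.FFI().new("long[]", n)  # C array, zero-initialised
    if backend == "numpy":
        import numpy
        return numpy.zeros(n, dtype="int64")
    return array("q", [0]) * n  # stdlib fallback, signed 64-bit integers


backend = pick_backend()
counts = make_counts(4, backend)
counts[2] += 1
```

Indexing works the same on all three, but anything beyond that (slicing, summing, resizing) differs per backend, which is where the maintenance pain comes from.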