https://github.com/endrebak/epic
The original SICER is great, but it does not work on very large files, hence this complete rewrite.
Thanks to the original authors for their great work: Chongzhi Zang, Dustin E. Schones, Chen Zeng, Kairong Cui, Keji Zhao and Weiqun Peng
Please try epic and report issues. Bugfixes should be out within 24 hours on weekdays, unless they are very demanding.
Requires a fairly recent version of the Python science stack (at least pandas 0.17, IIRC).
MIT license.
Will release a paper eventually.
Edit:
I really should find a better name. On paper I liked the epi/epic/epigenetics link but now when I hear it it sounds so boastful I cringe. And exorcised sounded like a slight on the original software... Mad MACS?
Name suggestions welcome.
Edit 2: I've removed bam support. There are a million good reasons for this. See The Pragmatic Programmer for some general ones: http://flylib.com/books/en/1.315.1.30/1/ ("The power of plain text")
If you prefer/need to use the original please see: https://github.com/dariober/SICERpy
Can you comment more on why you removed bam support?
Removing BAM support was a bad idea in my opinion. I couldn't give you a million reasons why, but I'd certainly like to hear your reasoning. Was it just implementation/technical issues? Since you're doing all this in Python, perhaps I can help?
That chapter that you refer to is not applicable - it talks about using text to store parameters or simple, text oriented information.
That should not be construed as an argument against having data in binary format.
You can scroll down Istvan 🙂
Not sure what that means to scroll down - I don't see anything more that would argue otherwise. I do actually own and have studied the Pragmatic Programmer quite a bit - it used to be one of my favorite books - it has helped me become a better programmer.
It says that using a binary format where a text format could do is inefficient and counterproductive. But it clearly states that there are numerous use cases where a text format's weaknesses are "unacceptable".
You will probably rerun the analyses many times. Having to run a time-consuming conversion step (the most time-consuming one in the algorithm) each time would be silly. It is also IO-intensive, so parallel execution would not help much.
I am not just writing epic but also a lot of helper scripts for ChIP-Seq and differential ChIP-Seq. Adding a conversion step to bed in all of these before running the scripts would be a waste.
Also, where should I store the temporary bed files? Overflowing /tmp/ dirs is an eternal issue.
If I were to stream the data to bed using pipes, epic would not be fast anymore. I get a massive speedup from multiple cores if I use text files, presumably because the system knows it has the file in memory already. This is not the case if I start the pipe with bamToBed blabla | ...
There are many things that can go wrong when converting bam to bed, due to wonky bam files. I would get a bunch of github issues about "epic not being able to use my bam files" if I were to silently convert to bed within my programs.
I'll write more about it in the docs eventually.
If you want to discuss bam-support, please do it here: https://github.com/endrebak/epic/issues/44
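For readers who only have BAM files, the convert-once-and-reuse workflow argued for above could be sketched roughly as follows. This is not part of epic: `ensure_bed` and `bedtools_bamtobed` are hypothetical helper names, and the default converter assumes that bedtools is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def ensure_bed(bam_path, bed_path, convert=None):
    """Convert bam_path to bed_path only if the BED file is missing or stale.

    Hypothetical helper, not part of epic. `convert` is any callable taking
    (bam, out) paths; it defaults to a bedtools call.
    """
    bam, bed = Path(bam_path), Path(bed_path)
    if bed.exists() and bed.stat().st_mtime >= bam.stat().st_mtime:
        return bed  # cached BED is at least as new as the BAM: skip the slow step
    if convert is None:
        convert = bedtools_bamtobed
    # Write next to the final file (not in /tmp) and rename when finished,
    # so an interrupted run never leaves a half-written BED behind.
    tmp = bed.with_name(bed.name + ".tmp")
    convert(bam, tmp)
    tmp.replace(bed)
    return bed


def bedtools_bamtobed(bam, out):
    """Default converter. Assumes bedtools is installed and on the PATH."""
    with open(out, "w") as handle:
        subprocess.run(
            ["bedtools", "bamtobed", "-i", str(bam)],
            stdout=handle,
            check=True,  # fail loudly on wonky BAM files instead of silently
        )
```

Calling `ensure_bed("sample.bam", "sample.bed")` before each analysis would then pay the conversion cost only on the first run, or again when the BAM file changes.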
I think these are valid points - the simplicity of a tool and the reduced complexity is always important.
From my perspective it feels like a communication problem - to me it mainly sounded like "I removed BAM file processing because I read in The Pragmatic Programmer that binary files are bad" so I had to comment ;-)
We shouldn't be getting around wonky bam files by asking the user to figure it out on their own, but I understand your point that it's a lot of extra code/issues to look after and debug for other people.
Just a quick note while we're here: on GitHub you write
but this is only true if all the chromosomes are the same length, which is rarely the case, particularly for human. I think a more realistic speedup would be ~3-4x. If you could rework the scheduler into a queue of X CPUs, where X is the number of free system cores, this should give you a healthy speedup of around 4-5x, depending on how efficient disk IO was previously.
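To make the reasoning concrete, here is a rough sketch of how one could estimate the speedup from a worker queue. It assumes that the time per chromosome is proportional to its length and ignores disk IO entirely, so it gives an upper bound rather than a prediction. `simulated_speedup` is a made-up name and the lengths are toy values, not real genome data.

```python
import heapq


def simulated_speedup(task_costs, n_workers):
    """Estimate the speedup of a worker queue over running tasks serially.

    Tasks are handed out longest first, each to whichever worker frees up
    first. Assumes cost is proportional to chromosome length; ignores IO.
    """
    if not task_costs or n_workers < 1:
        raise ValueError("need at least one task and one worker")
    finish_times = [0.0] * n_workers  # when each worker becomes free
    heapq.heapify(finish_times)
    for cost in sorted(task_costs, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    makespan = max(finish_times)  # wall-clock time of the parallel run
    return sum(task_costs) / makespan


# Hypothetical relative chromosome sizes, for illustration only.
toy_lengths = [10, 9, 8, 6, 5, 4, 3, 3]
for workers in (2, 4, 8):
    print(workers, round(simulated_speedup(toy_lengths, workers), 2))
```

With unequal lengths the speedup can never exceed the total length divided by the longest chromosome, no matter how many cores are available, which is why one process per chromosome does not scale with the number of chromosomes.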