Question

Tool:epic: new ChIP-Seq caller based on the SICER algorithm

9

Entering edit mode

9.1 years ago

endrebak ▴ 980

The original SICER is great, but does not work on very large file hence this complete rewrite.

Thanks to the original authors for their great work: Chongzhi Zang, Dustin E. Schones, Chen Zeng, Kairong Cui, Keji Zhao and Weiqun Peng

Please try epic and report issues. Bugfixes should be out within 24 hours on weekdays, unless they are very demanding.

Requires a fairly recent version of the python science stack (at least pandas 0.17 IIRC)

MIT license.

Will release a paper eventually.

Edit:

I really should find a better name. On paper I liked the epi/epic/epigenetics link but now when I hear it it sounds so boastful I cringe. And exorcised sounded like a slight on the original software... Mad MACS?

Name suggestions welcome.

Edit 2: I've removed bam support. There are a million good reasons for this. See the pragmatic programmer for some general ones: http://flylib.com/books/en/1.315.1.30/1/ ("The power of plain text")

ChIP-Seq • 4.8k views

ADD COMMENT • link updated 22 months ago by Ram 45k • written 9.1 years ago by endrebak ▴ 980

1

Entering edit mode

If you prefer/need to use the original please see: https://github.com/dariober/SICERpy

ADD REPLY • link 9.1 years ago by endrebak ▴ 980

1

Entering edit mode

Can you comment more on why you removed bam support?

ADD REPLY • link 8.6 years ago by Ryan Dale 5.0k

0

Entering edit mode

Removing BAM support was a bad idea in my opinion. I couldn't give you a million reasons for why, but i'd certainly like to hear your thoughts on why. Was it just implementation/technical issues? Since you're doing all this in python perhaps I can help?

ADD REPLY • link 8.6 years ago by John 13k

0

Entering edit mode

That chapter that you refer to is not applicable - it talks about using text to store parameters or simple, text oriented information.

That should not be constructed as an argument against having data in binary format.

ADD REPLY • link 8.6 years ago by Istvan Albert 102k

0

Entering edit mode

You can scroll down Istvan 🙂

ADD REPLY • link 8.6 years ago by endrebak ▴ 980

1

Entering edit mode

Not sure what that means to scroll down - I don't see anything more that would argue otherwise. I do actually own and have studied the Pragmatic Programmer quite a bit - it used to be one of my favorite books - it has helped me become a better programmer.

It says that using a binary format where a text format could do is inefficient and counterproductive. But it clearly states that there are numerous use cases where a text format's weaknesses are "unacceptable"

Depending on your application, either or both of these situations may be unacceptable

ADD REPLY • link 8.6 years ago by Istvan Albert 102k

0

Entering edit mode

You will probably rerun the analyses many times. Having to run a time-consuming conversion step (the most time-consuming one in the algorithm) each time would be silly. It is also IO-intensive so parallell execution would not help much.

I am not just writing epic but a lot of helper scripts for ChIP-Seq and differential ChIP Seq. Adding a conversion step to bed in all of these before running the scripts would be a waste.

Also, where should I store the temporary bed files? Overflowing /tmp/ dirs is an eternal issue.

If I were to stream the data to bed using pipes, epic would not be fast anymore. I get a massive speedup from multiple cores if I use text files, presumably because the system knows it has the file in memory already. This is not the case if I start the pipe with bamToBed blabla | ...

There are many things that can go wrong when converting bam to bed, due to wonky bam files. I would get a bunch of github issues about "epic not being able to use my bam files" if I were to silently convert to bed within my programs.

I'll write more about it in the docs eventually.

ADD REPLY • link 8.6 years ago by endrebak ▴ 980

0

Entering edit mode

If you want to discuss bam-support, please do it here: https://github.com/endrebak/epic/issues/44

ADD REPLY • link 8.6 years ago by endrebak ▴ 980

0

Entering edit mode

I think these are valid points - the simplicity of a tool and the reduced complexity is always important.

From my perspective it feels like a communication problem - to me it mainly sounded like "I removed BAM file processing because I read in the Pragmatic Programmers that binary files are bad" so I had to comment ;-)

ADD REPLY • link 8.6 years ago by Istvan Albert 102k

0

Entering edit mode

We shouldn't be getting around wonky bam files by asking the user to figure it out on their own, but I understand your point that it's a lot of extra code/issues to look after and debug for other people.

Just a quick note while we're here, on Github you write

epic can use one core per chromosome, which should give a speedup of < ~22-25 (differs by species) by itself

but this is only true if all the chromosomes are the same length which is rarely the case, particularly for human. I think a more realistic speed up would be ~3-4x. If you could rework the scheduler to a queue of X cpus, where X is the number of free system cores, this should give you a healthy speedup to around 4-5x, depending on how efficient disk IO was previously.

ADD REPLY • link 8.6 years ago by John 13k

score 5 · Accepted Answer · 2016-04-06

5

Entering edit mode

9.1 years ago

dariober 15k

Hi endrebak- I started playing with epic and it looks great to me! I haven't done a side by side comparison with SICER but resullts look sane to me. A couple of points:

Output format: I think the column separator should be tab instead of space as this is pretty much the standard for many formats (bed, bedGraph etc.). Also I'd suggest avoiding spaces (and quotes) in the column names (e.g. use P_value instead of "P value").
I got this output where chromosome positions are in decimal format (e.g. 2360000.0). It didn't happen in other cases (maybe smaller datasets?).

I think it's a small bug that should fixed.

# /Users/berald01/.local/bin/epic --fdr-cutoff 1.0 --fragment-size 150 --gaps-allowed 3 --genome hg19 --input-string Input --keep-duplicates --nb-cpu 4 --window-size 10000 FILE ['data_tmp/ts023_TS-Input-19.bed.gz', 'data_tmp/ts030_TS9-100-Adapter13.bed.gz', 'data_tmp/ts031_TS9-100-Adapter15.bed.gz']
Chromosome  Start   End ChIP    Input   Score   P_value Fold_change FDR_value
chr1    2360000.0   2479999.0   5647.0  1811.0  62.2931303448   7.9850838978e-100   1.34384524519   1.59312541314e-99
chr1    3380000.0   3489999.0   3664.0  1235.0  41.88015277 1.04318667688e-46   1.27861097826   1.57594326759e-46
chr1    16830000.0  16979999.0  19060.0 5922.0  165.786124756   0.0 1.38708927844   0.0
chr1    17010000.0  17109999.0  10599.0 2749.0  110.524000087   0.0 1.66165199729   0.0
chr1    17190000.0  17299999.0  10248.0 2781.0  129.454390911   0.0 1.58813731276   0.0
chr1    24230000.0  24269999.0  3446.0  640.0   55.2620413087   0.0 2.32051949175   0.0
chr1    37100000.0  37649999.0  40273.0 10824.0 630.511687647   0.0 1.603526421 0.0
chr1    37700000.0  38759999.0  80812.0 25113.0 1228.13037325   0.0 1.38684262106   0.0

The way input and chip files are identified, i.e. via a --input-string match, is a little awkward. I guess this is due to docopt not allowing lists as arguments?
Small typo in the README file: Pandas (0.17 =<) should be Pandas (>= 0.17)

ADD COMMENT • link 9.1 years ago by dariober 15k

1

Entering edit mode

Thanks for looking at it. All good suggestions.

Thanks, I will just recast the start and end to int before printing.

Yes it is awkward. Argparse is the only parser that allows multiple lists as arguments. I guess I should change to argparse, even though it feels less Pythonic to me. I guess I can look at the MACS source which allows multiple input files.

I will change to tab and use underscore - I used space mostly for own convenience (viewing files without less -S when debugging).

I should also write a bit about how SICER is better than other software (I have found) for relatively weak signals over long stretches. Undersold feature not obvious from the paper.

I like all your suggestions so far, feel free to make more. Same goes for any features from SICER I have removed. I guess I should add bam support by bamToBed.

If you are interested in participating more I can move the repo to an org repo (I will of course do all the grunt work/heavy lifting though).

ADD REPLY • link 9.1 years ago by endrebak ▴ 980

1

Entering edit mode

Hi-

I guess I should add bam support by bamToBed.

If interested, I put here a java program that can be used to convert bam to bed on the fly. Essentially where you have

command = """{grep} -E '^{chromosome}\\b.*\\{strand}$' {bed_file} | ...

you could have

command = """java -jar /path/to/BamToBed.jar -i {bam_file} -chrom {chromosome} -q int -f int -F int | ...

You would have to ship BamToBed.jar with the bioepic package, but that should be ok as it depends only on JVM 1.6+, no need to install samtools/bedtools. (I gave this a try but I got mad about how to convince setup.py to put the jar file somewhere where it could be found from inside epic...!)

ADD REPLY • link 9.1 years ago by dariober 15k

1

Entering edit mode

For a tool's success is extremely important to look beyond current personal preferences and think about the large number of people that will have to use the tool. Of course it is difficult since initially the tool is being used only by the developers. If a tool takes a parameter in the form of:

FILE ['data_tmp/ts023_TS-Input-19.bed.gz', ...]

that immediately interferes to the standard way we use Unix tools. At this point one can't easily pass files to the script in the with simple bash patterns *.this_or_that , we have to painfully piece together arguments in bash etc.The same with space separated files, there is a reason that files are tab separated, everything else is basically hell...

ADD REPLY • link 9.1 years ago by Istvan Albert 102k

1

Entering edit mode

Hi Istvan- Just to clarify, the syntax reported above (FILE ['data_tmp/ts023_TS-Input-19.bed.gz', ...]) is how it appears in the output. What you actually execute on the command line is similar to the usual form epic -k -i Input file1.bed file_2.bed input.bed, so it's ok. Only thing is that you need a string that identifies the input files

Apparently docopt, which seems very well reviewed, deliberately doesn't support options with multiple argument (see this issue)... I'm not sure I agree but they seem to know what they are doing!

ADD REPLY • link 9.1 years ago by dariober 15k

1

Entering edit mode

ah, ok sounds good my bad. I tried to look it up in the readme and did not find the exact usage and assumed it was the way it was listed there. I would suggest to endrebak to add full usage examples in the main README.md that is very helpful in assessing what the tool actually takes as input and what form it does it.

ADD REPLY • link 9.1 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks for both of your comments. Will do.

Ps. I am not very familiar with java so I'd rather only require bedtools bamToBed for now, but I am very happy that more bioinformatics tools are coming to the JVM.

ADD REPLY • link 9.1 years ago by endrebak ▴ 980