I am calling peaks on ChIP-seq data (BED files mapped with BWA) using MACS 1.4. I observed a large number of peaks with a score of exactly 3100.00, while each of the surrounding score values belonged to a single peak only.
I did a bit of digging and discovered the following lines of code in PeakDetect.py (of MACS 1.4):
    p_tmp = poisson_cdf(tlambda_peak, local_lambda, lower=False)
    if p_tmp <= 0:
        peak_pvalue = 3100
    else:
        # mathlog is math.log, imported under an alias in PeakDetect.py
        peak_pvalue = mathlog(p_tmp, 10) * -10
In other words, MACS apparently calculates a lot of negative or zero p-values from its Poisson distribution, and these pile up in my output (I hope it's zero values, since negative probabilities make no sense whatsoever).
Unfortunately, this is clearly an artefact. Where could these values come from? I have no idea where to start looking.
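As a plausibility check on the underflow theory (using scipy's Poisson survival function as a stand-in for MACS's own poisson_cdf, which I haven't traced), extreme upper-tail probabilities really do underflow to exactly 0.0:

    from scipy.stats import poisson

    # Upper-tail probability, analogous to poisson_cdf(..., lower=False)
    print(poisson.sf(50, 5))    # tiny, but still representable as a double
    print(poisson.sf(2000, 5))  # 0.0: the true tail probability underflows double precision

So a single pileup that is extreme enough relative to the local lambda would suffice to hit the p_tmp <= 0 branch.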
What is worse, some of the scores are actually bigger than 3100, which is just plain weird if you invert the above formula: the corresponding p-value would have to be smaller than 1E-310, which is already below the smallest normal double (DBL_MIN ≈ 2.23E-308), so such a value can only exist as a subnormal, at the very edge of what double precision can represent at all.
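Inverting the formula in Python confirms the ceiling (assuming the reported score really is -10 * log10(p) of a double):

    import math, sys

    print(sys.float_info.min)                    # 2.2250738585072014e-308, smallest normal double
    print(-10 * math.log10(sys.float_info.min))  # ~3076.5
    print(-10 * math.log10(5e-324))              # ~3233.1, from the smallest subnormal double

So scores between roughly 3076 and 3233 are representable at all only via subnormal p-values.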
How are these values even technically produced? I’m at a total loss …
PS: the hard-coded value 3100 has vanished from the source of MACS2, but the code changed so completely that a side-by-side comparison of the two methods is impossible.
Negative probabilities? Maybe a case for http://www.biostars.org/post/show/7126/what-are-the-most-common-stupid-mistakes-in-bioinformatics/
But seriously, that code tests p <= 0, so it is not proof that any p_tmp is in fact < 0. I have seen such inclusive checks as good practice or conservative coding to prevent exceptions: in case of rounding errors, using == may miss slightly negative values. If you are in doubt, I would check the code of poisson_cdf and also print a warning if p < 0; just add debug messages.

Right, I don't know why I initially assumed p < 0. Of course p = 0 makes vastly more sense (but having the inclusive check in the code is decidedly bad practice, since it fails silently and masks bugs instead of alerting the user, and doesn't state the intended semantics clearly). Be that as it may, having p = 0 is similarly annoying: nothing in my count data should be able to produce such an abnormal value.
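For the record, the debug variant suggested above (which would also address my "fails silently" complaint) could look like this; a sketch only, reusing the names from MACS's PeakDetect.py:

    import sys

    p_tmp = poisson_cdf(tlambda_peak, local_lambda, lower=False)
    if p_tmp < 0:
        # should be impossible; if this fires, poisson_cdf itself is buggy
        sys.stderr.write("WARNING: poisson_cdf returned %g < 0\n" % p_tmp)
    if p_tmp <= 0:
        peak_pvalue = 3100
    else:
        peak_pvalue = mathlog(p_tmp, 10) * -10

This keeps the conservative fallback but at least makes a genuinely negative value visible instead of silently mapping it to 3100.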