Is an --SPMR normalization in callpeak truly valid to use before
constructing a signal track using bdgcmp?
It appears to be valid for construction of a signal track (a); however, if I understand correctly, such an approach may not be appropriate for (b) peak calling and (c) comparisons between libraries:
(a) with respect to building signal tracks
This approach has been used (with or without modifications) in publications, including this recently published (at the time of this writing) manuscript:
Tata et al., Dev Cell 2018; https://doi.org/10.1016/j.devcel.2018.02.024
From the Methods:
Normalized fold enrichment tracks were generated by using the callpeak
function with the -SPMR flag and a 200bp extension size, then passing
the bedgraph outputs into the bdgcmp function with the setting -m FE
(fold enrichment).
Another publication that used this basic approach (albeit adapted):
Murphy et al., Cell 2018; https://doi.org/10.1016/j.cell.2018.01.022
From the Methods:
Data was then processed using MACS2 for enrichment scoring and peak
calling. Briefly, read count normalization was performed on alignment
files to account for sequencing depth differences, and then background
corrected based on input using the bdgdiff function of Macs2. Finally,
Log10 fold enrichment over input lambda was then calculated for each
nucleotide across the entire genome using the bdgcmp function of
Macs2.
More MACS2 details at the related GEO entries, e.g., https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2494905
Macs2 was then used to generate read count normalized genome wide
pileup tracks and lambda tracks for precipitated samples and input.
(callpeak --nomodel --extsize 150 --SPMR) Pileup tracks were then
Input corrected using the Macs2 subtract function (bdgcmp -m
subtract). Log10 Fold enrichment was then calculated for each sample
comparing input corrected pileups to input lambda. (bdgcmp -m logFE).
Peaks were then called as regions with greater than 3 fold enrichment
over lambda using the Macs2 bdgpeakcall function. (bdgpeakcall -c
0.477 -l 100 -g 100)
More relevant information at this Biostars Q&A entry: macs2 normalization on broadPeak
(b) with respect to peak calling
From the $ macs2 callpeak --help
output (version 2.1.2):
If you plan to use the signal output in bedGraph to call peaks using
bdgcmp and bdgpeakcall, you shouldn't use this option because you will
end up with different results.
(c) with respect to comparisons between libraries
From the following URL: https://github.com/taoliu/MACS/issues/55
If you want to compare different libraries, like in your case, first
choice is that you can try 'bdgcmp' to get some statistic
measurements; or you can ask MACS2 callpeak to generate pileup per
million basepairs for you with --SPMR by downscaling. But if you want
to use the second method, you have to set both replicates the same
fragment size by '--nomodel --extsize XXX' to make the comparison
fair. Also bear in mind that such scaling will affect statistic
significance since 100 vs 50 is more significant than 10 vs 5, so if
library sizes differ a lot, scaling such as SPMR may not be a good
option.
I hope others can add more to this discussion.
More interesting and (somewhat) related links:
In my understanding
--nolambda
in the absence of a control is only valid for broad peaks, so diffuse histone modification ChIPs. If you apply it on sharp peak ChIPs like transcription factors, you deactivate the local background estimation that aims to decide if a peak is true or false based on the signal landscape surrounding the putative peak.Thank you for that sensible explanation.
Hi, so I'm aware a few months have gone by but the Building signal track tutorial still shows --SPMR and there is the inconsistency that you explained above and wasn't commented.
Any thoughts on this? Is the use of --SPMR in that context not correct anymore and the tutorial should be updated?
And what about the use of --nomodel? is it recommended without control?
Thank you for your time