As I already described in your previous post with this issue (peak calling for cancer and healthy data?), I believe the best solution is to use the input sample (GSM947411) and do the peak calling twice. That's what I would do.
EDIT: Just to reiterate this quickly, you need to distinguish between biological control and signal (normal cell line v. cancer cells) from technical control and signal (background noise levels v. supposed signal levels).
I think you are slightly confusing what peak callers like MACS do. In the essence, peak callers are just tools to distinguish between signal and noise regions. The input they take is just a set of x versus y measurements and output x coordinates that have significantly large y values. They have no biological assumptions encoded between themselves, just the assumptions about how the background noise levels are distributed. Since this noise is not uniformly distributed across the x axis (i.e. genome), MACS needs the input sample (which they call 'control') to estimate the per-base-pair noise levels. Once these levels are estimated, one can proceed to look at y coordinates that are significant.
Now your question is: "do the peaks differ between normal and cancer cell lines?". Translated to the x-y analogy above, it would be 'do the x coordinates where y is significantly large for normal cell line and the x coordinates where y is significantly large for cancer cell line differ?'. In this language it is perhaps clearer why MACS need to be run twice, and that it does not answer that question for you*. Hope this helps.
* Well, the standard options don't. As I mention in your previous post (peak calling for cancer and healthy data?) there are bdgcmp and diffpeaks options in MACS that might be helpful.
Just a suggestion: Since the data you are looking at is H3k4me3, one would expect broad peaks. It is good to run macs whenever you are dealing with epigenome data, with
--nomode
l parameter and if you know the fragment size, with--shiftsize
equal to half the fragment size (usually its ~200 for 40-50bp reads).I do not think this is a good advice, especially for beginner. First, H3K4me3 is not really the broadest histone mark there is, I think point peak calling should work. If it doesn't, the new version of MACS2 has a --broad parameter which should combine the nearby point peaks. Saying it is good to run macs [..] with
--nomodel
parameter just like that is quite irresponsible, in my opinion, especially to a beginner.Namely, these parameters are automatically estimated when running MACS in the default mode. If you provide them manually you are basically saying "I believe I can do better than MACS at estimating them". It is a strong assumption. You need to be extremely confident about the inner workings of MACS to do this in my opinion. Particularly, one needs to know is the MACS2 model in the first place, what
--shiftsize
* is for, what is the fragment length, how to figure it out, and once it is known, how to set the appropriate parameters of the software accordingly. Most importantly you would need to justify your choice of parameters, verify that the assumptions you make are actually correct, and be aware how that influences the resulting p-values. I, personally, would not feel too confident about doing this.* Actually,
shiftsize
is now removed from the latest MACS version, in favour of--extsize
which should be set to twice the length of the old shift size (see the--help
).P.S. How does one do inline code blocks like the author of the comment? Backticks don't work :(
The MACS parameters Dr. Liu suggested for H3K4me3 ChIP-Seq as below. I have tried different parameters for our H3K4me3 ChIP-Seq data, and found that the parameters Dr. Liu suggested are best. However, which parameters are better heavily depends on our data quality and data structure. Actually, it's case by case. Please see the detail of this great paper and other related papers.
Feng JX, Liu T, Qin B, Zhang Y, Liu XS. 2012. Identifying ChIP-seq enrichment using MACS. Nature Protocols 7:1728-1740.
The 'case per case' bit is precisely what I meant -- sticking with defaults at first is always a good shout. I believe we might be starting to confuse the OP, but, since this thread is starting to become a complete reference on MACS peak calling I will add that Roadmap Epigenome also use
--nomodel
and manual--ext
size for their data calling. They estimate the fragment length using strand cross correlation (link, section c).I would like to add this reference as well from the same author: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/
Take a look at the section "Critical Parametsrs and troubleshooting", where they explain suggested argumenst for epigenome data. Anyhow, along with
--nomodel
and--shiftsize
, you can also increase the--mfold
from default10,30
to10,100
(or even higher if you are looking for super enhancers using H3K27Ac) since most of the histone modifications are highly accumulated and generally cross the default threshold of 30, which would be missed by MACS for model building.Hi poisonAlien, may I have a question here? Now, I use ROSE (https://bitbucket.org/young_computation/rose) to identify super-enhancers based on H3K27ac or Med1 ChIP-Seq. I don’t know MACS can identify super-enhancers, too. Could you tell me how to manipulate it? Many thanks.
MACS does not identify super enhancers. It just gives you peaks (or just "regular enhancers"). You need to stich those closely spaced peaks and rank them according to their intensity, which is what ROSE does. You first call peaks with MACS, then use the called peaks (e.g: H3K27Ac.bed) as input to rose.
You may want to add
-t 2000
if you want to remove those peaks overlapping with tss. Again I am no expert, but this is what I follow.Thanks. It is my misunderstanding.
Thanks very much for your input to this. I am so glad that this website exist :) at least for beginners like me is a massive help.