Hi, I am quite a novice at NGS analysis. I had a conversation with a colleague about peak calling for ChIP-seq data, and I got confused by several comments she made.
Comment one: "MACS is too old, and no one uses it any more." Is that true? It seems to me that it is still the most widely used, if not the best, peak caller. I still see it being used in the most recent high-profile journals.
Comment two: "Setting the parameters for peak calling is an art." OK, I made that wording up. She said it is essential to tailor the parameters to each dataset. I understand that it is important to adjust the parameters based on whether the enrichment is broad (many histone modifications) or narrow (TFs), but her comments seem to suggest that you can tailor your parameters as much as you want, as long as you apply the same parameters to the whole dataset you are looking at. Are there general rules or principles?
Another question: are there any peak callers optimized for low-cell-number ChIP-seq, say 1×10^5 cells as input for H3K27ac? I applied MACS14 with the same parameters to two samples, one of which had 10 times as many cells (1×10^6). The BED file generated from the higher-cell-number sample could be used for downstream analysis, but the low-cell-number sample caused some issues downstream.
Thank you for any suggestions and comments. (Simple links to relevant literature would be appreciated.)
I would agree with your colleague that peak calling is an "art". It's actually more like witchcraft.
MACS is quite old but, as far as I can tell, none of the newer peak callers are much better. I use the peak callers to start, then I filter through a human eyeball attached to a brain, and I use the lab to verify: I select a range of peaks and run ChIP-qPCR on them until my replicates start failing. That's my personal workflow.
The ENCODE guidelines are a good place to start.
Cell number requirements will vary between different types of samples. For example, you can get away with far fewer cells for TFs (point source) compared to histone mods (broad source).
I would avoid peak callers at every opportunity. Their very purpose in life doesn't make sense to me. So you want to take 2-dimensional data and make it 1-dimensional? OK. Why?
John (I assume your question is rhetorical), are you suggesting using the continuous signal throughout the genome as input for further analysis, like differential binding, motif analysis, etc.? In principle it's not a bad idea, I think, but it's just impractical for most purposes.
Personally I tend to see the peak calling exercise as a first step to trash 99.9% of the genome as "not interesting" and keep the rest (which is still a lot of stuff) as "worth a further look".
Peaks are still 2D, since they can be quantified as enrichment over flanking regions or input, aren't they?
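To make "enrichment over input" concrete, here is a minimal sketch of one way to put a number on a peak. It is only an illustration, not how MACS or any other caller scores peaks: the function name `fold_enrichment`, the pseudocount and the simple library-size scaling are all my own assumptions.

```python
def fold_enrichment(chip_reads, input_reads, chip_total, input_total, pseudocount=1.0):
    """Depth-normalized ChIP/input ratio for a single region.

    chip_reads / input_reads: reads falling in the region (hypothetical counts).
    chip_total / input_total: total mapped reads in each library.
    pseudocount: assumed small constant to avoid dividing by zero.
    """
    if chip_total <= 0 or input_total <= 0:
        raise ValueError("library sizes must be positive")
    # Scale the input count to the sequencing depth of the ChIP library.
    scale = chip_total / input_total
    return (chip_reads + pseudocount) / (input_reads * scale + pseudocount)


# Same depth in both libraries: (99 + 1) / (9 + 1)
print(fold_enrichment(99, 9, 1_000_000, 1_000_000))
# Input sequenced twice as deep: its 18 reads are scaled down to 9 first
print(fold_enrichment(99, 18, 1_000_000, 2_000_000))
```

The same ratio could be computed against flanking regions instead of input by passing the flank counts (scaled to the peak width) in place of the input counts.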
Unfortunately, this time I'm not joking around. Their existence is, in my opinion (and I'm no one special, I don't even have a doctorate), a poor way of dealing with an otherwise hard problem. I feel that they exist only to reduce the complexity of the piled-up signal to a list of "interesting regions", for no reason other than that it makes working with the data more manageable. There is no scientific or mathematical utility in doing this, other than that the existing tools we have at our disposal take regions of the genome as input, not a bigWig/bedGraph (with the assay dynamics encoded in them somehow).
When I first used peak callers 3 years ago, I thought they were silly. A quirk of a new line of research that hasn't settled in yet. I was quite verbose about it on Biostars and elsewhere too, but over time I realised that:
I was wrong about point 2. Peak callers caused researchers to think and reason about ChIP-Seq as a collection of "peaks", which muddled many people's thinking. Are they "broad" peaks or "narrow" peaks? How much should two peaks have to overlap before they're considered overlapping? Should that cut-off be in bases or as a % of the peak width? Should I use peak confidence scores or sum of signal in the peak regions? And the list goes on and on and on..
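As a small illustration of why the "bases or % of peak width" question matters, here is a hedged sketch. The helper names (`overlap_bases`, `overlaps`) and the choice to measure the fraction against the shorter of the two peaks are assumptions made for the example, not a standard.

```python
def overlap_bases(a, b):
    """Bases shared by two half-open intervals given as (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlaps(a, b, min_bases=1, min_fraction=0.0):
    """True if a and b share at least min_bases bases AND at least
    min_fraction of the shorter interval's width (assumed convention)."""
    shared = overlap_bases(a, b)
    shorter = min(a[1] - a[0], b[1] - b[0])
    if shorter <= 0:
        return False  # empty or malformed interval
    return shared >= min_bases and shared / shorter >= min_fraction


narrow = (100, 300)    # a 200 bp "narrow" peak
broad = (250, 5250)    # a 5 kb "broad" peak; the two share 50 bp

print(overlaps(narrow, broad))                    # any shared base counts
print(overlaps(narrow, broad, min_fraction=0.5))  # at least half the shorter peak
print(overlaps(narrow, broad, min_bases=100))     # at least 100 shared bases
```

The same pair of peaks is "overlapping" under the first rule and "not overlapping" under the other two, which is exactly the kind of arbitrary decision the peak abstraction forces on you.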
But I'd be OK with it if people knew what they were doing and were happy with the process - but no, due to all the artificial complexities thrown up by that unnecessary abstraction, we now need an abstraction for our abstraction. Chromatin State callers. We don't even look at regions any more, we look at colours -- often more colours than we had dimensions of data to begin with.
I would second that. Peak calling might not represent the data the way one would ideally envision, but it definitely helps us pinpoint the 5% or 10% of the genome that is interesting, where binding of important TFs or histone marks might give an idea of the downstream biological effects. Obviously motif enrichment or differential binding is one way to proceed while avoiding peak calling, but the motivation lies with the user: if one wants to look for enriched regions around the TSS that might contain motifs, then stricter peaks can be used to model that. It is just a way of quantifying the signal so that unimportant regions can be removed. At the end of the day it depends on the biological question and on the resources available to find enriched TFs or histone modifications across regions of interest in the genome that might drive the phenotype. Integrating those regions with expression data then helps identify their impact on expression in the subset of samples that have both ChIP-seq and RNA-seq. There are also ways of finding promoters and enhancers from ChIP-seq data without calling peaks, rather than relying on distance-based metrics, but those will be of interest only to the specific study concerned.