I have 3 questions about running the ChromHMM tool to find the combinatorial states of multiple ChIPseq data.
I have replicate ChIPseq data(4 replicates each) for H3K4me3 and H3K27me3 in control vs treatment. For ChromHMM, do I need to merge all the replicates and then run it, or I can run the replicates individually?
As the tutorial suggests, one should start with the bed file coming from the original bam alignment file. But has anyone tried it by using Macs2 peak files? I mean, using MACS2 first to call reliable peaks with a cutoff, and then using ChromHMM to call the combinatorial chromatin states?
Also, as I have two groups, control vs treatment. In each group, I have both H3K4me3 and H3K27me3 ChIPseq data(with 4 replicates). Now, to define bivalent chromatin states, do I give all the control and treatment H3K4me3 and H3K27me3 data in ChromHMM to learn the model or only the control data?
One of the first steps of ChromHMM is peak calling. Therefore, you need to use the bed files from bam files and not peak files. Although the default settings (i.e. ) would give you less stringent calls (argument: -p or poissonthreshold in BinarizeBed function). You may change it 1e-5 or less. If you are interested in using MACS peak calls then you may use the argument -peaks within the BinarizeBed function. However, the latter is not recommended for broad peaks marks such as H3K27me3 and H3K9me3.
Yes, you have to give control files for all the treatment file. It is again required for calling the peaks.
Thank you very much for your quick answer! It makes sense. However, regarding answer 3, the control file here means, the sample which did not receive the drugs. It is not like, the input control or IgG control for ChIP. Then do you think I should still include them for learning the model? Or, learn the model with only the control group > get the emission states containing both H3K4me3 and H3K27me3, then use those regions(supposedly bivalent regions) to do differential analysis in treatment vs control?
Or do you have any other idea of getting bivalent promoter regions from H3K4me3 and H3K27me3 data? I was thinking at first to just use "bedtools" to intersect the MACS2 peak calling files to get overlapping regions containing H3K4me3 and H3K27me3 and use those genomic regions for differential analysis in my drug treatment vs control samples. Does it make sense?
Thanks a lot for taking the time to help me! Really appreciate it!
It is not like, the input control or IgG control for ChIP. Then do you think I should still include them for learning the model?
You can include them but a lot of your called by ChromHMM or MACS would have false positives. Ideally, it should be a whole genome input control or IgG-treated file.
Or, learn the model with only the control group > get the emission states containing both H3K4me3 and H3K27me3, then use those regions(supposedly bivalent regions) to do differential analysis in treatment vs control?
No need to have 2 different models. Use only one model having all the files. Otherwise, it would be tough to infer anything using both the models.
If you have only two marks and are interested in bivalent regions then I would not recommend you to run ChromHMM. It is useful when you have more than 3 marks. For bivalent regions, call MACS using each mark and then do the intersection of the peaks using bedtools or bedops. If you want to do differential analysis, then use DIffBind.
Then ChromHMM will do the equivalent of combining all the bam files together. It doesn't do any sort of checking of agreement between replicates, it just dumps everything together. Theoretically the peaks that are only in one or two replicates will then become very small, but ChromHMM's binarization is pretty sensitive. This is definitely closer to getting all the regions that are enriched in any replicate than it is to getting only the regions that are enriched in all the replicates. You could instead do this:
And this would give you a separate chromatin segmentation for each replicate. You could then do some type of checking of agreement between replicates. Or, you could do peak calling and some type of peak overlapping, then use those peaks as input to BinarizeBed with the -peaks parameter. It really depends on if you prefer to capture all the regulatory elements that are possibly there at the expense of false positives (first method) or try to limit the false positives but risk missing some (second or third method).
Personally I like the idea of using peak calls as the input to ChromHMM, but have yet to see a paper that did it this way, so I'm hesitant to do it in my own analysis.
Thank you very much for your quick answer! It makes sense. However, regarding answer 3, the control file here means, the sample which did not receive the drugs. It is not like, the input control or IgG control for ChIP. Then do you think I should still include them for learning the model? Or, learn the model with only the control group > get the emission states containing both H3K4me3 and H3K27me3, then use those regions(supposedly bivalent regions) to do differential analysis in treatment vs control?
Or do you have any other idea of getting bivalent promoter regions from H3K4me3 and H3K27me3 data? I was thinking at first to just use "bedtools" to intersect the MACS2 peak calling files to get overlapping regions containing H3K4me3 and H3K27me3 and use those genomic regions for differential analysis in my drug treatment vs control samples. Does it make sense?
Thanks a lot for taking the time to help me! Really appreciate it!
It is not like, the input control or IgG control for ChIP. Then do you think I should still include them for learning the model?
You can include them but a lot of your called by ChromHMM or MACS would have false positives. Ideally, it should be a whole genome input control or IgG-treated file.Or, learn the model with only the control group > get the emission states containing both H3K4me3 and H3K27me3, then use those regions(supposedly bivalent regions) to do differential analysis in treatment vs control?
No need to have 2 different models. Use only one model having all the files. Otherwise, it would be tough to infer anything using both the models.If you have only two marks and are interested in bivalent regions then I would not recommend you to run ChromHMM. It is useful when you have more than 3 marks. For bivalent regions, call MACS using each mark and then do the intersection of the peaks using bedtools or bedops. If you want to do differential analysis, then use DIffBind.
That's a very good advice! Thank you very much for your suggestion!