Hi guys, When reading through many papers, I see many people people mentioning features like H3K4me1+ (or hight), H3K4me3-,.... etc, generally they don't say how they did it automatically to find these regions.
To make the picture clear, my situation is as follow:
- I have some Chip-Seq data (One sample per mark for a certain cell line)
- let's say for the H3K4me1 sample I already did the peak calling and I have the list of peaks and the signal intensity in each peak
- I want to find which peaks are highly enriched (H3K4me1+) and which peaks a lowly enriched (H3K4me1-)
- A threshold need to be defined to do that
I found this paper (Combinatorial patterns of histone acetylations and methylations in the human genome) that described a method and I don't know if I got their meaning right. I hope through this post to get you kind clarifications :) The authors say:
The modification on a promoter under consideration was deemed significant when the tag count was higher than a threshold, which was determined by a P value taken from a background model of Poisson distribution parameterized by the genome-wide tag density
does it mean to just do :
lambda = mean(tags)
Then if region X has k tag, I will just do:
pval = Poisson( X>= k; lambda)
and consider the one with significant p-values as enriched.
Or it is not the case here?
Edit: An additional question in the context of this question. For the tag enrichment plot, is it preferable to calculate it using the final BED file or using a BigWig file? when using histone marks called reads, you get large regions with one value example
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr18 [8118742, 8128768] * | 7.1
so when I do the mean of the tag count it sounds more over estimated.
Thanks in advance.
Shall we consider the length of the region? I guess the P-value should be calculated as:
Pval = Possion (X>=k; lambda*region_length)
Yeah,
A more elaborated model is in the jmosaics model
http://www.genomebiology.com/2013/14/4/R38