I have a set of genomic positions and I want to check whether they are enriched in TF binding sites or enhancers. My idea was to use ENCODE data and run a simulation like this:
If I have N positions in my input list -> pick N random genomic positions and compute a score by checking them against the ENCODE data
Do that 100000 times and compute a p-value = #(random score >= real score) / 100000, i.e. divide by the number of simulations rather than by N
Be sure to use only areas with similar mappability and possibly GC characteristics (there could be a bias there).
That's the normal way to do it.
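For what it's worth, here is a minimal Python sketch of that simulation, assuming a toy single-chromosome genome and a score defined as "number of positions that fall inside an annotated region". The function names (`count_hits`, `empirical_p_value`), the half-open interval convention, and the uniform random draw are all assumptions made for illustration. A real analysis would draw from a mappability- and GC-matched background rather than uniformly.

```python
import bisect
import random


def count_hits(positions, starts, ends):
    """Count positions that fall inside any half-open interval [start, end).

    `starts` and `ends` describe sorted, non-overlapping intervals.
    """
    hits = 0
    for p in positions:
        # Index of the last interval whose start is <= p.
        i = bisect.bisect_right(starts, p) - 1
        if i >= 0 and p < ends[i]:
            hits += 1
    return hits


def empirical_p_value(positions, starts, ends, genome_length,
                      n_sim=1000, seed=0):
    """Return (real score, fraction of simulations scoring >= real score)."""
    rng = random.Random(seed)
    real = count_hits(positions, starts, ends)
    n_ge = 0
    for _ in range(n_sim):
        # Uniform draw: a placeholder for a properly matched background.
        rand = [rng.randrange(genome_length) for _ in positions]
        if count_hits(rand, starts, ends) >= real:
            n_ge += 1
    # Divide by the number of simulations, not by the number of positions.
    # A common variant uses (n_ge + 1) / (n_sim + 1) to avoid p = 0.
    return real, n_ge / n_sim


# Toy example: two annotated regions covering 2% of a 10 kb "genome".
starts, ends = [100, 500], [200, 600]
real, p = empirical_p_value([110, 120, 130, 550, 560], starts, ends, 10_000)
```

With real data the intervals would come from the ENCODE peak files for the relevant cell type, and the same loop would be run per chromosome or on concatenated coordinates.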
This is how I'd approach the problem as well. One thing to keep in mind is that enhancers and promoters can be cell-type specific. In an ideal world, you'd look only in your cell type of interest. Also, do make sure that the regions show evolutionary conservation (i.e., try not to make the same mistakes as ENCODE). A final thing to think about is whether it makes sense to look at TF binding sites in toto or just at individual transcription factors. I say this because the meaningfulness of the result depends on whether a slight overall enrichment comes from strong enrichment of one or two TF binding motifs or from a generic mild enrichment of everything (i.e., a slight but significant global enrichment is likely due simply to an uncompensated bias).
Thanks Devon. Yes of course I'll take the right cell type. When you say areas with similar mappability, what do you mean?
Just that you don't want to randomly draw from an area with a bunch of N's, or from one where TF binding sites would never have been called to begin with (these would likely include repeat regions, though perhaps not).
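As a rough illustration of that filtering, here is a small Python sketch that rejects random draws whose surrounding window contains N's or whose GC fraction is too far from a target value. Everything here is an assumption for the sake of the example: the helper names (`gc_fraction`, `is_usable`, `draw_background_position`), the window size, and the tolerance. N content is only a crude stand-in for mappability, and a real analysis would more likely consult a precomputed mappability track.

```python
import random


def gc_fraction(seq):
    """GC fraction among A/C/G/T bases, or None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def is_usable(seq, max_n_fraction=0.0):
    """True if the window is non-empty and has few enough N's."""
    return len(seq) > 0 and seq.upper().count("N") / len(seq) <= max_n_fraction


def draw_background_position(genome, target_gc, flank=50, gc_tol=0.1,
                             rng=None, max_tries=10_000):
    """Draw one random position whose window passes the N and GC filters."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        pos = rng.randrange(flank, len(genome) - flank)
        window = genome[pos - flank:pos + flank + 1]
        if not is_usable(window):
            continue  # window overlaps an N stretch: redraw
        gc = gc_fraction(window)
        if gc is not None and abs(gc - target_gc) <= gc_tol:
            return pos
    raise RuntimeError("no acceptable background position found")


# Toy genome: 300 bp of sequence flanked by unsequenced N stretches.
genome = "N" * 200 + "ACGT" * 75 + "N" * 200
pos = draw_background_position(genome, target_gc=0.5, rng=random.Random(1))
```

In practice `target_gc` would be the GC fraction around the matching input position, so each random position mirrors one real one.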
OK, thanks, that's a good idea.