Hi,
How could I calculate whether the co-occurrence of two TFBSs is higher than one would expect by chance? And this in either all promoters (1000 bp) or even in the complete genome, within 1000 bp of one another. I thought about this statistical problem but I am not an expert in probability ... and got stuck.
So my first thought was to calculate the random chance that two k-mers of length n and m co-occur within 1000 bp. But since I am not an expert in probability I am not sure how to calculate this. Any suggestions here?
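As a rough starting point for the k-mer question, here is a minimal sketch. It assumes a background where all four bases are equally likely and independent, treats every start position as independent (so it ignores self-overlapping k-mers), and looks at one strand only. The function names are made up for illustration.

```python
def p_kmer_in_window(k: int, window: int = 1000) -> float:
    """Approximate chance that one fixed k-mer occurs at least once in a window.

    Assumes uniform, independent bases (p = 0.25 each) and independent
    start positions; this is an approximation, not an exact result.
    """
    starts = window - k + 1          # number of possible start positions
    if starts <= 0:
        return 0.0                   # the k-mer does not fit in the window
    p_site = 0.25 ** k               # chance of a match at one position
    return 1.0 - (1.0 - p_site) ** starts


def p_both_in_window(n: int, m: int, window: int = 1000) -> float:
    """Chance that both k-mers occur in the same window, assuming independence."""
    return p_kmer_in_window(n, window) * p_kmer_in_window(m, window)


# Hypothetical example: a 6-mer and an 8-mer in a 1000 bp promoter.
print(p_kmer_in_window(6), p_kmer_in_window(8), p_both_in_window(6, 8))
```

Real genomes are far from uniform (GC content, repeats, CpG islands), so this is only a baseline to compare against, not a replacement for an empirical background.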
Then I thought that the random chance might be unsuitable, as motifs are not simple k-mers but letter-probability matrices (MEME format), so randomising the matrices might be a better idea. Maybe the best way would be to randomly shuffle the values within each row? Would this be an acceptable approach? Is there a tool for something like this?
It gets even more complicated, as for one TF I have three matrices which are quite similar. I am not sure how to handle this. Any suggestions here?
Any insight is HIGHLY appreciated! Thank you :)
I think the hypergeometric distribution is correct. However, I think you want to set up a 2x2 table based on presence/absence of the TFs. Something like:
a: number of promoters with both TFs
b: number of promoters containing a TF1 BS
c: number of promoters with TF1 or TF2 present
d: number of promoters containing TF2
See this CV post for a little more detail: https://stats.stackexchange.com/questions/10328/using-rs-phyper-to-get-the-probability-of-list-overlap
You could apply this to promoters and then to the 1kb bins of the entire genome.
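For what it's worth, the four counts listed above can be read straight off two sets of promoter IDs. A minimal sketch, assuming you already have one list of promoter/gene IDs per TF from a motif scan; the function name and the gene IDs are invented for illustration.

```python
def overlap_counts(tf1_promoters, tf2_promoters):
    """Return (a, b, c, d) as defined in the comment above.

    a: promoters with both TFs
    b: promoters containing a TF1 binding site
    c: promoters with TF1 or TF2 present
    d: promoters containing a TF2 binding site
    """
    tf1 = set(tf1_promoters)
    tf2 = set(tf2_promoters)
    a = len(tf1 & tf2)   # intersection: both TFs
    b = len(tf1)
    c = len(tf1 | tf2)   # union: at least one of the two TFs
    d = len(tf2)
    return a, b, c, d


# Hypothetical IDs, only to show the shape of the input.
print(overlap_counts(["g1", "g2", "g3"], ["g2", "g3", "g4", "g5"]))
```

Using sets means a promoter with several sites for the same TF is still counted once, which matches a presence/absence table.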
Thank you Alex Reynolds, genome vs. promoter region is also an interesting way of looking at it. I will do that too.
And also ejm32 - thank you for the link to the post. Based on this link, I searched some more and I found this explanation, which also describes it very well! Thanks.
The only thing I am still slightly unsure about is the genes without any TFBS in their promoter. Are these a separate class (the hypergeometric only applies to two) or are they considered "not picked"? You wrote
c: number of promoters with TF1 or TF2 present
so do I have to exclude the genes without any TFBS? I would have considered these genes as part of the domain and TF1 & TF2 as the subsets. Hence, I would have put:
a: (number of promoters with both TFs - 1)
b: number of promoters containing a TF1 BS
c: total number of genes - number of promoters containing a TF1 BS
d: number of promoters containing TF2
You can put genes without TFBS1 or TFBS2 into the conditional table. My concern is that I feel it makes small overlaps significant. Furthermore, if we go back to the balls-in-an-urn analogy: the white balls = TF1 and the black balls = TF2. My advice: do it both ways. I'll read your linked post and see if it sways me one way or the other.
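To make the "do it both ways" suggestion concrete, here is a minimal sketch of the upper-tail hypergeometric probability using only the Python standard library. It computes P(X >= overlap), which is what R's `phyper(overlap - 1, n_tf1, universe - n_tf1, n_tf2, lower.tail = FALSE)` returns; that is where the "- 1" in the table above comes from. The function name and all counts are made up for illustration.

```python
from math import comb


def hypergeom_upper_tail(overlap: int, n_tf1: int, n_tf2: int, universe: int) -> float:
    """P(X >= overlap) when n_tf2 promoters are drawn without replacement
    from `universe` promoters, of which n_tf1 contain a TF1 site."""
    total = comb(universe, n_tf2)
    largest = min(n_tf1, n_tf2)      # the overlap cannot exceed either set
    tail = sum(
        comb(n_tf1, x) * comb(universe - n_tf1, n_tf2 - x)
        for x in range(overlap, largest + 1)
    )
    return tail / total


# Hypothetical counts: 200 promoters with TF1, 300 with TF2, 30 with both.
overlap, n_tf1, n_tf2 = 30, 200, 300

# Way 1: the universe is all genes (assumed here to be 20000).
p_all = hypergeom_upper_tail(overlap, n_tf1, n_tf2, universe=20000)

# Way 2: the universe is only promoters with TF1 or TF2 present.
n_union = n_tf1 + n_tf2 - overlap
p_union = hypergeom_upper_tail(overlap, n_tf1, n_tf2, universe=n_union)

print(p_all, p_union)
```

With these invented numbers the two universes give very different answers, which illustrates the concern about the choice of background. Note that if the universe is taken literally as exactly the promoters with TF1 or TF2, the observed overlap is the smallest one that universe allows, so the upper-tail probability comes out as 1.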
Thank you very much for your input!!!
The idea with comparing genome vs promoter is that, in the ball-urn metaphor, the ball you're interested in is an event of co-occurrence, while the ball you're not interested in is an event of non-co-occurrence. The urn or "background" is the genome, from which you are sampling without replacement (promoters). Comparing frequencies of one TF versus a second TF seems to be asking a different question from co-occurrence, I think, but definitely tune this to what you think is appropriate, of course.
In the promoter vs. non-promoter comparison you are not testing whether the TFs co-occur more frequently than expected. You are testing whether the two TFs co-occur more frequently in promoters than in the rest of the genome. To me this assumes that the co-occurrence of TF1/TF2 has already been established, which has not been done yet. I think once an association is found, OP can ask other questions related to where the co-occurrence is most enriched. An even better comparison might be to use DHS peaks/hotspots instead of promoters, or to compare promoters vs. all other DHS (enhancers/cis-regulatory elements), or DHS vs. non-DHS.
Yes, of course these are two completely different questions, both interesting.
A third question I am interested in is whether co-occurrence within a certain distance is higher than chance. So let's say within 100 bp of one another in all promoters (or even the genome). These are then a subset of the co-occurring ones. TFBSs close to each other might indicate that the TFs that bind to them are more likely to be synergistic. Any ideas how to handle this statistically? Thanks again for your input!
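One possible way to approach the distance question is a permutation test, sketched below. This is not something proposed in the thread; it is a swapped-in technique under a strong assumption: the null model places TF2 sites uniformly at random within each promoter, which ignores sequence composition and motif width. The input format (one pair of position lists per promoter) and the function names are hypothetical.

```python
import random


def n_close(promoters, max_dist=100):
    """Count promoters with at least one TF1/TF2 site pair within max_dist bp.

    `promoters` is a list of (tf1_positions, tf2_positions) pairs.
    """
    return sum(
        any(abs(p1 - p2) <= max_dist for p1 in tf1 for p2 in tf2)
        for tf1, tf2 in promoters
    )


def proximity_pvalue(promoters, length=1000, max_dist=100, n_perm=1000, seed=0):
    """Empirical p-value for seeing at least the observed number of close pairs."""
    rng = random.Random(seed)        # fixed seed so the result is reproducible
    observed = n_close(promoters, max_dist)
    hits = 0
    for _ in range(n_perm):
        # Keep TF1 sites fixed; redraw each TF2 site uniformly in the promoter.
        shuffled = [
            (tf1, [rng.randrange(length) for _ in tf2]) for tf1, tf2 in promoters
        ]
        if n_close(shuffled, max_dist) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical site positions (bp from the promoter start) in three promoters.
example = [([100], [120]), ([500], [900]), ([250, 700], [760])]
print(proximity_pvalue(example))
```

Restricting the input to promoters that already contain both TFs keeps this separate from the co-occurrence question itself.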
On another note, would overlapping bins screw up the statistics?
Overlapping bins would complicate things... I think. I think overlapping bins would violate the independent draws assumption.
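A simple way to sidestep the issue is to assign each site to exactly one fixed, non-overlapping bin, so that no site is counted in two draws. A minimal sketch, assuming site start coordinates on a single chromosome; the function name and coordinates are invented.

```python
def occupied_bins(positions, bin_size=1000):
    """Return the set of non-overlapping bin indices that contain a site.

    Bin i covers coordinates [i * bin_size, (i + 1) * bin_size).
    """
    return {pos // bin_size for pos in positions}


# Hypothetical site coordinates for two TFs on one chromosome.
tf1_bins = occupied_bins([5, 999, 1000, 2500])
tf2_bins = occupied_bins([40, 2100, 7300])

both = tf1_bins & tf2_bins           # bins containing sites for both TFs
print(sorted(tf1_bins), sorted(tf2_bins), sorted(both))
```

The trade-off is that two sites sitting close together on either side of a bin boundary end up in different bins and are not counted as co-occurring.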
Thanks, I think so too...