Check my work? Hypergeometric test for enrichment between iCLIP sites across experiments
8.1 years ago

Hi, I was hoping for some feedback on a test I am trying to perform to check for enrichment at specific sites throughout the genome. My statistics background isn't great.


Here is the current setup:

Given: genomic coordinates of iCLIP binding sites (single-nucleotide positions, each corresponding to the site of a crosslink) from two different proteins, Sample A and Sample B.

Goal: the researcher wants to put a p-value on whether more Sample A positions lie near Sample B positions than you would expect to observe by chance. 60-nt bins were chosen for a biological reason related to the protein from Sample B.


Setting up the test:

Step 1: split the genome into 60-nt bins (each strand separately) and count the total number of bins --> total number of balls in the urn

Step 2: count the number of bins containing one or more Sample A positions --> total number of white balls in the urn

Step 3: count the number of bins containing one or more Sample B positions --> total number of balls drawn without replacement from the urn

Step 4: count the number of bins containing at least one Sample A position and at least one Sample B position --> total number of white balls drawn without replacement from the urn
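The four counts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sites are given as (chrom, strand, pos) tuples with 0-based positions, and the function names, chromosome length and toy coordinates are made up for the example, not taken from the post.

```python
BIN_SIZE = 60  # nt, the bin width chosen in the post


def bin_keys(sites, bin_size=BIN_SIZE):
    """Map (chrom, strand, pos) sites to the set of bins they fall into."""
    return {(chrom, strand, pos // bin_size) for chrom, strand, pos in sites}


def total_bins(chrom_lengths, bin_size=BIN_SIZE):
    """Step 1: number of bins over all chromosomes, each strand counted separately."""
    # -(-n // d) is ceiling division, so a partial last bin is counted too.
    return sum(2 * -(-length // bin_size) for length in chrom_lengths.values())


def urn_counts(sites_a, sites_b, chrom_lengths, bin_size=BIN_SIZE):
    """Return the four urn counts from steps 1-4."""
    bins_a = bin_keys(sites_a, bin_size)
    bins_b = bin_keys(sites_b, bin_size)
    return {
        "total": total_bins(chrom_lengths, bin_size),  # step 1: balls in the urn
        "white": len(bins_a),                          # step 2: white balls
        "drawn": len(bins_b),                          # step 3: balls drawn
        "white_drawn": len(bins_a & bins_b),           # step 4: white balls drawn
    }


# Toy data (hypothetical): one 600-nt chromosome, a handful of crosslink sites.
chrom_lengths = {"chr1": 600}
sites_a = [("chr1", "+", 5), ("chr1", "+", 59), ("chr1", "+", 130), ("chr1", "-", 130)]
sites_b = [("chr1", "+", 30), ("chr1", "+", 300), ("chr1", "-", 125)]
counts = urn_counts(sites_a, sites_b, chrom_lengths)
print(counts)
```

Using sets means a bin is counted once no matter how many sites fall into it, which matches the "one or more positions" wording of the steps.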


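Does this seem like an acceptable test to do? Or is there a better test for this kind of scenario?

If anyone is interested, this is how the p-value is generated in R for the test: phyper(q = step 4 - 1, m = step 2, n = step 1 - step 2, k = step 3, lower.tail = FALSE). With lower.tail = FALSE, phyper returns P(X > q), so 1 is subtracted from the step 4 count to get P(X >= step 4), i.e. the probability of an overlap at least as large as the one observed.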
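To make explicit which outcomes enter the p-value, the upper tail of the hypergeometric distribution can be written out directly from binomial coefficients. The sketch below is for illustration with small numbers only (the function name and toy counts are assumptions, and exact binomials become slow at genome scale); it sums from the observed overlap upward, i.e. P(X >= observed).

```python
from math import comb


def hypergeom_upper_tail(observed, total, white, drawn):
    """P(X >= observed) when `drawn` balls are taken without replacement
    from an urn of `total` balls, `white` of which are white."""
    lowest = max(observed, 0, drawn - (total - white))  # smallest feasible count
    highest = min(white, drawn)                         # largest feasible count
    numerator = sum(
        comb(white, k) * comb(total - white, drawn - k)
        for k in range(lowest, highest + 1)
    )
    return numerator / comb(total, drawn)


# Toy counts (hypothetical): 20 bins, 3 with Sample A, 3 with Sample B, 2 shared.
p_value = hypergeom_upper_tail(observed=2, total=20, white=3, drawn=3)
print(p_value)
```

Whether the observed count itself is included in the tail matters most when the overlap is small, so it is worth checking any library call against a hand-computed toy case like this one.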

hypergeometric enrichment • 2.2k views

Also have a look at the GenometriCorr R package.

8.1 years ago
fanli.gcb ▴ 730

That seems like a fairly reasonable approach. This link has some other methods too: Association between bed files - statistical significance

Personally, I'd do a quick permutation test just to reassure myself (and to have a visual representation for a collaborator/PI). For example, count the number of overlapping bins in each of a million permutations and see where your true number of overlaps falls in that null distribution.
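A bin-level version of such a permutation test might look like the sketch below. It assumes every bin is equally likely to be picked, which is the same null as the hypergeometric test, so the two p-values should roughly agree. The function name, the default of 10,000 permutations (rather than a million) and the toy counts are assumptions made for the example.

```python
import random


def permutation_pvalue(total, white, drawn, observed, n_perm=10_000, seed=0):
    """Empirical P(overlap >= observed) from random draws of `drawn` bins.

    Bins are labelled 0..total-1 and, without loss of generality, the first
    `white` labels are treated as the Sample A bins.
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        picked = rng.sample(range(total), drawn)       # random Sample B bins
        null.append(sum(1 for b in picked if b < white))
    exceed = sum(1 for overlap in null if overlap >= observed)
    # +1 in numerator and denominator so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1), null


# Toy counts (hypothetical): 20 bins, 3 with Sample A, 3 with Sample B, 2 shared.
p_emp, null_overlaps = permutation_pvalue(20, 3, 3, observed=2, n_perm=2000, seed=1)
print(p_emp)
```

The returned `null_overlaps` list can be plotted as a histogram with the observed overlap marked on it, which gives the visual representation mentioned above.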

8.1 years ago
michael.ante ★ 3.9k

Hi benformatics,

I also like your approach. I would only reduce the number in step 1, e.g. by using only annotated binding sites or the UTR regions. If there is a strong statistical dependency between the two proteins, you need to model that. In that case you can use a Monte Carlo sampling approach like the one in http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012432 to compute empirical p-values.
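As a rough illustration of combining both suggestions (a restricted universe and an empirical p-value), the sketch below redraws Sample B positions uniformly inside a set of allowed regions and recounts the shared bins each time. It is a generic Monte Carlo sketch, not the procedure of the linked paper; the region format (chrom, strand, start, end) with 0-based half-open coordinates, the function names and the toy data are all assumptions.

```python
import random


def to_bins(sites, bin_size):
    """Set of (chrom, strand, bin_index) bins hit by the given sites."""
    return {(chrom, strand, pos // bin_size) for chrom, strand, pos in sites}


def sample_position(regions, weights, rng):
    """Draw one position uniformly over the total length of the allowed regions."""
    chrom, strand, start, end = rng.choices(regions, weights=weights, k=1)[0]
    return (chrom, strand, rng.randrange(start, end))


def empirical_pvalue(sites_a, sites_b, regions, bin_size=60, n_iter=1000, seed=0):
    """Observed shared-bin count and the Monte Carlo estimate of P(count >= observed)."""
    rng = random.Random(seed)
    bins_a = to_bins(sites_a, bin_size)
    observed = len(bins_a & to_bins(sites_b, bin_size))
    weights = [end - start for _, _, start, end in regions]
    hits = 0
    for _ in range(n_iter):
        fake_b = [sample_position(regions, weights, rng) for _ in sites_b]
        if len(bins_a & to_bins(fake_b, bin_size)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_iter + 1)


# Toy data (hypothetical): a single 600-nt region standing in for annotated UTRs.
regions = [("chr1", "+", 0, 600)]
sites_a = [("chr1", "+", 5), ("chr1", "+", 130)]
sites_b = [("chr1", "+", 30), ("chr1", "+", 125)]
observed, p_mc = empirical_pvalue(sites_a, sites_b, regions, n_iter=1000, seed=0)
print(observed, p_mc)
```

Sampling uniformly within regions still treats the two proteins as independent under the null; modelling a known dependency would require a different sampling scheme.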

Cheers, Michael
