Hi, I was hoping for some more feedback on a test I am trying to perform to check for enrichment at specific sites throughout the genome. My statistics background isn't great.
Here is the current setup:
Given: genomic coordinates of iCLIP binding sites (single-nucleotide positions, each corresponding to the site of a crosslink) from two different proteins, Sample A and Sample B.
Goal: the researcher wants to put a p-value on whether more Sample A positions lie near Sample B positions than you would expect to observe by chance. 60-nt bins were chosen for a biological reason related to the protein from Sample B.
Setting up the test:
Step 1: split the genome into 60-nt bins (do both strands separately) and count the total number of bins --> total number of balls in the urn
Step 2: count the number of bins overlapping with one or more Sample A positions --> total number of white balls in the urn
Step 3: count the number of bins overlapping with one or more Sample B positions --> total number of balls drawn without replacement from the urn
Step 4: count the number of bins overlapping with at least one Sample A position and at least one Sample B position --> total number of white balls drawn without replacement from the urn
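To make the four counts concrete, here is a minimal base-R sketch of Steps 1 to 4 on toy data. The chromosome lengths, the positions, and the names `chrom_len`, `sample_a`, `sample_b` and `bin_ids` are all made up for illustration; it also assumes 1-based coordinates and that a partial bin at the end of a chromosome counts as a bin.

```r
# Toy inputs (hypothetical): chromosome lengths and crosslink positions.
chrom_len <- c(chr1 = 600, chr2 = 300)
sample_a <- data.frame(chr    = c("chr1", "chr1", "chr2", "chr1"),
                       pos    = c(5, 50, 130, 200),
                       strand = c("+", "+", "-", "+"))
sample_b <- data.frame(chr    = c("chr1", "chr2", "chr2"),
                       pos    = c(10, 125, 290),
                       strand = c("+", "-", "+"))
bin_size <- 60

# Label each position with the strand-specific bin it falls into
# (1-based coordinates assumed) and keep each occupied bin once.
bin_ids <- function(x, bin_size) {
  unique(paste(x$chr, x$strand, (x$pos - 1) %/% bin_size, sep = ":"))
}

bins_a <- bin_ids(sample_a, bin_size)
bins_b <- bin_ids(sample_b, bin_size)

step1 <- 2 * sum(ceiling(chrom_len / bin_size))  # all bins, both strands
step2 <- length(bins_a)                          # bins with >= 1 Sample A position
step3 <- length(bins_b)                          # bins with >= 1 Sample B position
step4 <- length(intersect(bins_a, bins_b))       # bins with both

c(step1 = step1, step2 = step2, step3 = step3, step4 = step4)
```

For real data the same counts could presumably be obtained with a genomic-ranges package, but the logic would stay the same: each bin is counted once, no matter how many positions fall inside it.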
Does this seem like an acceptable test to do? Or is there a better test for this kind of scenario?
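If anyone is interested, this is how the p-value is generated in R for the test: phyper(q = step4 - 1, m = step2, n = step1 - step2, k = step3, lower.tail = FALSE). Note that q is step4 - 1 because phyper(q, ...) returns P(X <= q), so 1 - phyper(step4, ...) would give P(X > step4) and leave out the observed overlap itself.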
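As a sanity check on the tail, here is a small sketch that wraps the calculation in a hypothetical helper (`overlap_pvalue` is not from any package) and compares it with a one-sided Fisher's exact test on the equivalent 2x2 table. The counts are toy values chosen only for illustration. One detail worth checking: `phyper(q, ...)` returns P(X <= q), so the upper tail that includes the observed overlap uses `q - 1`.

```r
# Upper-tail hypergeometric p-value, P(X >= step4).
overlap_pvalue <- function(step1, step2, step3, step4) {
  phyper(q = step4 - 1, m = step2, n = step1 - step2, k = step3,
         lower.tail = FALSE)
}

# Toy counts (hypothetical): 30 bins, 3 with A, 3 with B, 2 with both.
step1 <- 30; step2 <- 3; step3 <- 3; step4 <- 2
p_hyper <- overlap_pvalue(step1, step2, step3, step4)

# Same test written as a 2x2 table: rows = A present/absent,
# columns = B present/absent.
tab <- matrix(c(step4,         step2 - step4,
                step3 - step4, step1 - step2 - step3 + step4),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

all.equal(p_hyper, p_fisher)
```

The two p-values agree because the one-sided Fisher's exact test is the same hypergeometric tail; the table form can be convenient if an odds ratio is also wanted.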
Also have a look at the GenometriCorr R package.