Question

P Value Or Statistical Significance Of Real Peak Compared To Random Peak Overlaps

16

11.7 years ago

biorepine ★ 1.5k

Dear Biostars,

This might be one of the most obvious statistics-related questions in high-throughput sequencing data analysis: how can one calculate the enrichment of real versus random region/peak overlaps?

For example: is the overlap between sox2 peaks and oct4 peaks statistically significant or not?

My total no. of sox2 peaks = 4000
The no. of sox2 peaks that overlap oct4 = 2500
The no. of random sox2 peaks that overlap oct4 = 20

I agree that the above example doesn't even need a statistical test to confirm the enrichment of 2500 over 20. But how can one show the significance of this enrichment statistically, as a p value?

I was doing something like this. Do you think it is correct? If not, could you please suggest a better way? Many thanks in advance!

= log(((no. of sox2 peaks that overlap oct4 - no. of random sox2 peaks that overlap oct4) / total no. of sox2 peaks) * 100)
= log(((2500 - 20) / 4000) * 100)
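
For reference, the formula above can be evaluated directly. This is only a minimal sketch: it assumes the natural logarithm (the post does not state the base), and the variable names are illustrative. The result is a log-scaled percentage, closer to an effect size than to a probability, so it cannot be read as a p value.

```python
import math

total_sox2 = 4000      # total sox2 peaks
real_overlap = 2500    # sox2 peaks that overlap oct4
random_overlap = 20    # random sox2 peaks that overlap oct4

# Excess overlap over the random background, as a percentage of all sox2 peaks.
excess_percent = (real_overlap - random_overlap) / total_sox2 * 100

# Natural log is an assumption; the post does not say which base was used.
score = math.log(excess_percent)

print(f"excess overlap: {excess_percent:.1f}%")
print(f"log score: {score:.3f}")
```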

chip-seq • 7.7k views

ADD COMMENT • link updated 8.0 years ago by i.sudbery 20k • written 11.7 years ago by biorepine ★ 1.5k

0

Look at the KS test: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

ADD REPLY • link 11.7 years ago by Gjain 5.8k

Istvan Albert · Answer 1 · 2013-03-04

This is a very important question! One that I do not think has been satisfactorily solved yet!

It has been asked before on Biostars as "A: Annotating chip seq: how to get enrichment over random background" and "A: How do you calculate if two sets of genomic regions overlap significantly?".

I am still interested in the results of the Genomic Hyperbrowser. But it is not a trivial exercise to determine what the best null model is.

I know the following does not address the statistical analysis, but I think it is important nonetheless:

One of the most important aspects of your question is where the random sequences are coming from. I don't think you stated the origin of yours. I am currently favouring the use of bedtools shuffle, which will take your genome coordinates and shuffle them within the same chromosome (or not, if you choose) and exclude them from undesirable regions. By undesirable I mean regions of the genome that cannot be sequenced (mappability) or do not contain good sequence (gaps), both of which I obtain from the UCSC Browser.
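
To make the shuffling idea concrete, here is a rough sketch of how a shuffled background can be turned into an empirical p value. It does not call bedtools. It imitates the within-chromosome shuffle on one toy chromosome, ignores excluded regions (gaps, unmappable sequence), and all function names and coordinates are made up for illustration.

```python
import random

def count_overlaps(query, target):
    """Count query intervals (start, end) that overlap at least one target interval."""
    return sum(
        any(q_start < t_end and t_start < q_end for t_start, t_end in target)
        for q_start, q_end in query
    )

def shuffle_within_chrom(intervals, chrom_length, rng):
    """Give each interval a random start on the same chromosome, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled

def permutation_p_value(query, target, chrom_length, n_shuffles=1000, seed=0):
    """Empirical one-sided p value: how often do shuffled peaks overlap at least as much?"""
    rng = random.Random(seed)
    observed = count_overlaps(query, target)
    at_least_as_extreme = sum(
        count_overlaps(shuffle_within_chrom(query, chrom_length, rng), target) >= observed
        for _ in range(n_shuffles)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_shuffles + 1)

# Toy single-chromosome data, invented purely for illustration.
chrom_length = 100_000
target_peaks = [(s, s + 200) for s in range(1_000, 100_000, 10_000)]
query_peaks = [(s + 50, s + 250) for s, _ in target_peaks[:8]]
query_peaks += [(5_000, 5_200), (15_000, 15_200)]

observed, p_value = permutation_p_value(query_peaks, target_peaks, chrom_length)
print(f"observed overlaps: {observed}, empirical p = {p_value:.4f}")
```

The smallest p value this approach can report is 1 / (n_shuffles + 1), so more shuffles are needed to resolve very strong enrichments.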

I look forward to seeing whether anyone offers a good solution to this question!

score 3 · Answer 2 · 2013-03-04

11.7 years ago

Istvan Albert 102k

Giving statistical advice is a treacherous business, as no problem is ever as simple as one thinks. Moreover, the person asking the question almost never provides a correct and full description of the problem. I have noticed that a statistician will never give you an answer straight away; they will say things like "let's talk about it more", and then ask a whole bunch of questions, some of which are really hard to answer.

In general I like to think in terms of problem categories rather than an exact solution to one particular problem. Your data sounds like a contingency table type, so perhaps a Chi-square or Fisher's exact test is proper to test for differences in the proportions.
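
As a rough illustration of the contingency-table framing, the sketch below runs a Pearson chi-square test and a one-sided Fisher's exact test on the numbers from the question, using only the Python standard library. It assumes the random set also contains 4000 peaks, which the question does not state, and it treats real and random peaks as two independent samples. The function names are illustrative. The Fisher p value is computed in log space because it is far too small for a floating-point number.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

def fisher_one_sided_log10_p(a, b, c, d):
    """log10 of the one-sided Fisher p value for enrichment in cell a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    log_total = log_comb(n, row1)
    # Hypergeometric tail: probability of at least a overlaps in row 1, given the margins.
    log_terms = [
        log_comb(col1, k) + log_comb(n - col1, row1 - k) - log_total
        for k in range(a, min(row1, col1) + 1)
    ]
    largest = max(log_terms)
    log_p = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return log_p / math.log(10)

# Rows: real sox2 peaks, random sox2 peaks. Columns: overlap oct4, no overlap.
# ASSUMPTION: the random set has 4000 peaks, like the real set.
table = (2500, 1500, 20, 3980)

statistic, p_chi2 = chi_square_2x2(*table)
log10_p_fisher = fisher_one_sided_log10_p(*table)

print(f"chi-square statistic: {statistic:.1f}, p = {p_chi2:.3g}")
print(f"Fisher one-sided log10(p): {log10_p_fisher:.1f}")
```

With counts this lopsided the chi-square p value underflows to zero in double precision, which is why reporting log10(p), or simply "p below machine precision", is the practical option.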

ADD COMMENT • link 11.7 years ago by Istvan Albert 102k

1

You might be right. I have seen Bing Ren's paper (http://www.ncbi.nlm.nih.gov/pubmed/22763441) using Fisher's exact test in their overlap analysis. However, if I want to compare sox2 and random sox2 peaks with the peaks of more than one TF (for example oct4, klf4, p300 and cmyc), the Fisher test won't work, I guess. Anyway, I would love to see you guys also comment on my suggested method, as it was showing what I anticipated.

ADD REPLY • link 11.7 years ago by biorepine ★ 1.5k

0

R has a good (I think) implementation of the Fisher test. You put the counts in a four-column table (overlap / no-overlap in both sets, e.g. test.csv) and can run the following:

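# One row per comparison; the four counts in a row fill a 2x2 matrix column-wise.
counts <- read.csv("test.csv")
fisherList <- apply(counts, 1, FUN=function(x)
    fisher.test(matrix(x, nrow=2), workspace=1000000, alternative="two.sided")$p.value)
# Write one p value per line.
write(fisherList, file="test_results.txt", sep="\n")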

Apparently Barnard's test is better, but I have not tried it in R yet.

ADD REPLY • link 11.7 years ago by Ian 6.1k

score 3 · Answer 3 · 2013-03-04

11.7 years ago

Alastair Kerr 5.3k

The data that you describe lends itself to a likelihood ratio test, e.g. Chi-Squared. However some more thought should be applied to defining a proper null hypothesis. Even then, you need to consider having biological replicates.

Have a look at Rory Stark's R-package DiffBind.

ADD COMMENT • link 10.0 years ago by Alastair Kerr 5.3k

0

How did that "other method in pre-publication" go?

ADD REPLY • link 10.0 years ago by Aaron Statham ★ 1.1k

score 0 · Answer 4 · 2016-11-22

8.0 years ago

i.sudbery 20k

Have a look at the GAT software.

ADD COMMENT • link 8.0 years ago by i.sudbery 20k