Question

How to evaluate the statistical significance of the distribution of breakpoints between two datasets?

3

9.7 years ago

alec_djinn ▴ 390

I am studying the distribution of breakpoints across different human genomes, looking for hotspots: regions of the "sample" genomes that are enriched in breakpoints. To do so, I have divided each chromosome into 10 kb bins and counted how many breaks fall in each bin. I have done the same for some control datasets and for randomly generated datasets. At this point, what is the best statistical test to determine a p-value for each bin?

The data I have looks like this:

                        Sample       Control
Breaks_bin1             10           3
Breaks_bin2             15           6
Breaks_bin3             5            3
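For reference, per-bin counts like the ones above could be produced roughly as follows. This is only a minimal sketch: it assumes the breakpoints of one chromosome are available as 0-based integer positions, and the function name, the toy positions and the 35 kb chromosome length are all hypothetical, not taken from the actual data.

```python
import numpy as np

BIN_SIZE = 10_000  # 10 kb bins, as described in the question


def count_breaks_per_bin(positions, chrom_length, bin_size=BIN_SIZE):
    """Count breakpoints in fixed-width bins along one chromosome.

    positions: 0-based breakpoint coordinates (assumed < chrom_length).
    Returns an array with one count per bin.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    bin_index = np.asarray(positions, dtype=np.int64) // bin_size
    return np.bincount(bin_index, minlength=n_bins)


# Hypothetical breakpoints on a toy 35 kb chromosome (illustration only)
sample_positions = [120, 9_500, 10_001, 25_000, 25_900, 34_999]
sample_counts = count_breaks_per_bin(sample_positions, chrom_length=35_000)
print(sample_counts)
```

Running the same function on the control and on the randomly generated datasets gives count vectors that line up bin by bin.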

statistic • 2.4k views

updated 23 months ago by Ram 44k • written 9.7 years ago by alec_djinn ▴ 390

Ram · Answer 1 · 2015-05-11

0

9.7 years ago

dariober 15k

The way you present the problem, it looks like you want to detect differences in counts between conditions. In this case I would look at methods developed for differential gene expression from RNA-Seq (DESeq, edgeR, limma/voom). Your 10 kb windows would be the "genes" and your break counts would be the expression levels. If you don't have replicates of each condition, be careful how you interpret the results, though: without replicates there is no way to estimate the variability within a condition. You probably also need to pre-filter your data to remove windows with very low counts in both conditions, to cut down the number of tests.
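The pre-filtering step could look something like the sketch below. It uses the three bins from the table in the question and reads "very low counts in both conditions" as a low combined count; the threshold `MIN_TOTAL` is a hypothetical value that would have to be chosen for the real data.

```python
import numpy as np

# Break counts per bin, copied from the table in the question
sample = np.array([10, 15, 5])
control = np.array([3, 6, 3])

# Hypothetical cutoff: drop bins whose combined count is below this value.
MIN_TOTAL = 10

keep = (sample + control) >= MIN_TOTAL
filtered_sample = sample[keep]
filtered_control = control[keep]

print(f"kept {keep.sum()} of {keep.size} bins")
```

Because the filter looks only at the combined count and not at the difference between conditions, it does not depend on which dataset has more breaks in a given bin.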

updated 23 months ago by Ram 44k • written 9.7 years ago by dariober 15k

0

Yes, indeed, I am trying to detect count differences between samples and controls. However, since the data come from different labs, I am looking for a proper statistical approach to validate the findings, that is, to determine whether the difference in counts in each bin is significant (p-value) or not, and I would like to do it with a scipy.stats function or something similar. I cannot figure out which approach is best, though. Chi2, Fisher, Pearson? I am getting different results from each of them and I am not sure which one fits my data best.
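For illustration, one possible per-bin approach with scipy.stats is sketched below; it is not a recommendation settled in this thread. It applies Fisher's exact test to a 2x2 table per bin (breaks inside the bin versus breaks everywhere else, sample versus control) and then adjusts the p-values with a hand-written Benjamini-Hochberg correction. Several things here are assumptions: the totals are taken as the sum of the three bins shown in the question, whereas real totals would be genome-wide, and the helper `benjamini_hochberg` is a name made up for this example.

```python
import numpy as np
from scipy.stats import fisher_exact

# Break counts per bin, copied from the table in the question
sample = np.array([10, 15, 5])
control = np.array([3, 6, 3])

# Assumption: totals are the sums over these three bins only.
# With real data these would be the genome-wide totals per dataset.
sample_total = sample.sum()
control_total = control.sum()


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the sorted p-values
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working back from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


pvalues = []
for s, c in zip(sample, control):
    # Rows: sample, control; columns: breaks in this bin, breaks elsewhere
    table = [[s, sample_total - s], [c, control_total - c]]
    _, p = fisher_exact(table, alternative="two-sided")
    pvalues.append(p)

pvalues = np.array(pvalues)
adjusted = benjamini_hochberg(pvalues)

for i, (p, q) in enumerate(zip(pvalues, adjusted), start=1):
    print(f"bin{i}: p = {p:.3g}, BH-adjusted p = {q:.3g}")
```

On a 2x2 table, Fisher's exact test and the chi-square test both ask whether row and column are independent; Fisher's test is exact, while the chi-square test relies on an approximation that gets poor when expected counts are small, which may explain some of the disagreement between them. Neither test accounts for lab-to-lab variation, so the caveat about missing replicates still applies.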

9.7 years ago by alec_djinn ▴ 390