I have two genomic intervals, let's call them Early and Late, whose activity is measured as raw counts in quantitative sequencing experiments; let's call these numbers E and L. In a biomarker analysis, I am interested in knowing when E > L and when E > 3L. Obviously, the total number of counts (E + L) is a function of how much budget I spend on sequencing.
I am looking for a simple way to decide the minimum number of sequences required for statements such as "E > L" to make sense. For instance, if E = 2 and L = 1, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords relevant to answering my question (binomial distribution, confidence interval, Poisson noise, ...), but I am stuck. Could somebody suggest a method to determine the least amount of sequencing needed to confidently decide when E > nL?
Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?
We are using CAGE (Cap Analysis of Gene Expression) libraries from virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichment is also planned, but that is a different story.)
Assuming the counts are Poisson-distributed with rate r, then for r sufficiently large (> ~20, although the approximation is already quite good before that and only improves as r increases), the Poisson distribution can be approximated by a Gaussian distribution with mean r and variance r. You can also view your question as testing the ratio of the rates of two Poisson distributions: "E > nL" corresponds to asking whether the rate ratio r_E / r_L exceeds n. For this, have a look at the R package rateratio.test.
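To make this concrete, here is a minimal sketch in R of the conditional argument behind such rate-ratio tests (which, as far as I know, is also the conditioning rateratio.test relies on): given the total E + L and equal exposure for the two intervals, E follows a Binomial(E + L, ρ/(ρ + 1)) distribution when the true rate ratio is ρ, so testing E > nL amounts to a one-sided binomial test of p > n/(n + 1). The function names, the threshold n = 3, α = 0.05, the 0.8 target power, and the assumed true ratio of 6 are all illustrative choices of mine, not values from the question.

```r
## Exact one-sided test of "the rate ratio exceeds n", given observed E and L.
## Conditional on the total E + L, E ~ Binomial(E + L, rho / (rho + 1)) when
## the true rate ratio is rho, so the null "ratio <= n" becomes p <= n / (n + 1).
ratio_exceeds <- function(E, L, n = 1, alpha = 0.05) {
  binom.test(E, E + L, p = n / (n + 1), alternative = "greater")$p.value < alpha
}

ratio_exceeds(2, 1, n = 1)   # FALSE: 2 vs 1 counts cannot support "E > L"

## Smallest total count N = E + L at which the test above detects a true
## rate ratio rho_true with the requested power (exact binomial, no simulation).
## rho_true = 6 is an arbitrary illustrative effect size.
min_total_counts <- function(n = 3, rho_true = 6, alpha = 0.05,
                             power = 0.8, N_max = 10000) {
  p0 <- n / (n + 1)                # boundary of the null hypothesis "ratio <= n"
  p1 <- rho_true / (rho_true + 1)  # success probability under the assumed truth
  for (N in seq_len(N_max)) {
    k_crit <- qbinom(1 - alpha, N, p0) + 1            # critical count for E
    if (pbinom(k_crit - 1, N, p1, lower.tail = FALSE) >= power) return(N)
  }
  NA
}

min_total_counts(n = 3, rho_true = 6)   # rough minimum for E + L
```

With these numbers, the first call confirms the intuition that E = 2 vs L = 1 supports no conclusion, and min_total_counts() gives an order of magnitude for the total E + L you need before a given true ratio becomes detectable at the E > 3L threshold. The exact binomial power is not strictly monotone in the total, so treat the result as a ballpark rather than a hard cutoff.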