How to estimate the minimal amount of sequencing required for a biomarker analysis?
6.9 years ago
Charles Plessy ★ 2.9k

I have two genomic intervals, let's call them Early and Late, whose activity is measured as raw counts in quantitative sequencing experiments; let's call these counts E and L. In a kind of biomarker analysis, I am interested in knowing when E > L and when E > 3L. Obviously, the total number of counts (E + L) depends on how much budget I spend on sequencing.

I am looking for a simple way to decide the minimum amount of sequencing needed for statements such as "E > L" to make sense. For instance, if E = 2 and L = 1, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords that are relevant to my question (binomial distribution, confidence interval, Poisson noise, ...) but I am stuck. Could somebody suggest a method to determine the least amount of sequencing needed to confidently determine when E > nL?

statistics biomarker • 1.7k views

Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?


We are using CAGE (Cap Analysis Gene Expression) libraries of virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichment is also planned, but that is a different story.)


Assuming the counts are Poisson-distributed with rate r, for sufficiently large r (more than about 20, although the approximation is already quite good below that and only improves as r increases), the Poisson distribution can be approximated by a Gaussian distribution with mean r and variance r. You could also view this as testing the ratio of the rates of two Poisson distributions; for this, have a look at the R package rateratio.test.
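
As a rough sketch of the rate-ratio view, base R's `poisson.test` can compare two Poisson counts directly (this is a stand-in for the package mentioned above, not that package's own interface). The example counts and the equal exposure `T = c(1, 1)` (both intervals counted in the same library) are assumptions for illustration.

```r
# Compare two Poisson counts E and L observed in the same library (equal exposure).
# Null hypothesis: rate(E) / rate(L) = r; alternative: the ratio is greater than r.
E <- 200
L <- 100

# Is E > L? (rate ratio greater than 1)
res1 <- poisson.test(c(E, L), T = c(1, 1), r = 1, alternative = "greater")

# Is E > 3L? (rate ratio greater than 3)
res3 <- poisson.test(c(E, L), T = c(1, 1), r = 3, alternative = "greater")

print(res1$p.value)
print(res3$p.value)

# Gaussian approximation to a Poisson count: mean r, standard deviation sqrt(r).
r <- 20
exact <- ppois(25, lambda = r)
gauss <- pnorm(25 + 0.5, mean = r, sd = sqrt(r))  # with continuity correction
print(c(exact = exact, gauss = gauss))
```

With these counts the first test should give strong evidence that E > L, while the second should not support E > 3L, since the observed ratio is only 2.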

6.9 years ago
Erik Arner ▴ 20

How about doing a binomial test, where E (or L) is the number of successes, E + L is the number of trials, and p = 0.5? In R your example with E = 2 and L = 1 would then be:

binom.test(2, 3, p=0.5)

which is not significantly different from p = 0.5, whereas e.g. E = 200 and L = 100 would be.
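
The same test can be extended to the "E > nL" form of the question by changing the null proportion: if E were exactly n times L, the expected fraction E / (E + L) would be n / (n + 1). Below is a minimal sketch under that assumption; the helper name `test_fold` is hypothetical, and the one-sided alternative is a choice made here because the question asks specifically about E being larger.

```r
# One-sided exact binomial test of "E > n * L".
# At the boundary E = n * L, the expected fraction E / (E + L) is n / (n + 1).
test_fold <- function(E, L, n = 1) {
  binom.test(E, E + L, p = n / (n + 1), alternative = "greater")$p.value
}

test_fold(2, 1)             # too few counts to conclude E > L
test_fold(200, 100)         # strong evidence that E > L
test_fold(200, 100, n = 3)  # no evidence for E > 3L (observed ratio is only 2)
test_fold(400, 100, n = 3)  # observed ratio of 4: evidence for E > 3L
```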


Thanks a lot Erik. Said differently, it looks like I would (for instance) need at least 100 counts if I want to be at least 95% sure that an E / (E + L) fraction of about 0.59 really indicates that E > L.

> library(magrittr)  # provides %>% and set_rownames
> sapply(1:10 * 10, function(n) binom.test(c(n/2, n/2), p=0.5, alternative = "l")$conf.int) %>% t %>% set_rownames(1:10 * 10)
    [,1]      [,2]
10     0 0.7775589
20     0 0.6980461
30     0 0.6611073
40     0 0.6389083
50     0 0.6237541
60     0 0.6125890
70     0 0.6039339
80     0 0.5969763
90     0 0.5912285
100    0 0.5863783
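
The question can also be turned around: for a given total count N, what is the smallest E for which a one-sided exact binomial test would reject the boundary hypothesis? The sketch below assumes a significance level of 0.05 and uses a hypothetical helper, `min_significant_E`; it returns `NA` when no outcome at that depth can be significant.

```r
# For a total count N, find the smallest E such that a one-sided exact binomial
# test rejects "E <= n * L" at level alpha. Returns NA if no such E exists.
min_significant_E <- function(N, n = 1, alpha = 0.05) {
  p0 <- n / (n + 1)
  # upper[k + 1] is P(X >= k) under the null, for k = 0..N
  upper <- pbinom((0:N) - 1, N, p0, lower.tail = FALSE)
  k <- which(upper <= alpha)
  if (length(k) == 0) NA_integer_ else as.integer(k[1] - 1)
}

totals <- c(3, 10, 20, 50, 100)
data.frame(
  total  = totals,
  min_E  = sapply(totals, min_significant_E),        # threshold for E > L
  min_E3 = sapply(totals, min_significant_E, n = 3)  # threshold for E > 3L
)
```

Totals for which the result is `NA` are too shallow to support the statement at all, whatever the observed split.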

Yes, but keep in mind that if you test multiple samples you will have a multiple-testing issue, so take that into account when choosing your required counts.
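
A minimal sketch of what that correction could look like, using base R's `p.adjust`. The five (E, L) pairs are made-up values for illustration, and Bonferroni and Holm are just two of the available methods, not a recommendation specific to this data.

```r
# Hypothetical (E, L) counts for five samples; values are made up for illustration.
E <- c(60, 65, 70, 58, 80)
L <- c(40, 35, 30, 42, 20)

# One-sided exact binomial test of E > L in each sample
p_raw <- mapply(function(e, l) {
  binom.test(e, e + l, p = 0.5, alternative = "greater")$p.value
}, E, L)

# Adjust for the number of samples tested
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(E, L,
           p_raw  = signif(p_raw, 3),
           p_bonf = signif(p_bonf, 3),
           p_holm = signif(p_holm, 3))
```

In this made-up example, the first sample (60 of 100 counts) passes an unadjusted 0.05 threshold but no longer does after Bonferroni correction across five samples, so more counts per sample would be needed.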
