Question

How Do I Estimate The Number Of Rna-Seq Reads Needed To Detect Differential Expression?

1

Entering edit mode

11.0 years ago

scottdaniel25 ▴ 30

x-posted at stack exchange's 'cross calidated' here.

Currently, our samples are going to get at least 30 million reads each. Given that we have 3 biological replicates per condition, that gives us 90 million reads. Let's say we only get 50% of those aligned and only count 100bp segments (even though Illumina has paired-end, I think that only improves accuracy of mapping). That gives us 45e6 x 100 = 45e8 bp of reads.

The Drosophila exome (including non-coding genes) is 17e3 genes x Avg 6e3 bp per gene = 1e8 bp. 45e8 bp of reads / 1e8 bp coding and non-coding exome = 45x coverage. My numbers for the drosophila exome are from here: http://flybase.org/static_pages/docs/release_notes.html

One additional complication: Due to the nature of the experiment, only about 1/3 of the animal is affected by my experimental condition. My worry is that the other 2/3 will "cover-up" any gene expression changes that I might see in the affected tissue.

The question is how many genes will I actually be able to detect as differentially expressed at this 45x coverage? Am I only going to be able to detect genes with high abundance? Will increasing it to 75x help or just be a waste of money?

I was told by one professor to not worry about it and another that it is a hypergeometric test in some way. Thanks much.

p.s. I've found this tool to do power analysis with RNA-seq: http://euler.bc.edu/marthlab/scotty/scotty.php but I don't have pilot data (and don't quite know how to make simulated data) to use it with.

rna-seq differential-expression reads • 9.5k views

ADD COMMENT • link updated 8.3 years ago by Biostar 20 • written 11.0 years ago by scottdaniel25 ▴ 30

score 8 · Answer 1 · 2013-11-20

8

Entering edit mode

11.0 years ago

Charles Warden 8.3k

All power calculations really need some sort of pilot data in order to give reasonable estimations.

Here is what I can tell you from human and mouse data that I have worked with:

1) 50% alignment is pretty bad. I typically see at least 80-90% alignment and consider <70% alignment a red flag.

2) In those cases, 10 M reads per sample is really sufficient for most gene-based comparisons. Above that, I would say biological replicates are more important than higher coverage. To be safe, it is probably best to ask for at least 20 M reads per sample.

3) Gene expression differences will vary widely depending upon conditions. This is why it is really not possible to guarantee any size gene list without additional information about the specific sample. Sometimes you'll see 100s or 1000s of differential expressed genes. Sometimes you won't see any (with, say, FDR < 0.25 or preferably FDR < 0.05)

4) When you give your RNA sample to your core (or company), they should usually only need a small portion of your sample. In general, it shouldn't be too hard to produce more reads for the same sample in a later run. For this reason, I would probably recommend just starting with the standard coverage offered.

ADD COMMENT • link 11.0 years ago by Charles Warden 8.3k

0

Entering edit mode

Excellent answer. One comment though

50% alignment is pretty bad. I typically see at least 80-90% alignment and consider <70% alignment a red flag.

True for paired-end libraries, not for single-end

ADD REPLY • link 11.0 years ago by Irsan ★ 7.8k

2

Entering edit mode

I usually see 80-90% with single-end as well (unless we're talking unique alignments, in which case nevermind!). This is likely organism-dependent.

ADD REPLY • link 11.0 years ago by Devon Ryan 104k

0

Entering edit mode

I actually work almost exclusively with single-end libraries and see those stats, but I agree with dpryan79 that it may depend on the reference genome (which is why I only felt confident those stats would be relevant for mouse and human).

ADD REPLY • link 11.0 years ago by Charles Warden 8.3k

0

Entering edit mode

Dpryan and cwarden, you are both right. I based my comment on my experiences on 1 single end library (6 human rna samples) that turn out to have major QC issues!

ADD REPLY • link 11.0 years ago by Irsan ★ 7.8k

score 0 · Answer 2 · 2013-11-22

0

Entering edit mode

11.0 years ago

scottdaniel25 ▴ 30

So I think I sort of answered it... from a very rough perspective.

What I did was use R to simulate Fisher.test's for differential expression based on number of reads. I figured it would give a good approximation of changing the range of reads (i.e. from 1 to 45 vs 1 to 75 as stated in the problem).

Code and output here (it's too long for a post)

ADD COMMENT • link 11.0 years ago by scottdaniel25 ▴ 30

0

Entering edit mode

A few comments

please don't make variable names such as p_values_of_read_depth_fourtyfive and p_values_of_read_depth_seventyfive I understand that you are trying to be informative (though that does not work either since you have an list rather than a single value there) but overall an overly long name is less helpful than than calling it just x.
then you should summarize what you plot since it is not clear what they are showing. What is p_values_of_read_depth_fourtyfive[,2]? At this point the overly lengthy name is more confusing as ever. That is clearly not what you are calling it as. You did a lot of work on this but it is not all that helpful since it is not clear what you are showing.

ADD REPLY • link 8.3 years ago by Istvan Albert 102k