Question

Millions of Reads Needed for Differential Gene Expression

2

5.2 years ago

contact ▴ 20

Hi,

I have read in a few places that one needs ~10-20 million reads per sample to do differential gene expression analysis (see, for example, the "Improving mean estimates (i.e., reducing variance) with biological estimates" section of the Harvard RNA-seq Tutorial).

However, I'm wondering whether this refers to the total number of reads or to the number of reads that fall within the coding portions of the genome. I have seen samples with ~80 million reads where the count data in DESeq2 sum to only ~3 million for the sample in question.

So, I'm curious how to interpret this ~10-20 million reads per sample rule-of-thumb.

Thanks

differential-gene-expression RNA-Seq deseq2 • 3.4k views

ADD COMMENT • link updated 7 months ago by Ram 44k • written 5.2 years ago by contact ▴ 20

3

I normally expect between 1/3 and 2/3 of sequenced reads to fall into exonic regions (depending on many things, but primarily on whether the sample is total RNA or polyA RNA). There is something weird about a sample where only 3 of 80 million reads map to exons.

Also, I'd say 10-20 million reads is a bit on the low side.
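
As a rough check of the numbers in the question, the fraction of sequenced reads that end up as counts can be computed directly. This is a minimal Python sketch; the function name and the use of 1/3 as a threshold (the lower end of the range above) are illustrative assumptions, not a standard cutoff.

```python
def assigned_fraction(total_reads, assigned_counts):
    """Fraction of sequenced reads that were counted against genes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return assigned_counts / total_reads


# Figures from the question: ~80 million reads sequenced, ~3 million counted.
frac = assigned_fraction(80_000_000, 3_000_000)

# Lower end of the 1/3 to 2/3 range mentioned above (illustrative threshold,
# not an established cutoff).
EXPECTED_MIN = 1 / 3
looks_low = frac < EXPECTED_MIN

print(f"assigned fraction: {frac:.2%}, below expected range: {looks_low}")
```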

ADD REPLY • link 5.2 years ago by i.sudbery 20k

score 2 · Answer 1 · 2019-09-30

2

5.2 years ago

guillaume.rbt ★ 1.0k

Usually the number of reads required corresponds to the number of reads that will be sequenced, and not the number of counts expected.

ADD COMMENT • link 5.2 years ago by guillaume.rbt ★ 1.0k

score 0 · Answer 2 · 2019-09-30

0

5.2 years ago

h.mon 35k

You are not faithfully quoting the tutorial you have linked. The tutorial states:

Generally, the minimum sequencing depth recommended is 20-30 million reads per sample, but we have seen good RNA-seq experiments with 10 million reads if there are a good number of replicates.

[Figure: # of reads per # of DE genes]

The issue is relatively complex, but well explored (in human, mouse, Arabidopsis and yeast) - this means we have good "rules of thumb", but these rules may be frequently off.

A general consensus is that adding biological replicates increases statistical power more than adding sequencing depth, with the added benefit that a good number of biological replicates allows one to truly detect and remove outlier samples. Other considerations are (as i.sudbery already said) the percentage of reads assigned to a gene when quantifying expression; the size of the effect expected from the experiment; the quality of the underlying genome and its annotation; and so on.
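
To illustrate the replicates-versus-depth point, here is a back-of-the-envelope sketch in Python. It assumes negative binomial counts (the model DESeq2 uses) and the delta-method approximation that the variance of a group's log mean is roughly (1/mu + dispersion) / n, where mu is the gene's mean count and n is the number of replicates. The function name and the example values (mean count 100, dispersion 0.1, 3 replicates per group) are assumptions chosen for illustration, not values from the tutorial or from this dataset.

```python
import math


def lfc_standard_error(mean_count, dispersion, n_a, n_b):
    """Approximate SE of the (natural) log fold change between two groups.

    Assumes negative binomial counts, where Var(count) = mu + dispersion * mu^2,
    so Var(log of group mean) ~= (1/mu + dispersion) / n by the delta method.
    Both groups are assumed to share the same mean and dispersion.
    """
    per_sample_var = 1.0 / mean_count + dispersion
    return math.sqrt(per_sample_var / n_a + per_sample_var / n_b)


# Hypothetical gene: mean count 100, dispersion 0.1, 3 vs. 3 replicates.
baseline = lfc_standard_error(100, 0.1, 3, 3)

# Doubling depth doubles the mean count; only the 1/mu (shot noise) term shrinks.
double_depth = lfc_standard_error(200, 0.1, 3, 3)

# Doubling replicates shrinks both the shot-noise and the biological term.
double_reps = lfc_standard_error(100, 0.1, 6, 6)

print(f"baseline SE:      {baseline:.3f}")
print(f"2x depth SE:      {double_depth:.3f}")
print(f"2x replicates SE: {double_reps:.3f}")
```

Under these assumptions, extra depth helps little once the mean count is well above 1/dispersion, whereas doubling the replicates reduces the standard error by a factor of sqrt(2) regardless of expression level.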

ADD COMMENT • link 5.2 years ago by h.mon 35k

0

The link was an example. I have seen other papers say a good rule-of-thumb is ~10 million reads.

Here is some more info regarding our experimental design: the samples are total RNA, and I believe they were ribo-depleted. We have 2 groups/conditions (drugged vs. non-drugged); non-drugged has 3 replicates and drugged has 9 replicates. The samples are patient-derived xenografts from human cancer. Sequenced reads deriving from mouse have been filtered out. Of the reads left, we generally get only ~3 percent aligning to exonic regions, from a total of ~80-90 million reads. My question is, are we good to look at DEGs?

ADD REPLY • link 5.2 years ago by contact ▴ 20

2

I guess you'll have to have a look, but if you are looking at considerably fewer than 3 million reads, you are probably going to be significantly underpowered and will find DE only for the most highly expressed genes.

You should really look into why only 3% of reads are aligning to exonic regions - something isn't right there.

Are the reads just not aligning, or are they aligning somewhere else?
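
One way to answer that question is to tabulate where the reads went. The sketch below assumes the counts were produced with featureCounts, which writes a tab-separated .summary file with categories such as Assigned, Unassigned_Unmapped and Unassigned_NoFeatures. The thread does not say which counting tool was used, so treat this as an illustration only; the function names are hypothetical.

```python
def summarize_featurecounts(summary_text):
    """Return {category: fraction of all reads} for the first sample column.

    Expects the text of a featureCounts .summary file: a header line
    ("Status<TAB>sample.bam") followed by "Category<TAB>count" lines.
    """
    counts = {}
    for line in summary_text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split("\t")
        counts[fields[0]] = int(fields[1])
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in summary")
    return {category: n / total for category, n in counts.items()}


def report(fractions):
    """Print categories from largest to smallest share of reads."""
    for category, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
        print(f"{category:30s} {frac:7.2%}")
```

A large Unassigned_Unmapped share would suggest the reads are not aligning at all, whereas a large Unassigned_NoFeatures share would suggest they align outside annotated exons (for example, to introns or intergenic regions).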

ADD REPLY • link 5.2 years ago by i.sudbery 20k