Question

For low-coverage RNA-seq, how many assigned reads is the bare minimum for differential gene expression analysis?

0

6.3 years ago

senowinski ▴ 30

I have low-coverage RNA-seq of human tissue (~6 million reads aligned using STAR). Across the 84 samples, the number of reads assigned to genes ranges from 2 to 7 million. What is the bare minimum number of reads I can use for differential gene expression analysis? What is a sensible cut-off? Ideally I would like to retain as many samples as possible.

RNA-Seq • 3.2k views

3

Depends on the genome. For example, you need more read depth for human alignments than you do for fly alignments.

What is the bare minimum number of reads I can use for differential gene expression analysis?

There's not really a bare minimum. It depends on how sensitive your analysis needs to be. It also depends on sequencing quality (how many good reads remain after processing) and genome size, as I mentioned already.

You should go ahead with the differential expression analysis. That part doesn't take long. And if you decide to do more sequencing, you will already have the differential expression pipeline set up.

6.3 years ago by goodez ▴ 640

0

These are human alignments. When you say go ahead with the differential gene expression analysis, do you think I should try it with all the samples?

6.3 years ago by senowinski ▴ 30

0

Well, as I say, it depends if they are outliers on metrics other than read count.

6.3 years ago by i.sudbery 21k

2

What is your organism? Six million reads is low coverage for human, but not for yeast, for example. And how are the 84 samples distributed across treatments? The literature shows that biological replicates matter more than read depth per sample when it comes to statistical power.

6.3 years ago by h.mon 35k

2

We normally talk about reads in the sample, rather than reads assigned to genes. A dirty little secret that people often don't talk about is that often only around a third (total ribo-depleted) to two thirds (polyA) of reads map to exons. So when someone says they have 20M polyA reads, they probably only really have about 13M assigned to exons.
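To make that arithmetic concrete, here is a minimal sketch in Python. The function name `estimate_exonic_reads` is made up for illustration, and the fractions are just the ballpark figures quoted above, not measurements from any particular library.

```python
# Rough estimate of exon-assigned reads from total reads.
# ASSUMPTION: the fractions are the ballpark values quoted in this thread
# (about 2/3 for polyA, about 1/3 for total ribo-depleted), not measured values.
EXONIC_FRACTION = {
    "polyA": 2 / 3,
    "ribo_depleted": 1 / 3,
}


def estimate_exonic_reads(total_reads, library_type):
    """Return the approximate number of reads expected to map to exons."""
    return total_reads * EXONIC_FRACTION[library_type]


if __name__ == "__main__":
    for library_type in EXONIC_FRACTION:
        estimate = estimate_exonic_reads(20_000_000, library_type)
        print(f"{library_type}: ~{estimate / 1e6:.1f}M of 20M reads on exons")
```

Run in reverse, the same ratios suggest that 2-7M reads assigned to genes could correspond to noticeably more total reads per sample, depending on the library type.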

I'd normalise your samples with DESeq2's rlog and see which samples stand out on the PCA/MDS. Do you have low-read samples that are a million miles away from all the other samples? Do they have other things wrong with them (GC distribution, over-represented sequences, etc.)? If your low-coverage samples cluster on a PCA/MDS with the high-coverage ones, I'd probably use them. If they are miles away, I'd discard them.
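A rough sketch of that kind of check, in Python with NumPy only. Note the substitutions: it uses a simple log2 counts-per-million transform as a stand-in for DESeq2's rlog (which is an R function and does proper variance stabilisation), and a plain SVD-based PCA. All function names, the simulated data and the 5-MAD threshold are assumptions for illustration; in practice you would still look at the plot rather than trust a hard cut-off.

```python
import numpy as np


def log_cpm(counts, prior=1.0):
    """log2 counts-per-million for a genes x samples matrix (NOT rlog)."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)
    return np.log2(counts / library_sizes * 1e6 + prior)


def pca_scores(expr, n_components=2):
    """Return samples x n_components PCA scores for a genes x samples matrix."""
    centred = expr - expr.mean(axis=1, keepdims=True)  # centre each gene
    u, s, _ = np.linalg.svd(centred.T, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def pca_distances(scores):
    """Euclidean distance of each sample from the median PCA position."""
    return np.linalg.norm(scores - np.median(scores, axis=0), axis=1)


def flag_outliers(scores, n_mads=5.0):
    """Flag samples unusually far from the bulk (threshold is an assumption)."""
    dist = pca_distances(scores)
    mad = np.median(np.abs(dist - np.median(dist)))
    return dist > np.median(dist) + n_mads * mad


def simulate_counts(n_genes=2000, n_samples=20, seed=0):
    """Toy data: depths of 2-7M reads, with sample 0 given a scrambled profile."""
    rng = np.random.default_rng(seed)
    profile = rng.lognormal(0.0, 2.0, n_genes)
    profile /= profile.sum()
    depth = rng.uniform(2e6, 7e6, n_samples)
    expected = np.outer(profile, depth)
    # Sample 0 mimics a genuinely bad library: same depth range, wrong profile.
    expected[:, 0] = rng.permutation(profile) * depth[0]
    return rng.poisson(expected)


if __name__ == "__main__":
    counts = simulate_counts()
    scores = pca_scores(log_cpm(counts))
    print("flagged samples:", np.flatnonzero(flag_outliers(scores)))
```

The point of the toy data is that depth alone (2M versus 7M) is not what makes a sample an outlier here; the scrambled expression profile is.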

As was pointed out by @h.mon, a lot of power in RNA-seq comes from replicates rather than read number.