5.8 years ago
senowinski
I have low-coverage RNA-seq of human tissue (~6 million reads), aligned using STAR. Across the 84 samples, the number of reads aligned to genes ranges from 2 to 7 million. What is the bare minimum number of reads I can use for differential gene expression analysis? What is a sensible cut-off? Ideally I would like to retain as many samples as possible.
Depends on the genome. For example, you need more read depth for human alignments than you do for fly alignments.
There's not really a bare minimum. Depends how sensitive your analysis is. Also depends on sequencing quality (how many good reads remain after processing) and genome size, as I mentioned already.
You should go ahead with the differential expression analysis. That part doesn't take that long. And if you decide to do more sequencing, you will have the differential expression pipeline already set up.
These are human alignments. When you say go ahead with the differential gene expression analysis, do you think I should run it on all the samples?
Well, as I say, it depends if they are outliers on metrics other than read count.
What is your organism? Six million reads is low coverage for human, but it is not for yeast, for example. And how are the 84 samples distributed within treatments? Literature shows biological replicates are more important than read depth per sample when it comes to statistical power.
We normally talk about reads in the sample, rather than reads assigned to genes. A dirty little secret that people often don't talk about is that often only around a third (total, ribo-depleted) to two thirds (polyA) of reads map to exons. So when someone says they have 20M polyA reads, they probably only really have 13M assigned to exons.
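To make that arithmetic explicit, here is a minimal sketch. The function name is hypothetical, and the fractions are just the rough "one third" and "two thirds" figures quoted above, not measured values for any particular library.

```python
def expected_exonic_reads(total_reads, exonic_fraction):
    """Rough number of reads expected to land on exons."""
    return total_reads * exonic_fraction

# Rule-of-thumb fractions from the post above (assumptions, not measurements).
POLYA_FRACTION = 2 / 3         # polyA-selected libraries
RIBO_DEPLETED_FRACTION = 1 / 3  # total RNA, ribo-depleted libraries

polya_exonic = expected_exonic_reads(20_000_000, POLYA_FRACTION)
ribo_exonic = expected_exonic_reads(20_000_000, RIBO_DEPLETED_FRACTION)

print(f"20M polyA reads         -> ~{polya_exonic / 1e6:.1f}M exonic")
print(f"20M ribo-depleted reads -> ~{ribo_exonic / 1e6:.1f}M exonic")
```

The practical point is that "reads in the sample" and "reads assigned to genes" can differ by a large factor, so it matters which one a quoted depth refers to.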
I'd normalise your samples with DESeq2's rlog and see which samples stand out on the PCA/MDS. Do you have low-read samples that are a million miles away from all the other samples? Do they have other things wrong with them (GC distribution, over-represented sequences, etc.)? If your low-coverage samples cluster on a PCA/MDS with the high-coverage ones, I'd probably use them. If they are miles away, I'd discard them. As was pointed out by @h.mon, a lot of power in RNA-seq comes from replicates rather than read number.
For reference https://academic.oup.com/bioinformatics/article/30/3/301/228651
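The "normalise, run a PCA, look for samples far from the rest" check can be sketched in Python as below. This is only an illustration under stated assumptions: it uses log2 counts-per-million as a simple stand-in for DESeq2's rlog (which is an R function and is not reimplemented here), the count matrix is simulated, and the outlier rule (distance from the median position, in units of MAD) is one arbitrary choice. All function names are hypothetical.

```python
import numpy as np

def log_cpm(counts, prior=1.0):
    """log2 counts-per-million for a genes x samples matrix.
    A simple stand-in for rlog, NOT the same transformation."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total assigned reads per sample
    return np.log2(counts / lib_sizes * 1e6 + prior)

def pca_scores(logexpr, n_components=2):
    """Sample coordinates on the first principal components (via SVD)."""
    x = logexpr.T                            # samples as rows
    x = x - x.mean(axis=0)                   # centre each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def outlier_flags(scores, n_mads=5.0):
    """Flag samples far from the median position in PCA space."""
    dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
    mad = np.median(np.abs(dist - np.median(dist)))
    return dist, dist > np.median(dist) + n_mads * mad

# Simulated data (an assumption for illustration only): 11 samples share one
# expression profile at depths of 2-7M reads; a 12th sample has a distorted
# profile, mimicking a sample with something else wrong with it.
rng = np.random.default_rng(0)
profile = rng.lognormal(0.0, 1.0, 500)
profile /= profile.sum()
depths = np.linspace(2e6, 7e6, 11).astype(int)
cols = [rng.multinomial(d, profile) for d in depths]

bad = profile.copy()
bad[:250] *= 4
bad /= bad.sum()
cols.append(rng.multinomial(4_000_000, bad))
counts = np.column_stack(cols)               # genes x samples

scores = pca_scores(log_cpm(counts))
dist, flags = outlier_flags(scores)
for i, (d, f) in enumerate(zip(dist, flags)):
    print(f"sample {i:2d}  reads={counts[:, i].sum():>8d}  dist={d:7.2f}  outlier={f}")
```

With real data you would replace `counts` with your gene-level count matrix and inspect the plot by eye as well; a flagged sample is a prompt to look at its other QC metrics, not an automatic reason to drop it.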