I often get this question from collaborators and PIs planning their experiments and budgets: how much coverage is sufficient for an RNA-seq experiment?
One problem with this question is that arriving at a single meaningful coverage value is difficult for RNA-seq. Any sample might have a different total amount of transcription, different numbers of transcribed genes/transcripts, a different amount of transcriptome complexity (more or less alternative expression), and a different distribution of expression levels for those transcripts. Not to mention common confounding factors like 3' end-bias. All of these factors effectively alter the denominator for any overall coverage calculation. More useful metrics, in my opinion, are things like the total number of reads (and the percent of those that map to the transcriptome) and the total number of transcripts detected with at least X% of their junctions at Y coverage or higher. We usually target at least 10k transcripts with at least 50% of their junctions at 10-20x coverage. That is approximately what we currently get from a single HiSeq lane of 200-300M reads.
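For concreteness, here is a minimal sketch of how that junction-coverage metric could be computed. It assumes a hypothetical TSV (the file name, column layout, and thresholds are all illustrative, not the output of any particular tool) with one row per junction: transcript ID, junction ID, and junction read count.

```python
# Sketch: count transcripts with >=50% of junctions at >=20x coverage.
# Input format is assumed: transcript_id <tab> junction_id <tab> read_count
import csv
from collections import defaultdict

MIN_JUNCTION_COV = 20   # per-junction read depth required
MIN_FRACTION = 0.5      # fraction of a transcript's junctions that must pass

junction_cov = defaultdict(list)
with open("junction_coverage.tsv") as fh:  # hypothetical file
    for transcript_id, junction_id, count in csv.reader(fh, delimiter="\t"):
        junction_cov[transcript_id].append(int(count))

well_covered = [
    t for t, covs in junction_cov.items()
    if sum(c >= MIN_JUNCTION_COV for c in covs) / len(covs) >= MIN_FRACTION
]
print(f"{len(well_covered)} transcripts with >={MIN_FRACTION:.0%} of "
      f"junctions at >={MIN_JUNCTION_COV}x")
```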
But how much coverage is sufficient? It's even harder to answer this, as it really depends on what you are hoping to accomplish. If you only need gene expression levels equivalent to, say, an Affymetrix gene expression array, then it is probably more than sufficient. The same is true if you only want to validate variants in medium-to-highly expressed genes. But I would argue that if that's all you want, then don't waste time/money on RNA-seq. What we hope to get from RNA-seq are the above two items, plus the ability to confirm variants in lower-expressed genes, get good estimates of expressed VAFs, identify lowly or rarely expressed tumor-specific isoforms, show significant differences between alternative splicing patterns, etc. For all these purposes, the one HiSeq lane described above is, in my opinion, just enough to get us started. At present I think it is a good compromise between cost and benefit. But as sequencing prices go down we will want to increase depth, not decrease it.
We recently found a known promoter mutation (TERT) in some tumors (HCC) we were studying. The mutation is predicted to increase binding of a transcription factor and has been shown to drive subtle but significant 2-4 fold increases in transcription. When we look at expression levels for this gene in RNA-seq data, we just barely detect it. In fact, the FPKM levels would normally be considered in the noise range. A typical filter of FPKM > 1 in at least 20% of samples would eliminate this gene before even testing for a significant difference between normal/tumor or mutant/wildtype. This is a very important cancer gene, with a known mutation causing functional up-regulation, that is almost undetectable at current depth levels if we don't already know to look for it! So I argue that more depth is still needed (cost permitting). I would love to hear other people's thoughts on this.
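To make the point concrete, here is a toy illustration of how that standard filter can discard a biologically important gene. The FPKM values below are invented for the example; only the filter rule (FPKM > 1 in at least 20% of samples) comes from the discussion above.

```python
# Toy FPKM values per gene across 10 samples (all numbers invented).
fpkm = {
    "GAPDH": [512, 480, 530, 495, 510, 505, 490, 520, 515, 500],
    "TERT":  [0.2, 0.1, 0.9, 0.3, 0.05, 0.4, 0.8, 0.2, 0.6, 0.1],
}

MIN_FPKM = 1.0
MIN_SAMPLE_FRACTION = 0.2

def passes_filter(values):
    # Keep a gene only if it exceeds MIN_FPKM in enough samples.
    expressed = sum(v > MIN_FPKM for v in values)
    return expressed / len(values) >= MIN_SAMPLE_FRACTION

for gene, values in fpkm.items():
    status = "kept" if passes_filter(values) else "filtered out"
    print(f"{gene}: {status}")

# TERT never exceeds FPKM 1 in this toy data, so it is removed before any
# normal/tumor or mutant/wildtype comparison is even attempted.
```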
One issue to keep in mind: some (many? most?) samples are dominated by reads from just one or a few genes. Since these genes take up a large fraction of your fixed supply of total reads, the rest of the genes get lower coverage as a result. For example, in blood samples, globin genes commonly make up 50% or more of the total reads, leaving you with less than half of the reads for the rest of the transcriptome. Keep this in mind when deciding how many reads you need, especially if your samples are known to be dominated by certain transcripts.
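The back-of-the-envelope arithmetic is worth spelling out. Assuming (illustratively) one 250M-read lane and a 50% globin fraction:

```python
# How many reads are left for everything else when dominant genes eat a
# fixed fraction of the library? All numbers are illustrative.
total_reads = 250_000_000      # roughly one HiSeq lane, per the post above
globin_fraction = 0.5          # globins often consume 50%+ of blood-sample reads

usable_reads = total_reads * (1 - globin_fraction)
print(f"Reads left for the rest of the transcriptome: {usable_reads:,.0f}")

# To match the effective depth of a non-dominated sample, you would need
# roughly total_reads / (1 - globin_fraction) reads to begin with.
required = total_reads / (1 - globin_fraction)
print(f"Reads needed to match a 250M-read non-blood sample: {required:,.0f}")
```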
Indeed. This falls under the topic of "different distribution of expression levels for those transcripts". While the overall shape of the distribution does change somewhat from sample to sample, I think it is always the case that you will "lose" a large fraction of your reads to a relatively small number of highly expressed genes. It's probably the single biggest reason why we need so many reads to properly cover the transcriptome. This effect can be mitigated somewhat with cDNA capture or other strategies, but that is probably a topic for another post.
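A quick way to see how top-heavy any given library is: compute what fraction of reads the N most highly expressed genes consume. In practice the counts would come from a real count matrix (e.g. featureCounts or HTSeq output); the values below are placeholders.

```python
# Cumulative read fraction consumed by the top-N genes (placeholder counts).
counts = [9_000_000, 4_000_000, 2_500_000, 800_000] + [1_000] * 15_000

counts.sort(reverse=True)
total = sum(counts)
for n in (1, 10, 100, 1000):
    top_fraction = sum(counts[:n]) / total
    print(f"top {n:>4} genes: {top_fraction:.1%} of all reads")
```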