I am a fledgling PhD student trying to design some KO vs WT RNA-seq experiments (initially just coding, with the potential to look for novel splicing, low-abundance and non-coding transcripts in the future).
I have been using a couple of power calculation tools to help inform the trade-off between depth and biological replicates (Scotty (Busby et al., 2013) and RNASeqPower (Hart et al., 2014)) and reading as much literature on the topic as I can. Much of the focus is on poly-A enriched libraries, so the sequencing depth guidance assumes that approach (as far as I could see, Scotty doesn't specify either poly-A or total RNA-seq).
Does anyone know of a good way to scale power calculations based on poly-A enriched experiments to total RNA-seq (rRNA depleted)? i.e. what is the read-depth required for a total-RNA library to ensure min 7 reads per coding-gene?
Cheers
In my opinion, running small-scale pilot studies is the way to go. Theoretical assumptions take you only so far.
As to the requirement of having 7 reads per coding gene - it is not clear what you mean there. Some genes will not express at all.
And finally a personal opinion: don't fall into the trap of assuming that whatever you compute by a formula will match the observed biological phenomena. Natural biological variation can be far more substantial than expected, a problem exacerbated by having very little data (low abundance). Biological replication is key.
But like I said before, run pilot studies, several of them, and that will help you home in on the correct parameters.
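To illustrate why replication tends to matter more than depth, here is a minimal sketch of the sample-size formula described for RNASeqPower, n = 2 (z_(1-alpha/2) + z_beta)^2 (1/depth + cv^2) / ln(effect)^2. The function name and the example values (a biological CV of 0.4, a 2-fold effect) are assumptions chosen for illustration only; the CV in particular should be estimated from pilot data rather than taken from here.

```python
from math import ceil, log
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-group comparison.

    depth  : expected reads for the gene of interest (not total library size)
    cv     : biological coefficient of variation (ASSUMED; estimate from pilot data)
    effect : fold change to detect, e.g. 2.0
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # desired power
    return 2 * (z_alpha + z_beta) ** 2 * (1 / depth + cv ** 2) / log(effect) ** 2


# Hypothetical scenario: CV of 0.4 and a 2-fold change, at several per-gene depths.
for depth in (7, 20, 100, 1000):
    n = samples_per_group(depth=depth, cv=0.4, effect=2.0)
    print(f"reads per gene = {depth:>4}: about {ceil(n)} samples per group")
```

Because the depth term enters as 1/depth, its contribution shrinks quickly, while the cv^2 term sets a floor that only more replicates can address. The formula is only as good as the CV fed into it, which is another reason to rely on pilot data.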
Thanks Istvan. Yeah, we have pilot data for n=8 samples (4 WT, 4 KO) sequenced at ~20 million reads that I have been using with the modelling tools. You're right, I don't think these algorithms are truly informative without it.
Yeah, I appreciate some genes won't be expressed, and I definitely didn't word the question clearly. Perhaps this will make it clearer. Tools like the one described by Hart et al. estimate that 85-97% of genes will receive at least 0.1 reads per million reads sequenced. But this estimate is for poly-A enriched sequencing, where reads map to a very small proportion of the genome (coding sequence is ~2% of the genome, right?). I was wondering whether similar calculations have been made for total RNA libraries (rRNA depleted), as these reads will map all over the genome, so you would need more reads to get the same level of coverage over any region of interest. Is there a rule of thumb to scale depth when moving from poly-A enriched to total RNA?
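The scaling being asked about can be framed as simple arithmetic. The sketch below is only that, a sketch: the function names are made up for illustration, and the two "usable fraction" values (the share of reads landing in annotated coding exons for each library type) are hypothetical placeholders, not published figures. They would need to be measured, for example from pilot libraries of each type.

```python
def depth_for_min_reads(min_reads, reads_per_million):
    """Library size needed for a gene at `reads_per_million` to reach `min_reads`."""
    return min_reads / reads_per_million * 1_000_000


def scale_to_total_rna(polya_depth, usable_frac_polya, usable_frac_total):
    """Scale a poly-A depth so a total RNA library yields the same exonic read count.

    Both fractions are ASSUMED inputs: the proportion of reads that map to
    the regions of interest (e.g. coding exons) in each library type.
    """
    return polya_depth * usable_frac_polya / usable_frac_total


# A gene at 0.1 reads per million needs this many reads to reach 7 reads.
polya_depth = depth_for_min_reads(min_reads=7, reads_per_million=0.1)

# HYPOTHETICAL fractions, for illustration only: replace with measured values.
total_depth = scale_to_total_rna(
    polya_depth, usable_frac_polya=0.8, usable_frac_total=0.4
)

print(f"poly-A depth:    {polya_depth / 1e6:.0f} million reads")
print(f"total RNA depth: {total_depth / 1e6:.0f} million reads")
```

The point of writing it this way is that the scale factor is just the ratio of usable read fractions between the two library types, so whatever rule of thumb exists would amount to an estimate of that ratio.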