I am ready to order mRNA sequencing on case and control rats for an experiment measuring gene transcription levels. I'd like to identify differentially expressed (DE) transcripts and correlate mRNA expression levels with microRNA expression levels obtained in a previous experiment. I need to reliably detect expression differences of about 1.5x in brain tissue (amygdala, etc.). I also need to be able to detect splice variants, which I think means I need longer reads and possibly paired-end reads.
Like many researchers, I am limited by cost. In my preliminary studies, I have identified three key variables that I think will affect the cost efficiency (and therefore the statistical power) of my study:
1) I can increase sample size, at the cost of read depth.
2) I can increase read length to get better mapping, but it will cost more.
3) I can opt for paired-end (PE) reads, which cost more than single-end (SE) reads but map better and might help with detecting splice variants.
From reading the literature, I have concluded that for 1) it makes much more sense to sequence a higher number of cases and controls at lower depth.
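To make the sample size vs depth trade-off concrete, here is a rough sketch using a normal approximation on log counts. It assumes a negative binomial model in which the per-sample variance of log(count) is roughly 1/mean + dispersion. The function name, the dispersion of 0.05, the mean counts, and the unadjusted alpha of 0.05 are all illustrative assumptions, not values from my data, and a real analysis would need a multiple-testing correction.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n_per_group, mean_count, dispersion, fold_change=1.5, alpha=0.05):
    """Approximate power of a two-group test on log counts for one gene.

    Assumes var(log count) ~= 1/mean_count + dispersion, where 1/mean_count is
    the sampling (Poisson) part and dispersion is the biological part (CV^2).
    """
    var_log = 1.0 / mean_count + dispersion
    # Standard error of the difference between two group means of log counts.
    se = sqrt(2.0 * var_log / n_per_group)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # One-sided approximation: ignores the negligible opposite-tail rejection.
    return NormalDist().cdf(abs(log(fold_change)) / se - z_crit)


# Same total number of reads, split two ways (hypothetical numbers):
# 3 rats per group at a depth giving a mean count of 200 for this gene,
# or 6 rats per group at half the depth (mean count of 100).
few_deep = approx_power(n_per_group=3, mean_count=200, dispersion=0.05)
many_shallow = approx_power(n_per_group=6, mean_count=100, dispersion=0.05)

print(f"3 per group, deep:    {few_deep:.2f}")
print(f"6 per group, shallow: {many_shallow:.2f}")
```

Under these assumptions the biological term dominates once a gene has a reasonable number of reads, so halving the depth costs little while doubling the replicates shrinks the standard error noticeably. For very lowly expressed genes the 1/mean term takes over and the conclusion can flip.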
However, for 2) and 3), I cannot assess the trade-off between single-end and paired-end reads, or find the read length that maximizes (mean mapping efficiency) x (reads per unit price) = total mapped reads per unit price. A complicating factor I don't understand well is the quality of the rat reference genome, and how it will influence mapping efficiency when choosing read length and SE vs PE.
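The quantity I am trying to maximize can be written down as a small calculation. This is only a sketch: the `SeqConfig` class, the mapping rates, and the prices below are made-up placeholders, to be replaced with a real quote and with mapping rates measured on rat data.

```python
from dataclasses import dataclass


@dataclass
class SeqConfig:
    name: str
    mapping_rate: float       # fraction of reads that map (placeholder values)
    cost_per_million: float   # price per million reads or read pairs (placeholder)

    def mapped_per_dollar(self) -> float:
        # (mapping efficiency) x (reads per unit price)
        return self.mapping_rate * 1_000_000 / self.cost_per_million


# Hypothetical configurations; none of these numbers come from a real quote.
configs = [
    SeqConfig("50SE", mapping_rate=0.88, cost_per_million=10.0),
    SeqConfig("100SE", mapping_rate=0.92, cost_per_million=14.0),
    SeqConfig("100PE", mapping_rate=0.95, cost_per_million=20.0),
]

best = max(configs, key=lambda c: c.mapped_per_dollar())

for c in configs:
    print(f"{c.name}: {c.mapped_per_dollar():,.0f} mapped reads per dollar")
print("Best by this metric:", best.name)
```

This metric only counts mapped reads. It says nothing about the extra splice junction information that longer or paired-end reads carry, so it answers the DE question but not the splice variant one.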
Summary: if I am primarily interested in differential expression analysis, but also want to be able to learn about splice variants, what is the best combination of read depth, read length, and single vs paired ends? Ultimately my goal is to maximize statistical power to detect DE genes, so what is the optimal experimental design?
I recognize there may be no "right" answer to this, but answers from your experience would be helpful to me. Thank you very much in advance.
If cost is the main driver here, I would avoid doing the splice variant analysis. Splice variant calling with Illumina data is very hit and miss and is hugely dependent on the reference having the splice isoforms in the first place. You will need significantly more reads for splice variant analysis, because each splicing event, rather than each gene, needs enough reads to be counted. There are an order of magnitude more splice junctions in rat than there are genes, hence more data are required.
75bp or 100bp PE is the current sweet spot for RNA-seq DGE. Note it must be stranded. We usually do 5 or 6 reps per condition, which gives very robust data for analysis.
Also, I'd recommend against ribodepletion protocols, as you'll have lots of pre-mRNAs in the sample, which may muddy the gene counts.
Thanks for sharing your experience with splice variant analysis. I have zero experience with splice variant work and no intention of doing those analyses myself. Right now the discussion revolves around whether we want to spend a little extra money on longer PE sequencing on the chance that someone in our lab could use this data set in the future to ask those types of questions. I'm honestly still undecided on this front; the difference between 50PE and 100PE for my experiment is ~$1000. We may end up going for 100PE just so we do not regret not having done it.
I also agree with you on not using ribodepletion protocols. My samples are of high quality so we are going to do poly(A) selection.
Thank you for sending the Liu paper. It is pushing me more towards prioritizing sample size over read depth. I've already shared it with several of my colleagues, as I think it will help them better design their own RNA-seq experiments.
The Rat BodyMap is another very interesting paper. It will help me a lot with my own research and plans for data analysis. I believe the rat genome is sufficiently well annotated for the type of DE analyses I plan on doing. For example, in the BodyMap paper they report that they were able to uniquely map 88.5% of their reads using 50SE.