Dear All,
We are going to do RNA-Sequencing using Illumina HiSeq for 200 samples. Given that the sample size is fixed, and the budget is fixed, the following 3 options were proposed.
- 50bp paired-end reads, sequencing one sample per lane --> we will get ~100 million reads per sample
- 75bp paired-end reads, sequencing two samples per lane --> we will get ~50-60 million reads per sample
- 100bp paired-end reads, sequencing four samples per lane --> we will get ~30-40 million reads per sample
Based on your experience, which option is best, or do you have other suggestions? We would like to do several kinds of analysis on these data, e.g., novel transcripts, lncRNA, splicing, SNPs, etc. You name it. If we have to rank them by priority (from high to low), I would say "novel transcripts, long non-coding RNAs, splicing, and differential expression".
Currently, the majority of labs sequence 100bp paired-end, right? But I was told that even if you sequence 100bp, the sequencing quality after 75bp is very bad because of the sequencer itself; that is, it has nothing to do with the RNA quality of the samples. If this is true, why is the 100bp read length becoming more popular now?
Many thanks, Shirley
Well, I can't give you the exact reasons, but I would prefer option 2, i.e. 75 bp and 50-60 million reads per sample.
If you are interested in differential expression, I would suggest doing replicates for each sample. It will increase the cost, but maybe you could reduce the number of reads per replicate. E.g., with your first option, you could have ~35 million reads for each replicate (considering 3 replicates per sample).
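To make the replicate arithmetic concrete, here is a small Python sketch. It takes the approximate per-sample read counts quoted in the three options and divides them evenly across replicates. The values 55 and 35 are midpoints of the quoted ranges (an assumption), and the names `OPTIONS`, `reads_per_replicate` and `replicate_table` are made up for illustration.

```python
# Approximate reads per sample (millions) for the three proposed options.
# 55 and 35 are midpoints of the quoted ranges (an assumption for illustration).
OPTIONS = {"50bp PE": 100, "75bp PE": 55, "100bp PE": 35}


def reads_per_replicate(reads_per_sample_m, n_replicates):
    """Split one sample's read budget evenly across its replicates."""
    if n_replicates < 1:
        raise ValueError("n_replicates must be at least 1")
    return reads_per_sample_m / n_replicates


def replicate_table(n_replicates):
    """Reads per replicate (millions, rounded to 1 decimal) for each option."""
    return {name: round(reads_per_replicate(m, n_replicates), 1)
            for name, m in OPTIONS.items()}


if __name__ == "__main__":
    for name, m in replicate_table(3).items():
        print(f"{name}: ~{m} million reads per replicate")
```

This only divides the reads; it ignores the extra library preps that replicates would require, which also come out of the fixed budget.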
Vikas - are you referring to technical replicates or biological replicates? The latter is better, I expect, but I've heard a lot of banter on the social wires / blogs lately regarding "more replicates for RNA-Seq", and when probed further some seem to mean that more technical replicates (n>3) of the same biological replicate are needed for RNA-Seq. This reminds me of the lessons learned years ago with microarrays, but in that case biological replicates were always preferred. Would RNA-seq be any different?
Thoughts?
Hi Shirley and Jonathan,
If I were you I would maybe run 190 samples and use the leftover money to run some pilot data to answer that question.
We worked on the differential expression problem in our paper on power analysis for RNA-Seq (http://euler.bc.edu/marthlab/scotty/scotty.php), but that only focused on differential expression. We did not look at read length. It is an interesting question. For differential gene expression I would expect (without having done the analysis) that a larger number of shorter reads will give more information, because the reads will align fairly uniquely even at 50 bases and you will detect more rare transcripts with lower counting noise. However, there comes a point in detecting differential expression where you will have sequenced enough to have quantified all of your genes pretty well (with ~10 reads), and sequencing the same sample deeper is a waste of money. That point varies by species and by sequencing protocol, so for determining it we recommend using pilot data. You can run an analysis through Scotty if you have the pilot data. Scotty expects replicates, but if you don't have replicates you can just run a rarefaction curve to see at what depth in your samples you get 10 reads per gene. If you need help, email us.
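For anyone who wants to try the rarefaction idea by hand, here is a minimal Python sketch. It is not Scotty's implementation, just an illustration: it assumes you already have a per-gene read count table from a pilot sample, subsamples reads without replacement at several depths, and counts how many genes still reach 10 reads. The function names and the toy counts are hypothetical.

```python
import random
from collections import Counter


def genes_detected(counts, depth, min_reads=10, seed=0):
    """Subsample `depth` reads without replacement from a gene count table
    and return how many genes have at least `min_reads` reads."""
    rng = random.Random(seed)
    # Expand the table into one label per read (fine for a sketch,
    # memory-hungry for a real library with tens of millions of reads).
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    sub = Counter(rng.sample(pool, depth))
    return sum(1 for n in sub.values() if n >= min_reads)


def rarefaction_curve(counts, depths, min_reads=10, seed=0):
    """Return (depth, genes detected) pairs, skipping depths larger
    than the number of reads actually available."""
    total = sum(counts.values())
    return [(d, genes_detected(counts, d, min_reads, seed))
            for d in depths if d <= total]


if __name__ == "__main__":
    # Toy pilot counts (made up): a few abundant genes and many rare ones.
    toy = {f"gene{i}": n for i, n in enumerate([5000, 800, 120, 40, 15, 6, 2])}
    for depth, n_genes in rarefaction_curve(toy, [500, 1000, 3000, 5983]):
        print(depth, n_genes)
```

Where the curve flattens is roughly where deeper sequencing of that sample stops adding well-quantified genes.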
Regarding replicates, as a general rule you get more statistical power for differential expression by dividing a fixed number of reads into as many biological replicates as possible. Think of a million biological replicates with one read each as the ideal way to spend a million reads. But then you add the cost of a million library preps.
Daniel is right of course. Adding another technical replicate improves power by reducing your uncertainty about the true expression level, but only works to reduce uncertainty that is due to technical noise. Most of the noise is usually biological (unless you have difficult to sequence samples, or other special conditions). So adding another biological replicate reduces both technical and biological uncertainty, and is generally more helpful. If you are doing something like 200 cell lines it would be awesome to do at least 2 biological replicates of each cell line. No one ever does two replicates and it would be so much more useful data because you would be less likely to mistake systematic biological noise for a real effect.
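The replicate argument can be sketched with a deliberately simplified noise model (an assumption for illustration, not the model used in Scotty). For one gene in one condition, the squared coefficient of variation of the estimated mean is taken to be biological variance divided by the number of biological replicates, plus library-level technical variance divided by the number of libraries, plus a Poisson counting term that depends only on the total reads spent. The function name and the example numbers are hypothetical.

```python
def relative_variance(total_reads, gene_fraction, bio_cv, tech_cv,
                      n_bio, n_tech=1):
    """Approximate squared CV of a gene's mean expression estimate.

    total_reads   : reads spent on this condition, shared equally by
                    n_bio * n_tech libraries
    gene_fraction : fraction of reads that come from the gene
    bio_cv        : biological coefficient of variation between individuals
    tech_cv       : library-to-library technical coefficient of variation
    """
    if n_bio < 1 or n_tech < 1:
        raise ValueError("replicate counts must be at least 1")
    biological = bio_cv ** 2 / n_bio             # shrinks only with biological replicates
    technical = tech_cv ** 2 / (n_bio * n_tech)  # shrinks with every extra library
    counting = 1.0 / (gene_fraction * total_reads)  # fixed by the total read budget
    return biological + technical + counting


if __name__ == "__main__":
    budget = dict(total_reads=1_000_000, gene_fraction=1e-4,
                  bio_cv=0.4, tech_cv=0.1)
    print("1 bio x 1 tech:", relative_variance(n_bio=1, n_tech=1, **budget))
    print("1 bio x 4 tech:", relative_variance(n_bio=1, n_tech=4, **budget))
    print("4 bio x 1 tech:", relative_variance(n_bio=4, n_tech=1, **budget))
```

Under these assumptions the counting term is the same however the reads are divided, so splitting a fixed budget over more biological replicates lowers the total, while extra technical replicates only reduce the (usually smaller) technical term. The model leaves out the cost of the extra library preps.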
The question of read count versus read length gets more complicated when you want to detect differential expression at the transcript level, novel transcripts, lncRNA, splicing, etc. In those cases you want to bridge junctions, and longer reads are better at that. Then the question becomes where the trade-off is between read length and read number. Insert size will play a role there too.
If you can, I would try to address all of this empirically with pilot data. Perhaps you can find a data set where someone has made a big library with 100 bp paired-end reads, and try some assemblies (or whatever) with different permutations of the data. That is, try an assembly with a subset of the long reads and see how that goes, and then use more reads but trim them shorter and see how that goes. It would be better to run the experiment on your own data, with a library of each type, because there may be other artifacts in there, and the answer may be species-specific. You could probably even get a small methods paper out of the work, and you have to set up your analysis pipeline anyway.
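As a rough illustration of the "fewer long reads versus more trimmed reads" comparison, here is a minimal Python sketch. It assumes the read pairs are already loaded in memory as `((name, seq, qual), (name, seq, qual))` tuples; the function names and the toy reads are hypothetical, and a real run would stream FASTQ files and feed the results to whatever aligner or assembler you use.

```python
import random


def subsample_pairs(pairs, n, seed=0):
    """Pick n read pairs at random; mates stay together because each
    element of `pairs` holds both reads of a pair."""
    rng = random.Random(seed)
    return rng.sample(pairs, n)


def trim_read(record, length):
    """Keep only the first `length` bases (and qualities) of one read."""
    name, seq, qual = record
    return (name, seq[:length], qual[:length])


def trim_pairs(pairs, length):
    """Trim both mates of every pair to `length` bases."""
    return [(trim_read(r1, length), trim_read(r2, length)) for r1, r2 in pairs]


if __name__ == "__main__":
    # Toy 100 bp pairs (made up) standing in for a deep 100 bp PE library.
    library = [((f"r{i}/1", "ACGT" * 25, "I" * 100),
                (f"r{i}/2", "TGCA" * 25, "I" * 100)) for i in range(10)]
    long_design = subsample_pairs(library, 4)                  # fewer, full-length reads
    short_design = trim_pairs(subsample_pairs(library, 8), 50)  # more, shorter reads
    print(len(long_design), len(long_design[0][0][1]))
    print(len(short_design), len(short_design[0][0][1]))
```

Trimmed 100 bp reads are not identical to reads from a genuine 50 bp run, which is one reason a real library of each type is the better experiment.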
That's my opinion, anyway.
Good luck!
Michele
Dear Michele,
We are working on 200 human samples collected from patients.
Thanks a lot for your detailed explanation and great suggestions. I will try Scotty and let you know if I need help :) I like your idea of either running some pilot data or finding an available library with 100bp PE reads to answer my question. I am working on this now!
Thank you all for your great suggestions. I really appreciate it. Shirley
Hi Shirley,
If Scotty chokes going up to that many samples let me know. I think with that many samples you will be able to detect very small fold changes if you don't have any batch effects or similar. I may need to do some reprogramming to get it to handle that many samples but I'm happy to do it.
Thanks, Michele
RNA-Seq should have low technical variability. Biological replicates would be preferred. Reduce the chances of lane bias artifacts by making sure you index and split your samples across multiple lanes. Your laboratory handling of the samples is more likely to introduce bias.
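One way to lay out indexed samples so that lane does not line up with a biological group is sketched below in Python. It is only an illustration: the group labels (e.g. case/control), the function name `assign_pools` and the sample IDs are hypothetical. Samples are shuffled within each group and dealt round-robin into pools, and each indexed pool can then be loaded on several lanes.

```python
import random
from collections import defaultdict


def assign_pools(sample_groups, n_pools, seed=0):
    """Deal samples into n_pools so each group is spread across pools.

    sample_groups : dict mapping sample ID -> group label
    Returns a list of pools, each a list of sample IDs.
    """
    if n_pools < 1:
        raise ValueError("n_pools must be at least 1")
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in sorted(sample_groups.items()):
        by_group[group].append(sample)
    pools = [[] for _ in range(n_pools)]
    i = 0  # running counter keeps pool sizes within one sample of each other
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)  # random order within the group
        for sample in members:
            pools[i % n_pools].append(sample)
            i += 1
    return pools


if __name__ == "__main__":
    samples = {f"S{i:02d}": ("case" if i < 8 else "control") for i in range(16)}
    for k, pool in enumerate(assign_pools(samples, n_pools=2), start=1):
        print(f"pool {k}: {pool}")
```

For example, a pool of 8 indexed samples run on 2 lanes gives about the same depth per sample as 4 samples per lane, while every sample sees both lanes.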