We are about to do our first RNA-seq (ABI SOLiD, mouse mRNA) experiment with our local genome core. They're new to this as well. I have to answer the question, "How many reads do you want per sample?" The core has an estimate from ABI of the percentage of reads likely to map, and has derived an estimate of how many mapped reads to expect per lane. The samples are normal epithelium, in triplicate or quadruplicate, from three distinct strains of mouse, so 9 or 12 individual samples. We will be comparing results to existing microarray results from the same animals.
Are there any well-founded rules of thumb to answer this question? Any advice, either informal or published?
It depends on what you are after (or believe you are after). The larger the number of reads:
- the higher the precision for detecting differential expression between samples (especially for low-abundance transcripts)
- the better variants can be detected and quantified (if the cell population is heterogeneous)
- the more distinct transcripts can be quantified (the more different transcripts there are, the more reads are needed to quantify everything)
Usually the throughput of a machine (Illumina, SOLiD, 454) is given in billions of bases (Gb) sequenced per run or lane. For a fixed throughput in Gb, the longer the reads, the lower the number of reads. Pick a read length you find acceptable, work out how much sequencing you can afford per sample, and start with that.
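To make the throughput arithmetic concrete, here is a rough Python sketch. The function name and all the numbers in the usage lines (a 5 Gb lane, 50 bp reads, 4 samples per lane, 50% of reads mapping) are made-up placeholders, not figures for any real instrument; plug in what your core gives you.

```python
def reads_per_sample(throughput_gb, read_length_bp, samples_per_lane,
                     mapped_fraction=1.0):
    """Rough number of (mapped) reads each sample gets from one lane.

    throughput_gb    -- bases sequenced per lane, in Gb (1 Gb = 1e9 bases)
    read_length_bp   -- length of each read in bases
    samples_per_lane -- how many samples share the lane
    mapped_fraction  -- assumed fraction of reads that map (0..1)
    """
    if read_length_bp <= 0 or samples_per_lane <= 0:
        raise ValueError("read length and samples per lane must be positive")
    # Fixed throughput: longer reads means fewer reads.
    total_reads = throughput_gb * 1e9 / read_length_bp
    return total_reads * mapped_fraction / samples_per_lane


# Hypothetical example values only.
short = reads_per_sample(5, 50, 4, mapped_fraction=0.5)
longer = reads_per_sample(5, 100, 4, mapped_fraction=0.5)
print(f"50 bp reads:  {short:,.0f} mapped reads per sample")
print(f"100 bp reads: {longer:,.0f} mapped reads per sample")
```

Doubling the read length at the same Gb throughput halves the read count, which is the trade-off described above.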
Thanks for the advice; I didn't expect a magic "oh, you'll need 1.72 million reads for that design". I didn't know whether this is still entirely empirical at this point, or whether there is some received wisdom about read depths.
I swear, if someone figured out an equation for this they'd have a very well-cited paper. I'm in the midst of performing RNA-seq analyses, and one of the initial questions we had when we started was the same as yours. I work in yeast, so I sort of just figured out how many 35 bp reads (ABI SOLiD) it would take to cover the whole transcriptome, and then how much depth we would want. It ended up not even mattering, because I had so much rRNA contamination (we couldn't use the RiboMinus kit due to the way we were extracting the RNA) that we ended up with only 5% of what we expected. That was actually enough to get some useful data; however, we are repeating the experiment to get more reads so we can be sure that what we found was real. I would read lots of papers similar to what you want to do in order to get a rough estimate. Also, if you know of certain genes that are supposed to be up- or down-regulated, they would serve as nice positive/negative controls.
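For what it's worth, the back-of-the-envelope estimate described above can be sketched in Python roughly like this. The function name, the 9 Mb transcriptome size, and the 10x depth are placeholders I made up for illustration; only the 35 bp read length and the 5% usable fraction come from the experience described above. It also assumes reads are spread evenly over the transcriptome, which real expression data do not do, so treat the result as a crude starting point at best.

```python
import math


def reads_for_coverage(transcriptome_bp, depth, read_length_bp,
                       usable_fraction=1.0):
    """Reads to sequence for a target average depth over the transcriptome.

    Assumes uniform coverage (a crude simplification).
    usable_fraction -- assumed share of reads that are not lost to
                       rRNA contamination, failed mapping, etc. (0..1]
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    bases_needed = transcriptome_bp * depth
    return math.ceil(bases_needed / (read_length_bp * usable_fraction))


# Placeholder transcriptome size and depth; 35 bp reads as in the post.
ideal = reads_for_coverage(9_000_000, 10, 35)
# Same target if only 5% of the reads turn out to be usable.
contaminated = reads_for_coverage(9_000_000, 10, 35, usable_fraction=0.05)
print(f"ideal: {ideal:,} reads; with 5% usable: {contaminated:,} reads")
```

The second number shows why heavy rRNA contamination can swamp any careful planning: at 5% usable reads you would need 20 times as many raw reads for the same depth.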
I mentioned I work in yeast; however, I have come across mouse RNA-seq papers that definitely didn't use the method I used to make an estimate, because they used maybe twice the reads I used, when the mouse transcriptome should be much more than 2x larger. Also, you say you want to compare it to a microarray data set. If you're referring to an expression array and not a tiling array, you may need only minimal depth to know whether a gene is on or off. Sometimes papers use the absolute maximum number of reads because they want to know where each transcript starts and stops, but you may not need that type of coverage.
I think it's really hard to say how many reads you need. Ribosomal RNA contamination can really affect an RNA-seq experiment.