Hello,
I am wondering how big the library for each sample should be to detect meaningful differential expression, please? Is 15M or 25M reads too small?
I found this 2011 paper that discusses the matter, and it seems that when one is mainly interested in protein-coding genes, the depth does not have to be very high: http://genome.cshlp.org/content/21/12/2213.full. But it was written 4 years ago. Is it still true, please?
What is your experience please?
Many thanks
Upvoted for asking the question before doing the sequencing rather than afterwards.
Here are two newer papers for your reference (Sims et al., 2014; Williams et al., 2014). Basically, about 30 million reads per sample may be sufficient for gene-level differential expression analysis. However, it depends entirely on your biological question, experimental design, and data quality. For example, it may take about 80 million mapped reads to identify lowly expressed genes (fewer than 10 FPKM) and more than 200 million paired-end reads to detect novel, rare transcripts. Please see the more detailed discussion in their papers.
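To get a feel for why lowly expressed genes need more depth, here is a rough back-of-the-envelope sketch that inverts the FPKM definition (fragments per kilobase of transcript per million mapped fragments). The gene length and expression level are made-up illustration values, not numbers from either paper, and the calculation ignores effective length and mapping biases. For single-end data, read "fragments" as "reads".

```python
def expected_fragments(fpkm, gene_length_bp, mapped_fragments):
    """Expected fragment count for one gene, obtained by inverting
    FPKM = fragments / (length in kb * mapped fragments in millions)."""
    return fpkm * (gene_length_bp / 1_000) * (mapped_fragments / 1_000_000)


# Hypothetical gene: 2 kb long, expressed at 1 FPKM (illustration only).
for depth in (15_000_000, 30_000_000, 80_000_000):
    n = expected_fragments(fpkm=1.0, gene_length_bp=2_000, mapped_fragments=depth)
    print(f"{depth // 1_000_000}M mapped fragments: about {n:.0f} fragments on this gene")
```

Whether a few dozen fragments on a gene is enough to call differential expression depends on the effect size and the number of replicates, so treat this only as a sanity check on the order of magnitude.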
Based on my experience (mostly gene-level differential expression analysis in chicken using single-end 50 or 75 nt reads), I feel that 20 million reads is acceptable, 30 million is good enough, and 40 million is the most we need. Furthermore, I feel that more biological replicates could be more important than more reads.
Sims et al., 2014: Sims, D., Sudbery, I., Ilott, N.E., Heger, A. & Ponting, C.P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121-132.
Williams et al., 2014: Williams, A.G., Thomas, S., Wyman, S.K. & Holloway, A.K. (2014). RNA-seq data: Challenges in and recommendations for experimental design and analysis. Curr. Protoc. Hum. Genet. 83, 11.13.1-11.13.20.
Hello,
May I ask a stupid question please? I would like to double check. When someone talks about the "size of the library", do they mean the number of forward reads PLUS the number of reverse reads? Or do they mean the number of templates (i.e. the number of forward reads)?
Usually when I've heard people say "size of the library" they are referring to the number of unique molecules that went into the sequencing library.
So if you had two pieces of RNA that you were sequencing, you could PCR the hell out of them and then get 1 million paired-end reads. So you would have 1 million read pairs, 2 million individual reads (mates), and ~2 million lines in your BAM file, but your library size would still be just 2, with a duplication rate just under 100%.
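The arithmetic of that toy example can be sketched as below. The function and its names are made up for illustration. It assumes every unique molecule is sequenced at least once and that the BAM holds exactly one record per mate (no secondary or supplementary alignments).

```python
def library_stats(unique_molecules, read_pairs):
    """Toy bookkeeping for a paired-end run.

    Assumes each unique molecule is observed at least once, so every
    read pair beyond the first per molecule counts as a duplicate.
    """
    mates = 2 * read_pairs                      # forward + reverse reads
    duplicate_pairs = max(read_pairs - unique_molecules, 0)
    return {
        "library_size": unique_molecules,       # unique molecules, not reads
        "read_pairs": read_pairs,
        "bam_records": mates,                   # one line per mate
        "duplication_rate": duplicate_pairs / read_pairs,
    }


stats = library_stats(unique_molecules=2, read_pairs=1_000_000)
print(stats["bam_records"])                # 2000000
print(f"{stats['duplication_rate']:.4%}")  # 99.9998%
```

On real data you would not know the number of unique molecules up front; duplicate-marking tools estimate it from the observed duplicates.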
In our lab we generally do 65-70 million paired-end reads (100 bp, human genome), and so far we are happy with the coverage and data (mainly for quantification and splicing events). For known protein-coding genes (e.g. the Ensembl GTF) this is good enough. But as suggested above, if you are after novel transcripts it's better to sequence ultra-deep.