Question

RNA-seq: how big should the library be?

8

10.3 years ago

Aurelie MLB ▴ 360

Hello,

I am wondering how big the library for each sample should be to detect meaningful differential expression. Is 15M or 25M reads too small?

I found this 2011 paper that discusses the matter, and it seems that when one is mainly interested in protein-coding genes, the depth does not have to be very high: http://genome.cshlp.org/content/21/12/2213.full. But it was written 4 years ago. Is it still true?

What is your experience please?

Many thanks

RNA-Seq • 4.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Aurelie MLB ▴ 360

4

Upvoted for asking the question before doing the sequencing rather than afterwards.

ADD REPLY • link 10.3 years ago by Daniel ★ 4.0k

1

Here are two newer papers for your reference (Sims et al., 2014; Williams et al., 2014). Basically, about 30 million reads per sample may be sufficient for gene-level differential expression analysis. However, it depends entirely on your biological question, experimental design, and data quality. For example, you could need about 80 million mapped reads to identify lowly expressed genes (fewer than 10 FPKM) and more than 200 million paired-end reads to detect novel, rare transcripts. Please see the more detailed discussion in their papers.
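To see why weakly expressed genes drive the depth requirement, the FPKM definition can be inverted to estimate how many fragments a gene would receive at a given depth. This is only a rough sketch: the function name, the 2 kb transcript length, and the 1 FPKM expression level are illustrative assumptions, not values taken from the papers cited.

```python
def expected_fragments(fpkm, transcript_length_bp, mapped_fragments):
    """Invert FPKM = fragments / (length in kb * mapped fragments in millions)."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = mapped_fragments / 1_000_000
    return fpkm * length_kb * mapped_millions


# Hypothetical gene: a 2 kb transcript expressed at 1 FPKM.
for depth in (15_000_000, 30_000_000, 80_000_000):
    n = expected_fragments(fpkm=1.0, transcript_length_bp=2_000, mapped_fragments=depth)
    print(f"{depth / 1e6:.0f}M mapped fragments -> about {n:.0f} fragments on this gene")
```

Under these assumptions the gene collects only a few dozen fragments at 15M to 30M, which is why depth recommendations rise sharply once low-expression genes or rare transcripts are the target.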

Based on my experience (mostly chicken gene-level differential expression analysis using single-end 50 or 75 nt reads), I feel that 20 million reads is acceptable, 30 million reads is good enough, and 40 million reads is the most we need. Furthermore, I feel that more biological replicates could be more important than more reads.

Sims et al., 2014: Sims, D., Sudbery, I., Ilott, N.E., Heger, A. & Ponting, C.P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121-132.

Williams et al., 2014: Williams, A.G., Thomas, S., Wyman, S.K. & Holloway, A.K. (2014). RNA-seq data: challenges in and recommendations for experimental design and analysis. Curr. Protoc. Hum. Genet. 83, 11.13.1-11.13.20.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.3 years ago by Gary ▴ 480

0

Hello,

May I ask a stupid question please? I would like to double check. When someone talks about the "size of the library", do they mean the number of forward reads PLUS the number of reverse reads? Or do they mean the number of templates (i.e. the number of forward reads)?

ADD REPLY • link 9.4 years ago by Aurelie MLB ▴ 360

0

Usually when I've heard people say "size of the library" they are referring to the number of unique molecules that went into the sequencing library.

So if you had two pieces of RNA that you were sequencing, you could PCR the hell out of them and then get 1 million paired-end reads. So you would have 1 million reads and 2 million read fragments and ~2 million lines in your BAM file, but your library size would still be just 2, with a duplication rate just under 100%.
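The arithmetic behind that duplication rate can be sketched as follows. The function name is hypothetical, and the calculation assumes every unique molecule is observed at least once, so that all remaining reads count as duplicates.

```python
def duplication_rate(total_reads, unique_molecules):
    """Fraction of reads that are copies of an already-observed molecule.

    Assumes each unique molecule is sequenced at least once.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    observed_unique = min(unique_molecules, total_reads)
    return 1 - observed_unique / total_reads


# The example from the comment: 2 molecules amplified into 1 million reads.
rate = duplication_rate(total_reads=1_000_000, unique_molecules=2)
print(f"duplication rate: {rate:.4%}")
```

With a library of only 2 molecules, adding more reads changes nothing about the information content; the rate just creeps closer to 100%.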

ADD REPLY • link 9.3 years ago by Michele Busby ★ 2.2k

0

In our lab we generally do 65-70 million paired-end reads (100 bp, human genome) and so far we are happy with the coverage and data (mainly for quantification and splicing events). For known protein-coding genes (e.g. the Ensembl GTF) this is good enough. But as suggested above, if you are after novel transcripts it's better to go ultra deep.

ADD REPLY • link 10.3 years ago by poisonAlien ★ 3.2k

Ram · Answer 1 · 2015-04-30

3

10.3 years ago

Ryan Dale 5.0k

See section III.2 of the ENCODE project's RNA standards, from the ENCODE experiment guidelines page. Minimum depth depends on a number of things (genome, read length, goal of the experiment, experimental protocols like depleting ribosomal RNA, etc), but they provide some useful rules-of-thumb.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ryan Dale 5.0k

Ram · Answer 2 · 2015-05-01

There is no general rule except that you will always get more power from using more replicates than from using that same number of reads to sequence the existing samples more deeply. This is because sequencing deeper only reduces the problem of counting noise. An increased number of biological replicates reduces the problem of counting noise and biological variance. Then it becomes an optimization problem based on how much you want to spend.
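A rough way to see this numerically, assuming counts follow a negative binomial model (a common choice for RNA-seq, but an assumption here, not something stated above): the squared coefficient of variation of one sample's count is 1/mean + dispersion, where 1/mean is the counting noise and the dispersion is the biological variance. Averaging n replicates divides the whole sum by n, while sequencing deeper only shrinks the 1/mean term. The function name and the numbers (mean count 100, dispersion 0.1, 3 replicates) are illustrative.

```python
def squared_cv_of_group_mean(mean_count, dispersion, n_replicates):
    """Squared CV of the average count over replicates (negative binomial assumption).

    1 / mean_count is the counting-noise term; dispersion is the biological term.
    """
    return (1.0 / mean_count + dispersion) / n_replicates


baseline = squared_cv_of_group_mean(mean_count=100, dispersion=0.1, n_replicates=3)
# Spend the extra reads on depth: same 3 samples, twice the reads each.
deeper = squared_cv_of_group_mean(mean_count=200, dispersion=0.1, n_replicates=3)
# Spend the same extra reads on replicates: 6 samples at the original depth.
more_reps = squared_cv_of_group_mean(mean_count=100, dispersion=0.1, n_replicates=6)

print(f"baseline:        {baseline:.4f}")
print(f"2x depth:        {deeper:.4f}")
print(f"2x replicates:   {more_reps:.4f}")
```

In this toy setting doubling the depth barely moves the noise, because the biological term dominates, while doubling the replicates halves it for the same number of additional reads.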

We have a tool here and an accompanying blog here.

Ram · Answer 3 · 2015-04-30

1

10.3 years ago

Antonio R. Franco ★ 5.2k

It mainly depends on the size of your genome. Note that 15M reads would not mean the same thing for a bacterial genome as for a human genome.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Antonio R. Franco ★ 5.2k

0

Thank you! This is a good point. I had the human genome in mind.

ADD REPLY • link 10.3 years ago by Aurelie MLB ▴ 360