Question

What Does The Library Size In A Short Read Experiment Depend On?

0

13.1 years ago

Untom ▴ 420

Hi there!

I was wondering what the total library size (i.e. the total number of reads) in a short-read experiment depends on. Looking at various datasets in the SRA, I noticed that different submissions had vastly different numbers of reads per lane. What determines this number? How influential are the following factors:

The version of the machine (older machine => fewer reads)
The time the machine is left running (longer runs => more reads)
The total amount of RNA put into the lane
The length of the reads (if I want longer reads, I'll get fewer reads)

short read coverage • 2.2k views

updated 11.4 years ago by Biostar 20 • written 13.1 years ago by Untom ▴ 420

score 2 · Answer 1 · 2012-02-29

Many factors determine the total number of reads generated, and which ones matter depends on the sequencer you are using. Note that the number of usable reads is more important than the total number of reads.

Version of the machine: Older machines might not be able to handle as high a density of beads/clusters on the chip, because of limits in either the optics or the signal intensity. This can reduce the number of reads.
Time the machine is left running: This should not influence the number of reads. You aim for a mono-layer of beads/clusters on the chip so that the camera can pick up the signal, and that layer sets the read count. Run time is only a factor in the desired read length. Most current-generation sequencers have a length limit beyond which quality degrades.
Total amount of RNA put into the lane: RNA is not put into the lane. What goes in is fragmented cDNA with adapters, made from the RNA. Depending on your library prep (poly-A enrichment, ribo-depletion), you could end up with a less-than-optimal amount of sequence in the lane.
Length of the reads: This goes back to the question about how long the machine is left running. Your desired read length should not affect the number of reads produced.
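To make the distinction between read count and read length concrete, here is a minimal sketch of the arithmetic for a single-end run. It assumes the simplified model described above: each cluster that passes the instrument's filter yields one read, and the number of cycles only sets the read length. The function names and the numbers are hypothetical, chosen purely for illustration, and are not taken from any instrument specification.

```python
def lane_yield(clusters_passing_filter: int, cycles: int) -> dict:
    """Estimate reads and bases for one single-end lane.

    Assumption: one read per cluster that passes filter, and one base
    per cycle, so run length changes bases but not the read count.
    """
    reads = clusters_passing_filter
    bases = reads * cycles
    return {"reads": reads, "bases": bases}


def usable_fraction(total_reads: int, usable_reads: int) -> float:
    """Fraction of reads that are usable downstream (e.g. after QC or mapping)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return usable_reads / total_reads


if __name__ == "__main__":
    # Hypothetical lane: same cluster count, two different run lengths.
    short_run = lane_yield(clusters_passing_filter=30_000_000, cycles=50)
    long_run = lane_yield(clusters_passing_filter=30_000_000, cycles=100)
    print(short_run)
    print(long_run)
    # Hypothetical usable-read count for the same lane.
    print(usable_fraction(total_reads=30_000_000, usable_reads=24_000_000))
```

Under this model, doubling the cycles doubles the number of bases but leaves the number of reads unchanged, which is why the usable fraction is often the more informative number when comparing lanes.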

score 1 · Answer 2 · 2012-02-29

Throughput depends mainly on the type of technology (e.g., Illumina vs. 454). Yes, the throughput of older machines is generally lower, though it depends on the specific application. For example, the output of a new MiSeq is lower than that of a HiSeq 2000, but the reads are longer. There can also be stochastic variation in the amount of data from a run, due to problems with library construction (amount and quality of DNA) or with the operation of the instrument. The sequencing center really should be able to minimize the issues with the libraries, though there can be issues with the sequencers from time to time (I've seen quite a bit of variation between 454 runs).