Question

How Much RNA-seq Data Do You "Throw" Away?

3

14.3 years ago

Pfs ▴ 580

I was told that I can safely ignore all but one of the reads that map to exactly the same genomic location, i.e. if I have 100 reads all mapping at pos x with the mate at pos y, only one should be counted, since the rest are just the result of amplification.

I am not sure why that is.

Can anybody explain this?

Sorry if I don't have more specific details.

rna • 2.4k views

updated 14.3 years ago by seidel 11k • written 14.3 years ago by Pfs ▴ 580

score 7 · Answer 1 · 2011-04-10

Every DNA sequence has a slightly different amplification efficiency. So it's possible for some molecules to become over-represented because of that bias, rather than because the read count actually reflects the biological concentration of that sequence relative to other sequences. On the other hand, shearing of DNA is also non-random, so it's not completely unexpected that many reads could start at a given pos x, especially if you have a lot of depth. However, with paired-end data you would expect those reads to have different mate positions y. If both x and y are the same, that's a sign of amplification bias, or perhaps of a low-complexity library.
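To make the "same x and same y" criterion concrete, here is a minimal sketch in plain Python. It assumes the alignments have already been reduced to (chromosome, start, mate start, strand) tuples; the `Pair` type and both function names are made up for illustration and do not come from any particular tool.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    """One aligned read pair (hypothetical record type for this sketch)."""
    chrom: str
    start: int       # pos x of the read
    mate_start: int  # pos y of the mate
    strand: str


def collapse_duplicates(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first pair seen for each (chrom, x, y, strand) combination."""
    seen = set()
    kept = []
    for p in pairs:
        if p not in seen:
            seen.add(p)
            kept.append(p)
    return kept


def duplication_rate(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs that would be discarded as duplicates."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return 1 - len(set(pairs)) / len(pairs)


# 100 pairs sharing both x and y, plus two pairs with the same x but other y.
example = [Pair("chr1", 100, 300, "+")] * 100 + [
    Pair("chr1", 100, 320, "+"),
    Pair("chr1", 100, 350, "+"),
]
kept = collapse_duplicates(example)
```

In this toy input the 100 identical pairs collapse to one, while the two pairs that share only x are kept, which is the distinction drawn above.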

Conceptually you might consider that in sampling a good library you're actually sampling the concentration of shear events at positions in the genome. High abundance molecules will provide a higher density of shear events (or fragmentation events) at a given locus than low abundance molecules, because there are more molecules in which a break can occur. If you are reading the same event more than once from a locus, it's a sign either that the locus is over-sampled (i.e. really deep sequencing) or that it's sampled in a biased way.

It's like breaking spaghetti. Break 3 pieces of spaghetti and count break points, then break 100 pieces of spaghetti and count break points. If someone had you estimate spaghetti abundance this way, you'd be suspicious if many of the fragments from the little 3 piece pile were all the same size (i.e. like something was fudging the spaghetti data).
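The spaghetti picture can be played with in a few lines of Python. This is only a toy sketch under loud assumptions: each molecule yields one fragment with uniformly random end points, and "amplification" is modeled as copying every fragment 30 times. The function names, the molecule length, and the copy number are all invented for illustration.

```python
import random


def simulate_fragments(n_molecules, length=1000, seed=0):
    """Break each molecule once at random; return one (x, y) fragment each."""
    rng = random.Random(seed)
    frags = []
    for _ in range(n_molecules):
        x = rng.randrange(0, length - 1)  # left break point
        y = rng.randrange(x + 1, length)  # right break point, always > x
        frags.append((x, y))
    return frags


def duplicate_fraction(frags):
    """Fraction of fragments that repeat an (x, y) already present."""
    return 1 - len(set(frags)) / len(frags)


small_pile = simulate_fragments(3)         # low abundance: 3 pieces
big_pile = simulate_fragments(100)         # high abundance: 100 pieces
amplified = small_pile * 30                # 3 pieces, each copied 30 times

frac_big = duplicate_fraction(big_pile)
frac_amplified = duplicate_fraction(amplified)
```

The amplified pile has almost as many fragments as the big pile, but nearly all of them are exact repeats, whereas independent breaks in the big pile rarely coincide. That excess of identical fragments is the "fudged spaghetti" signal.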

score 3 · Answer 2 · 2011-04-10

There are many types of artifacts when doing next-gen sequencing, and one of these is that a small number of reads will often get amplified hundreds or thousands of times before sequencing. If you rely on these reads for expression calling in RNAseq (or copy-number/allele frequency calling in DNA), you'll get a grossly skewed view of how abundant that particular sequence is.

Given the enormous number of potential start positions for a read, the assumption we make is that it's very unlikely that two independent reads will start at exactly the same genomic position. This isn't a perfect solution: if you have very deep sequencing, you'll undoubtedly end up throwing away a few reads that aren't PCR artifacts. This is a much smaller source of error than the amplification artifacts would be, though.
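How many genuine reads get thrown away by this assumption can be estimated with a birthday-problem calculation. The sketch below assumes read starts are spread uniformly over the available positions, which is a simplification (real RNA-seq coverage is far from uniform, so highly expressed transcripts will lose more); the function name and the example numbers are illustrative only.

```python
import math


def expected_chance_duplicates(n_reads, n_positions):
    """Expected number of reads that share a start position purely by chance.

    Assumes each read independently picks one of n_positions uniformly.
    Expected distinct positions hit = M * (1 - (1 - 1/M)**N); every read
    beyond those would be discarded as a "duplicate".
    """
    if n_positions < 2:
        raise ValueError("need at least two possible start positions")
    # log1p/expm1 keep the result accurate when 1/M is tiny.
    distinct = -n_positions * math.expm1(n_reads * math.log1p(-1.0 / n_positions))
    return n_reads - distinct


# Illustrative numbers: 1e8 possible starts, moderate vs. deep sequencing.
moderate = expected_chance_duplicates(10**6, 10**8)
deep = expected_chance_duplicates(10**7, 10**8)
```

For small N relative to M the result is close to N²/(2M), so the number of falsely discarded reads grows roughly with the square of the depth, which is why the loss only becomes noticeable for very deep sequencing.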