Question

Duplicated Reads in RNA-Seq Experiment

11

13.4 years ago

Steffi ▴ 590

What percentage of reads would you expect to be exactly identical in an RNA-seq experiment? I just came across an experiment where, out of 100 x 10^6 reads, only 60 x 10^6 are unique.
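To make the numbers concrete, here is a minimal sketch of how the duplicate fraction could be computed, assuming "unique" means distinct sequences and that a paired-end read counts as a duplicate only when both mates match exactly. The function name and the toy sequences are made up for illustration; with 60 x 10^6 distinct reads out of 100 x 10^6, the same arithmetic gives 40% duplicates.

```python
from collections import Counter

def duplicate_fraction(reads):
    """Return the fraction of reads that are exact copies of another read.

    `reads` is any iterable of hashable items: a sequence string for
    single-end data, or a (mate1, mate2) tuple for paired-end data.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    distinct = len(counts)
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    return (total - distinct) / total

# Toy paired-end example (made-up sequences): 5 pairs, 3 distinct.
pairs = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),
    ("ACGT", "CCGA"),
    ("GGCA", "TTGA"),
    ("ACGT", "TTGA"),
]
print(duplicate_fraction(pairs))  # 0.4
```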

In my case, it is RNA-seq of a murine adipocyte cell line: Illumina HiSeq, paired-end reads, read length 100, standard TruSeq protocol.

Actually, I do think it is due to some error in the library preparation. Even if a couple of genes are very highly expressed and hence the origin of most of the reads, the reads should not necessarily be exactly identical.

I just wonder what to do with such data. First of all, it is important to be aware that such things can happen. Furthermore, if some genes are so overrepresented, other genes might be heavily underestimated. Altogether, just be aware that you probably don't have a fair representation of the mRNA landscape in your sample.

Moreover, one could dramatically reduce the computational cost of mapping such data by restricting it to just the unique reads.
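A rough sketch of that idea: collapse identical reads, align each distinct read once, and weight each hit by its copy number. The `align` argument is a hypothetical stand-in for a real aligner call, and the lookup table below is made up; real aligners are not driven this way, so this only illustrates the bookkeeping.

```python
from collections import Counter

def collapse_reads(reads):
    """Map each distinct read to the number of times it occurs."""
    return Counter(reads)

def expand_hits(collapsed, align):
    """Align each distinct read once and weight the hit by its copy number.

    `align` is a hypothetical placeholder: it takes one read and returns
    a target name, or None if the read does not map.
    """
    hits = Counter()
    for read, copies in collapsed.items():
        target = align(read)
        if target is not None:
            hits[target] += copies
    return hits

reads = ["ACGT", "ACGT", "GGCA", "ACGT", "TTTT"]
collapsed = collapse_reads(reads)
toy_index = {"ACGT": "geneA", "GGCA": "geneB"}  # made-up lookup table
hits = expand_hits(collapsed, toy_index.get)
print(len(reads), "reads,", len(collapsed), "alignments")
print(dict(hits))
```

Keeping the copy numbers means the original read counts can still be recovered after mapping, so nothing is discarded at this stage.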

rna duplicates • 28k views

ADD COMMENT • link updated 13.4 years ago by Malachi Griffith 20k • written 13.4 years ago by Steffi ▴ 590

1

What sequencing platform and libraries are you using?

ADD REPLY • link 13.4 years ago by Damian Kao 16k

1

And what species?

ADD REPLY • link 13.4 years ago by Sean Davis 27k

0

I agree with these comments; more information would be useful, such as: source tissue, species, RNA isolation method, polyA+ selection, library construction method, read length, paired vs. unpaired reads, etc.

ADD REPLY • link 13.4 years ago by Malachi Griffith 20k

score 46 · Answer 1 · 2011-11-15

46

13.4 years ago

Malachi Griffith 20k

Observing high rates of read duplicates in RNA-seq libraries is common. It is not necessarily an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems, but it is often due to the very high abundance of a small number of genes (usually ribosomal or mitochondrial housekeeping genes). For example, I have seen libraries where ~60% of all reads mapped to the 2-10 most highly expressed genes. Sometimes 75% of all reads map to the top 0.1% of expressed genes. The result of such heavy sampling of these genes is a high number of duplicate reads (even when considering read pairs in assessing duplicates).
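One way to check whether a library is dominated by a few genes is to compute the share of reads assigned to the top N genes. The sketch below assumes you already have per-gene read counts; the function name, gene names, and counts are invented for illustration.

```python
def top_gene_share(gene_counts, n_top):
    """Return the fraction of all counted reads that fall in the n_top
    most highly expressed genes."""
    counts = sorted(gene_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n_top]) / total

# Made-up counts: two genes soak up most of the library.
gene_counts = {
    "rRNA_1": 600,
    "mt_1": 150,
    "geneA": 100,
    "geneB": 100,
    "geneC": 50,
}
print(top_gene_share(gene_counts, 2))  # 0.75
```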

This is not necessarily an artifact. Some RNA samples really do have such a skewed distribution of expression values. If the highly expressed genes are non-coding RNA genes, an efficient polyA selection before library construction will often dramatically improve the situation. Other options include a variety of library normalization techniques or simply ignoring the phenomenon. Since you have a finite number of reads in the library, if you are burning most of them on a small number of highly abundant genes you may need to sequence deeper to get good representation of the rest of the transcriptome...
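As a back-of-the-envelope illustration of the depth argument: if a fraction f of reads goes to the dominant genes, only (1 - f) of the library is left for everything else, so getting R reads on the rest of the transcriptome takes roughly R / (1 - f) reads in total. The function name and the numbers below are assumptions for illustration only.

```python
import math

def required_total_reads(target_other_reads, top_fraction):
    """Estimate the total reads needed so that `target_other_reads` fall
    outside the dominant genes, which consume `top_fraction` of the library."""
    if not 0 <= top_fraction < 1:
        raise ValueError("top_fraction must be in [0, 1)")
    return math.ceil(target_other_reads / (1 - top_fraction))

# If 75% of reads are burned on a few genes, 20 million reads on the
# remaining genes needs about 80 million reads in total.
print(required_total_reads(20_000_000, 0.75))  # 80000000
```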

Some RNA-seq software will consume vast quantities of memory when such genes are encountered. For practical reasons some filtering of the data may be required to remove highly represented sequences.

ADD COMMENT • link 13.4 years ago by Malachi Griffith 20k

3

This is actually a very good explanation.

ADD REPLY • link 13.4 years ago by Michael 55k

1

Dear Malachi,

Now, five years after your answer, and given the newly established sequencing methods, how do you view duplicate reads? Have you seen this paper?

ADD REPLY • link 8.1 years ago by H.Hasani ▴ 990

0

(I would like to hear the answer to this as well.)

ADD REPLY • link 7.7 years ago by eric.kern13 ▴ 240

4

I was a PhD student when my now fellow moderator posted his answer. I am not sure why you need a follow-up to this. In the Conclusion of the manuscript that you have linked, it is stated:

We conclude that computational removal of duplicates is not recommendable for differential expression analysis and if sufficient starting material is available so that only few PCR-cycles are necessary,

This matches the general consensus of the Biostars community, i.e., not to remove PCR duplicates for RNA-seq.

My own criticism of the manuscript is that they used ERCC spike-ins and that they don't appear to provide any concrete evidence or proof for the statements that they make. It's a neat study, though. Still, their conclusions match the community's consensus.

Kevin

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k