Question

Is It Safe To Remove Exact Duplicate Reads In The Denovo Transcriptome Assembly?

2

11.7 years ago

lwc628 ▴ 230

I use Trinity/Oases for de novo transcriptome assembly. In my pipeline, I remove exact duplicate reads (on both forward and reverse strands) because I believe that duplicates don't add any information to the assembly, and removing them reduces the input size and thus speeds up the downstream analysis. But is my assumption correct?
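To make the step concrete, here is a minimal sketch of what "exact duplicate removal on both strands" could look like. It is only an illustration, not the actual pipeline: the function names (`reverse_complement`, `canonical`, `dedup_reads`) and the toy reads are hypothetical, and it assumes single-end reads held in memory as (name, sequence) pairs.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of a read and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def dedup_reads(reads):
    """Keep the first read seen for each canonical sequence.

    `reads` is an iterable of (name, sequence) pairs (hypothetical layout).
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


# Toy input: r2 is an exact copy of r1, r3 is the reverse complement of r1.
reads = [("r1", "ACGTTG"), ("r2", "ACGTTG"), ("r3", "CAACGT"), ("r4", "GGGTTT")]
unique = dedup_reads(reads)
print([name for name, _ in unique])
```

Only r1 and r4 survive here, which is exactly the loss of count information the question is asking about.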

I am also confused about this because in the Oases paper, the authors say "assemblies with longer k values perform best on high expression genes, but poorly on low expression genes" (http://bioinformatics.oxfordjournals.org/content/early/2012/02/24/bioinformatics.bts094.short). But if we remove duplicates and are left with only a set of unique reads, don't we lose the expression-level information?

transcriptome trinity duplicates • 6.3k views

ADD COMMENT • link updated 11.7 years ago by swbarnes2 14k • written 11.7 years ago by lwc628 ▴ 230

0

I'm not sure about the answer, but I do think removing duplicates would affect your estimates of expression levels. So if the de novo assembler uses expression levels as a kind of support for the assembled contigs (for example, by mapping the reads back to them), removing duplicates may actually affect your assembly.

ADD REPLY • link 11.7 years ago by Vitis ★ 2.6k

score 5 · Answer 1 · 2013-04-17

5

11.7 years ago

swbarnes2 14k

There will always be some number of duplicates that are PCR artifacts, and some number that are "real", that is, you were unlucky, and two distinct molecules of DNA broke in exactly the same way.

Keeping the former overestimates transcript abundance; removing the latter underestimates it. So the question is: of all the duplicates you see, what is the ratio of PCR artifacts to genuinely separate but identical reads?

Unless your coverage is extremely high, I think most of your duplicates will be the former (PCR artifacts), so getting rid of them will give you more accurate results. Only once coverage climbs into the hundreds do genuinely independent, identical-looking molecules start to appear.
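For a rough feel of how often independent molecules collide by chance, here is a back-of-the-envelope sketch. It rests on strong simplifying assumptions that are mine, not measured: single-end reads of fixed length, start positions drawn uniformly along one transcript, a single strand, and no PCR duplicates at all. The function name and the example numbers are hypothetical.

```python
def expected_chance_duplicate_fraction(transcript_len, read_len, coverage):
    """Expected fraction of reads that duplicate another read purely by chance.

    Assumes uniform, independent start positions on one strand (a simplification).
    """
    n_starts = transcript_len - read_len + 1          # possible start positions
    n_reads = coverage * transcript_len / read_len    # reads needed for this coverage
    # Expected number of distinct start positions hit by n_reads random draws.
    expected_distinct = n_starts * (1.0 - (1.0 - 1.0 / n_starts) ** n_reads)
    return 1.0 - expected_distinct / n_reads


# Hypothetical 2 kb transcript sequenced with 100 bp reads.
for cov in (1, 10, 100, 500):
    frac = expected_chance_duplicate_fraction(2000, 100, cov)
    print(f"{cov:>4}x coverage: {frac:.3f} of reads are chance duplicates")
```

With paired-end data a duplicate has to match at both ends, so chance duplicates should be much rarer than this single-end model suggests; uneven coverage along the transcript pushes the other way.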

Or to put it another way, removing duplicates puts a hard ceiling on the maximum coverage you can possibly get for a given sequence. If your sequence has a higher coverage than that ceiling, you will lose your ability to know exactly how high. But that ceiling is awfully high, and duplicate removal is likely the right thing to do for samples whose coverage is well below that ceiling.
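To put a number on that ceiling in the simplest case, here is a small sketch. It assumes single-end reads of one fixed length with no sequencing errors, so that each distinct start position contributes at most one read after duplicate removal; the function names are hypothetical. Under those assumptions a base can be covered by at most `read_len` distinct reads when the two strands are collapsed, and twice that if they are kept separate.

```python
def coverage_ceiling(read_len, collapse_strands=True):
    """Maximum per-base depth after exact-duplicate removal (single-end, fixed length)."""
    return read_len if collapse_strands else 2 * read_len


def max_depth_from_unique_starts(transcript_len, read_len):
    """Brute-force check: place one read at every possible start and find the peak depth."""
    depth = [0] * transcript_len
    for start in range(transcript_len - read_len + 1):
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return max(depth)


print(coverage_ceiling(100))                   # strands collapsed
print(coverage_ceiling(100, False))            # strands kept separate
print(max_depth_from_unique_starts(500, 100))  # agrees with the collapsed ceiling
```

For paired-end reads the ceiling is higher, because pairs that share a start but differ in fragment length are not exact duplicates.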

I figure that an assembler wants all the regions in a contig to have about the same coverage, so if PCR duplicates are throwing that off, fixing that is probably the right thing to do.

ADD COMMENT • link 11.7 years ago by swbarnes2 14k

0

For mRNA-Seq experiments, the ceiling could be really high, as the dynamic range of gene expression is quite large. Also, I have never seen even coverage across a transcript in mRNA-Seq experiments; there are always highs and lows, even within an exon. I think both factors make de novo assembly of a transcriptome harder than that of a genome.

ADD REPLY • link 11.7 years ago by Vitis ★ 2.6k

0

So if I do a de novo transcriptome assembly, is duplicate removal recommended or not?

ADD REPLY • link 11.7 years ago by lwc628 ▴ 230

1

It looks like there is no good answer to this. Do two assemblies, with and without the duplicates, and compare the contigs. If you happen to have some sort of annotation to work with (usually not, otherwise you wouldn't be doing a de novo assembly), you will be able to evaluate the assemblies better.
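As one possible starting point for such a comparison, here is a minimal sketch that summarizes each assembly by contig count, total length, and N50. The choice of these metrics is my assumption rather than anything prescribed above, the contig lengths are made-up numbers, and N50 on its own is a crude measure for a transcriptome.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarize(contig_lengths):
    """Basic size statistics for one assembly."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }


# Made-up contig lengths standing in for the two assemblies.
with_dups = [2100, 1800, 950, 700, 400, 300]
without_dups = [2500, 1700, 900, 650, 500]
print("with duplicates:   ", summarize(with_dups))
print("without duplicates:", summarize(without_dups))
```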

ADD REPLY • link 11.7 years ago by Vitis ★ 2.6k