I use Trinity/Oases for de novo transcriptome assembly. In my pipeline, I remove exact duplicate reads (on both the forward and reverse strands) because I believe duplicates add no information to the assembly, and removing them reduces the input size and so speeds up the downstream analysis. But is my assumption correct?
I am also confused about this because the authors of the Oases paper say "assemblies with longer k values perform best on high expression genes, but poorly on low expression genes" (http://bioinformatics.oxfordjournals.org/content/early/2012/02/24/bioinformatics.bts094.short). But if we remove duplicates and keep only the unique set of reads, don't we lose the expression information?
Indeed, I'm not sure about the answer, but I do think removing duplicates would affect your estimates of expression levels. So if the de novo assembler happens to use expression levels (for example, by mapping the reads back to the contigs) as a sort of support for the assembled contigs, removing duplicates may actually affect your assembly.
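
To make the concern concrete, here is a toy Python sketch of strand-aware exact-duplicate removal. It is only an illustration, not how Trinity or Oases handle reads internally; `canonical` and `count_reads` are hypothetical helper names, and the reads are made-up examples. It shows that once duplicates are collapsed, every distinct sequence is left with a single copy, so the relative abundance between sequences is no longer visible in the input.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def canonical(seq):
    """Return a strand-independent key for a read.

    The key is the lexicographically smaller of the read and its
    reverse complement, so both strands collapse to the same key.
    """
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def count_reads(reads):
    """Count how often each strand-independent sequence occurs."""
    return Counter(canonical(read) for read in reads)


# Made-up reads: one sequence seen four times (once on the
# opposite strand), another sequence seen once.
reads = ["AAACCC", "AAACCC", "GGGTTT", "AAACCC", "ACGGTA"]

counts = count_reads(reads)

# What the assembler would receive after duplicate removal:
# one copy per distinct sequence, with the counts thrown away.
deduplicated = sorted(counts)

for seq in deduplicated:
    print(seq, "copies before dedup:", counts[seq], "after dedup: 1")
```

If you do want the smaller input, keeping the counts (as `counts` does here) at least lets you check afterwards how much abundance signal was discarded.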