I de novo assembled several transcriptomes of the same organism and found that as more reads (samples) are added, the resulting assembly keeps getting larger. To my knowledge, this means the assembly contains redundant transcripts, right? What I want to ask is why this happens and how to remove the redundant transcripts. A related concern: when removing redundancy, is it possible that we lose some genes of the same family, or that downstream quantification within that family is disturbed? As far as I know, the following steps may help: during assembly, use --normalize_reads to limit the maximum read coverage; after the Trinity assembly, use TGICL to extend/merge the transcripts and cd-hit to remove highly similar sequences. Are there other effective tools or strategies that can help with this?
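For context, this is roughly what I have been running — just a sketch, with placeholder file names, memory, and CPU values; I am assuming the normalization options behave as described in the Trinity documentation (in recent releases normalization is on by default):

```
# De novo assembly with in-silico read normalization enabled.
# --normalize_max_read_cov caps the per-kmer read coverage
# (value here is a placeholder, not a recommendation).
Trinity --seqType fq \
        --left  sample1_R1.fq.gz,sample2_R1.fq.gz \
        --right sample1_R2.fq.gz,sample2_R2.fq.gz \
        --normalize_reads --normalize_max_read_cov 50 \
        --max_memory 50G --CPU 16 \
        --output trinity_out
```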
Thank you. I think my transcriptome assembly is less than 200 MB, so maybe it is better to use cd-hit-est to remove the redundant transcripts, right? When using it, what identity threshold should count as redundant, 0.95 or higher? And should I extend the contigs before removing the redundant ones?
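This is the kind of cd-hit-est command I had in mind — a sketch with placeholder file names and resource settings; as far as I understand from the cd-hit user guide, the word size (-n) has to match the identity cutoff (-c), e.g. -n 10 or 11 for -c 0.95-1.0, and the longest sequence in each cluster is kept as the representative by default:

```
# Cluster the Trinity assembly at 95% nucleotide identity;
# -M is the memory limit in MB, -T the number of threads.
cd-hit-est -i Trinity.fasta -o Trinity_cdhit95.fasta \
           -c 0.95 -n 10 -M 16000 -T 8
```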