I have assembled a transcriptome with Trinity, which produced 100,000 transcripts. I have run blastx against the NR database (-outfmt 5). I want to remove redundancy from the assembled transcripts before proceeding to further processing.
1) How can I cluster them together into unigenes and remove redundancy?
2) Or how can I select the longest transcript when several transcripts return the same hit?
3) Can you suggest any software or program for doing this?
4) Is it necessary to remove redundancy?
Thank you very much.
See also the Trinity FAQ: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-FAQ#ques_why_so_many_transcripts
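For question 2, one simple option is to keep only the longest isoform per Trinity gene. Below is a minimal Python sketch of that idea, not a definitive method. It assumes the FASTA headers follow the usual Trinity naming pattern (for example TRINITY_DN1000_c115_g5_i1), where the trailing _i<number> marks the isoform and the rest of the ID marks the gene. The function names (read_fasta, gene_id, longest_isoform_per_gene) are hypothetical and written only for this illustration.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header: str) -> str:
    """Strip the isoform suffix (_i<number>) from a Trinity transcript ID.

    Assumption: IDs look like TRINITY_DN1000_c115_g5_i1.
    """
    transcript_id = header.split()[0]
    return re.sub(r"_i\d+$", "", transcript_id)


def longest_isoform_per_gene(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Return {gene_id: (header, sequence)} keeping the longest isoform.

    On ties, the isoform seen first is kept.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in read_fasta(lines):
        gene = gene_id(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


if __name__ == "__main__":
    # Tiny made-up example; replace with open("Trinity.fasta") for real data.
    demo = [
        ">TRINITY_DN1_c0_g1_i1 len=4",
        "ACGT",
        ">TRINITY_DN1_c0_g1_i2 len=8",
        "ACGTACGT",
        ">TRINITY_DN2_c0_g1_i1 len=6",
        "GGGCCC",
    ]
    for header, seq in longest_isoform_per_gene(demo).values():
        print(f">{header}\n{seq}")
```

To group by blastx hit instead of by Trinity gene, the same loop should work if gene_id is replaced by a lookup from transcript ID to its top hit, which you would need to build from your own BLAST output. Whether this kind of filtering is appropriate depends on the downstream analysis, so check the FAQ linked above before discarding isoforms.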