Hi, I just finished a transcriptome assembly with Trinity. However, Trinity produced far too many transcripts (~300k), which is not normal for my sample. I believe most of these transcripts are redundant. How can I remove the redundant transcripts?
1) I already tried CD-HIT-EST. Unfortunately, the output still contains many redundant transcripts.
2) I also tried Corset, following the tutorial here (https://github.com/Oshlack/Corset/wiki/Example). However, I am currently stuck on how to recover the unigene sequences from the Corset output.
3) I plan to try TGICL to further remove redundant sequences from the CD-HIT output, as some studies have done. However, I am not very familiar with TGICL and don't know which parameters to use.
I would be grateful if somebody could help with my problem. Thanks.
Which organism are you working with?
I always find it helpful to map the transcripts and view them in a genome browser. I find GMAP to be the best mapper. Example command (might be out of date):
gmap -f gff3_gene -D /lager2/rcug/seqres/HS/gmap/hg19_gmap -d hg19_gmap -B 5 -t 16 --intronlength=150000 --totallength=1000000 --npaths 1 -p 3 in.fa > in.fa.gff3
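On point 2 of the question (getting unigene sequences back from Corset): one common workaround is to keep the longest transcript of each cluster as its representative. Below is a minimal sketch of that idea, not something Corset does itself. It assumes the Corset cluster file is tab-separated with the transcript ID in the first column and the cluster ID in the second, and that the transcript IDs match the first word of each Trinity FASTA header. The function names and file names are made up for illustration, and "longest" is only one possible choice of representative.

```python
import sys
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence); the ID is the first word of the header."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def read_clusters(handle: TextIO) -> Dict[str, str]:
    """Map transcript ID -> cluster ID (assumed layout: two tab-separated columns)."""
    clusters = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            clusters[fields[0]] = fields[1]
    return clusters


def longest_per_cluster(
    records: Iterable[Tuple[str, str]], clusters: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return cluster ID -> (transcript ID, sequence) of its longest member."""
    best: Dict[str, Tuple[str, str]] = {}
    for seq_id, seq in records:
        cluster = clusters.get(seq_id)
        if cluster is None:
            continue  # transcript is absent from the cluster file, so skip it
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (seq_id, seq)
    return best


def main(fasta_path: str, clusters_path: str, out_path: str) -> None:
    with open(clusters_path) as fh:
        clusters = read_clusters(fh)
    with open(fasta_path) as fh:
        best = longest_per_cluster(read_fasta(fh), clusters)
    with open(out_path, "w") as out:
        for cluster, (seq_id, seq) in sorted(best.items()):
            # Header: cluster ID first, original transcript ID kept for traceability
            out.write(f">{cluster} {seq_id}\n{seq}\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved under a hypothetical name such as pick_longest.py, it would be run as: python pick_longest.py Trinity.fasta clusters.txt unigenes.fasta. Transcripts that do not appear in the cluster file are dropped from the output, so check the sequence count afterwards.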