I tried to use CD-HIT-EST to remove redundancy from Trinity de novo transcripts:
cd-hit-est -i trinity.fasta -o clstr_out -c 0.9 -n 9
However, I can still see redundant annotations. For example:
TRINITY_DN1855_c5_g1, TRINITY_DN1855_c1_g1
Both point to dnaK. The two sequences align at 92% identity, but CD-HIT does not cluster them:
>TRINITY_DN1855_c5_g1
CGCCAAGAAGACCGAGATCTACAGCACCGCCGAAAACAACCAGCCCGGTGTGGAAATCAACGTGCTGCAAGGCAAGCGCC
CCATGGCCGCCGACAACAGGTCCCTGGGCCGCTTCAAGCTCGAGGGCATTCCCCCCATGCCCGCAGGCTGCGCCCAGATC
GAAGTGACCTTCGGTATCGACGCCAACGGCATTCTGCATGTCACCGCCAAGGAAAAGACCAGCAGCAAGGAAAGCAGCAT
CCGCATCGGGAACACCACCACCCTCGACAAGAGTGACGTGGAGCGCATGGTGCAGGAAACCGAGCAGAACGCCGCCGCCG
ACAGGGCCCGCAAGGAGAAGGTCGAGAAACGCAACAACCTCGACTCGCTGCGC
>TRINITY_DN1855_c1_g1
AGGGCGGCATGATTGCCCCGATGGTTACCCGCAACACCACCGTGCCCGTCAAGAAGACCGAGATCTACACCACTGCCGAAAA
CAACCAGCCCGGCGTGAAAATCAACGTGCTGCAAGGCGAGCACCCCATGGCCGCCGACAACAAGTCTCTGGGCCGCTTCAAGCTCGAAGGCGTTCCCCCCATGCCCGCAGGCCGCGTCCAGATCGAAGTGACCTTCGATAT
Trying other thresholds, such as -c 0.89 and -c 0.88, did not reduce the redundancy; it actually increased the number of transcripts.
I would appreciate your comments on what the problem is and how to address it.
Thanks,
Xp
See this thread for further assistance: how to use CD_HIT to remove the redundant sequence from trinity output file
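One possible explanation, offered as a guess rather than a diagnosis: by default cd-hit-est uses global identity (-G 1), that is, identical bases divided by the full length of the shorter sequence. Two transcripts that overlap only partially can therefore align at 92% over the overlapping region and still fall below -c 0.9. A minimal sketch of a re-run with local identity (-G 0) plus a coverage requirement on the shorter sequence (-aS) is below. The output name clstr_out_local and the 0.7 coverage value are assumptions chosen for illustration, not tuned values.

```shell
#!/bin/sh
# Sketch only: re-run cd-hit-est with local identity instead of the
# default global identity.
in=trinity.fasta          # input file from the original command
out=clstr_out_local       # hypothetical output name

# -G 0    : local identity (identical bases / alignment length)
# -aS 0.7 : alignment must cover at least 70% of the shorter sequence
#           (illustrative value, not a recommendation)
cmd="cd-hit-est -i $in -o $out -c 0.9 -n 9 -G 0 -aS 0.7"
echo "$cmd"

# Run only when the tool and the input file are actually present.
if command -v cd-hit-est >/dev/null 2>&1 && [ -f "$in" ]; then
    $cmd
fi
```

Comparing the number of clusters in clstr_out_local.clstr with the original clstr_out.clstr should show whether the partial-overlap explanation holds for this data set.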