Is there an alternative to CD-HIT for removing redundancy from an assembled Trinity output file?
6.4 years ago

Hi everyone

I have a problem reducing redundancy in a Trinity output file. I have an assembled FASTA file from Trinity containing 302,000 contigs, with a lot of redundancy. I used CD-HIT to remove the redundancy and obtain unigenes. After running CD-HIT, the number of contigs dropped to 240,000, but these still show a lot of redundancy, so CD-HIT was not effective for obtaining unigenes. Please advise: how can I remove the redundancy and obtain unigenes?

Thanks

Assembly • 6.1k views

Did you tweak the identity threshold using the -c option of cd-hit?


When you say "240000 contigs showing lots of redundancies again", how do you verify that?

Try using TGICL
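
One rough way to check how much redundancy remains is to compare the number of transcripts with the number of Trinity gene IDs in the FASTA headers. The sketch below assumes the headers still follow Trinity's default naming (for example TRINITY_DN1000_c115_g5_i1, where the trailing _i1 is the isoform number); the function name trinity_counts is hypothetical and only for illustration.

```shell
# Summarise a Trinity FASTA: number of transcripts (isoforms) and Trinity "genes".
# Assumes default Trinity headers such as ">TRINITY_DN1000_c115_g5_i1 len=247 ...".
trinity_counts() {
  grep '^>' "$1" |
    awk '{
      id = substr($1, 2)            # drop the leading ">"
      sub(/_i[0-9]+$/, "", id)      # strip the isoform suffix to get the gene ID
      genes[id] = 1
      n++
    }
    END {
      g = 0
      for (k in genes) g++
      printf "transcripts\t%d\ngenes\t%d\n", n, g
    }'
}
```

Called as trinity_counts Trinity.fasta, it prints two tab-separated lines. A transcript count far above the gene count suggests that much of the apparent redundancy is isoforms of the same Trinity gene rather than duplicated sequences.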

6.4 years ago
h.mon 35k

Trinity has a fairly new script to construct "SuperTranscripts" based on the gene-to-isoform relationships and the sequence graph structure that Trinity uses during assembly. I think this will give a better representation of unigenes than CD-HIT.

$TRINITY_HOME/Analysis/SuperTranscripts/Trinity_gene_splice_modeler.py \
   --trinity_fasta Trinity.fasta
6.4 years ago
Jake Warner ▴ 840

Getting 'unigenes' from Trinity assemblies is tricky business. I've found that Corset performs better than CD-Hit. Another idea is to BLAST all the transcripts and group them by reciprocal best blast hit.
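
As a rough illustration of the reciprocal-best-hit idea, the sketch below assumes an all-vs-all BLAST of the transcripts against themselves, saved in the default tabular format (-outfmt 6, where column 1 is the query, column 2 the subject and column 12 the bit score). Self hits are ignored, and a pair is reported only when each transcript is the other's top-scoring hit. The function name reciprocal_best_hits is hypothetical, and this is one possible reading of the approach, not a tested pipeline.

```shell
# Report reciprocal best hits from an all-vs-all BLAST table (-outfmt 6).
# Columns assumed: 1 = query ID, 2 = subject ID, 12 = bit score.
reciprocal_best_hits() {
  awk -F'\t' '
    $1 != $2 {                                   # skip self hits
      if (!($1 in best) || $12 + 0 > score[$1]) {
        best[$1] = $2                            # best subject seen so far
        score[$1] = $12 + 0                      # and its bit score
      }
    }
    END {
      for (q in best) {
        s = best[q]
        # keep the pair only if the relationship is mutual; print it once
        if ((s in best) && best[s] == q && q < s)
          print q "\t" s
      }
    }
  ' "$1" | sort
}
```

Called as reciprocal_best_hits all_vs_all.tsv (a hypothetical file name), it prints one tab-separated pair per line. Ties in bit score are resolved in favour of the first hit encountered, which may or may not be what you want.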


LACE and Corset are tools from the same group. Initially I thought LACE would be the preferred tool, as it was developed more recently, but I was wrong: according to one of the authors of both tools, they should be equivalent for the purpose of gene-level differential expression analysis. As Trinity's Trinity_gene_splice_modeler.py is based on the same algorithm as LACE, it should also be equivalent to Corset.

4.1 years ago
bcontreras ▴ 10

We have used our own tool, https://github.com/eead-csic-compbio/get_homologues, successfully. In fact, we benchmarked it against CD-HIT in https://www.frontiersin.org/articles/10.3389/fpls.2017.00184/full

I believe the main problem is that isoforms with different exons or retained introns are not properly handled by CD-HIT, but can be clustered correctly with GET_HOMOLOGUES-EST in most cases.
