We have sequenced the transcriptome of a species that does not have a sequenced genome, using 454, and our initial goal is to find a set of ESTs that represent genes. The 454 reads were assembled with Newbler 2.5, and the initial assembly gave ~26,000 isotigs and 18,000 contigs. After talking to several people, I used the CD-HIT program to combine the isotigs, the contigs, and the singletons that were not assembled. After combining these sequences, I got ~4,000 isotigs, ~17,000 contigs, and ~30,000 singletons that were not assembled. Is this the correct way to do this? I couldn't find any publication that mentions this method.
Which identity thresholds did you use?
Algorithms for CD-HIT were described in three papers published in Bioinformatics.
Please check these papers about CD-HIT.
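One way to sanity-check a clustering run like the one described in the question is to look at the `.clstr` file that CD-HIT writes next to its FASTA output and count how many sequences were merged versus left alone at the chosen identity threshold (the `-c` option). The following is only a minimal sketch in Python: the sequence names and numbers in `EXAMPLE_CLSTR` are made up for illustration, the helper names `parse_clstr` and `summarize` are hypothetical, and the regular expression assumes the usual `.clstr` line layout (index, length, `>name...`, then `*` for the representative or `at ...%` for a member).

```python
import re
from collections import Counter

# Made-up example in the usual .clstr layout; these are not real results.
EXAMPLE_CLSTR = """\
>Cluster 0
0\t2100nt, >isotig00001... *
1\t1850nt, >contig00042... at +/97.50%
2\t410nt, >read_A1... at -/99.02%
>Cluster 1
0\t950nt, >contig00107... *
>Cluster 2
0\t380nt, >read_B7... *
1\t365nt, >read_C3... at +/98.10%
"""

# index, length (nt or aa), sequence name, then "*" or "at <strand>/<identity>%"
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(text):
    """Return a list of clusters, each with its representative and members."""
    clusters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append({"representative": None, "members": []})
            continue
        match = MEMBER_RE.match(line)
        if match is None or not clusters:
            raise ValueError(f"Unexpected line in .clstr text: {raw!r}")
        length, name, status = match.groups()
        clusters[-1]["members"].append((name, int(length)))
        if status == "*":  # "*" marks the representative sequence of the cluster
            clusters[-1]["representative"] = name
    return clusters


def summarize(clusters):
    """Count clusters, single-member clusters, and clusters with merged members."""
    sizes = Counter(len(c["members"]) for c in clusters)
    merged = sum(count for size, count in sizes.items() if size > 1)
    return {"clusters": len(clusters), "unclustered": sizes[1], "merged": merged}


if __name__ == "__main__":
    print(summarize(parse_clstr(EXAMPLE_CLSTR)))
```

To use it on a real run, read the `.clstr` file into a string and pass it to `parse_clstr`. Comparing the summary across a few identity thresholds shows how sensitive the final isotig, contig, and singleton counts are to that setting.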