Hello all, I have downloaded transcription factor (TF) sequences of Tribolium castaneum from two databases: DBD: Transcription factor prediction database (http://www.transcriptionfactor.org/index.cgi?Home) and a database of metazoan transcription factors and maternal factors (http://www.bioinformatics.org/regulator/page.php). I got ~620 sequences from the former and 519 sequences from the latter. I then BLASTed the sequences of the two files against each other, and around 70% of the sequences have a similarity higher than 75%, so I think a certain number of protein sequences in the two files represent the same transcription factors. I want to use these Tribolium castaneum TFs as a reference for my insect transcriptome data, so I want to remove the redundant sequences and keep only the unique TF sequences. I used cd-hit-est, but it did not cluster any of these sequences. Now I am going to BLAST all the sequences against the NCBI nr database and then delete the duplicated ones according to their annotations.
My question is: can I do better than this? If so, how? Could you please give me some suggestions?
Thanks
If you have protein sequences, you don't want cd-hit-est, which is for nucleotide sequences; that is probably why nothing clustered. cd-hit, the protein version, should work.
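
Before clustering, it may also help to merge the two downloads and drop sequences that are exactly identical, so the clustering step only has to deal with near-duplicates. Here is a minimal sketch in plain Python, assuming both files are ordinary protein FASTA. The function names (`parse_fasta`, `merge_unique`) and the example record IDs are made up for illustration and are not part of either database or of cd-hit.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*record_sets: Iterable[Record]) -> List[Record]:
    """Merge record sets, keeping the first header seen for each exact sequence.

    Sequences are compared case-insensitively, ignoring a trailing '*'
    (stop symbol). This is an assumption about how the files may differ.
    """
    first_header = {}  # normalized sequence -> first header seen
    for records in record_sets:
        for header, seq in records:
            key = seq.upper().rstrip("*")
            first_header.setdefault(key, header)
    return [(header, seq) for seq, header in first_header.items()]


# Tiny in-memory example; with real data, pass open file handles instead,
# e.g. parse_fasta(open("dbd.fasta")) (file name is hypothetical).
dbd = parse_fasta([">dbd_1", "MKT", "LLA", ">dbd_2", "MSSQ"])
reg = parse_fasta([">reg_1", "mktlla*", ">reg_2", "MPPR"])
merged = merge_unique(dbd, reg)

for header, seq in merged:
    print(f">{header}\n{seq}")
```

This only catches exact duplicates. The merged FASTA output can then be given to cd-hit to collapse the near-identical sequences at whatever identity threshold you decide is appropriate.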