Hello.
I have several files (>200), each with more than 100 protein sequences, and I want to get the taxonomy of each sequence. My first thought was to use BLAST. Since there are quite a few sequences, I used a Perl script to run remote BLAST searches and then extracted the accession of the best hit. I've tried using EUtils to retrieve the taxonomy, but it's really slow. I've also tried the Bio::LITE::Taxonomy package, but it only works with GI numbers (not accessions), and NCBI doesn't use GIs anymore. My other thought was a standalone BLAST, but the nr database is really big and building it takes a lot of time.
Does anyone know a better way to get the taxonomy from a protein sequence or from its accession? If not, I'll stick to EUtils.
Thanks!!
You can download pre-formatted nr database files from NCBI, so you don't have to build the database yourself. It is still a big download. I am not sure if NCBI actually includes taxonomy info in their pre-formatted BLAST indexes.
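If the taxonomy is not available from the BLAST indexes, another offline route is to look up the best-hit accessions in a local accession-to-taxid table instead of querying EUtils one record at a time. Below is a minimal Python sketch of that idea. It assumes a tab-separated mapping file laid out like NCBI's prot.accession2taxid dump (one header line, then the columns accession, accession.version, taxid, gi). The function name `load_taxids` and the file name in the usage note are placeholders, not part of any library.

```python
import gzip


def load_taxids(mapping_path, wanted):
    """Return {accession.version: taxid} for the accessions listed in `wanted`.

    Assumption: `mapping_path` is a tab-separated file with a header line and
    the columns accession, accession.version, taxid, gi (plain or gzipped).
    """
    wanted = set(wanted)
    found = {}
    # Pick the opener from the file extension so .gz dumps work unchanged.
    opener = gzip.open if str(mapping_path).endswith(".gz") else open
    with opener(mapping_path, "rt") as handle:
        next(handle, None)  # skip the header line (assumed to be present)
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed or empty lines
            acc_version, taxid = fields[1], fields[2]
            if acc_version in wanted:
                found[acc_version] = int(taxid)
                if len(found) == len(wanted):
                    break  # stop early once every accession is resolved
    return found
```

You would call it once per batch, e.g. `load_taxids("prot.accession2taxid.gz", best_hit_accessions)`, where `best_hit_accessions` is the list of versioned accessions taken from your BLAST results. The file is streamed line by line, so memory use depends on the number of accessions you ask for, not on the size of the mapping file. Accessions missing from the file are simply absent from the returned dictionary.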
Have you looked at this solution on Stack Overflow?