Hi all,
I have a lot of accession codes (on the order of a million) that I need to fetch info for from GenBank. I found these by blasting nucleotide sequences against the nt database. I'd like to know whether they have associated protein sequences and, if so, fetch them. I'm currently doing this using Biopython's Entrez module.
I was hoping to process the codes in parallel in batches of 100, because fetching info about one code can take up to a minute, but I'm limited to 3 queries per second. Further, I have to fetch the entire record for a given entry, whereas just knowing the organism and protein name (if any) would often be enough to filter out the codes that will not be helpful.
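For reference, here is a minimal sketch of the batched approach, assuming Biopython is installed and that one `efetch` call per batch of 100 IDs counts as a single query. The function names (`batched`, `proteins_for_batch`, `fetch_all`), the placeholder email and the 0.34 s spacing are my own choices, not anything prescribed by NCBI or Biopython. It pulls the full GenBank records and keeps only the organism and the CDS qualifiers, so large records will still be slow to download.

```python
import time
from typing import Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` accession codes."""
    batch: List[str] = []
    for acc in ids:
        batch.append(acc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def proteins_for_batch(accessions: List[str]) -> Dict[str, dict]:
    """Fetch one batch from nuccore and collect organism + CDS info per record."""
    # Imported lazily so the batching helper works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = "you@example.org"  # placeholder; use your real address
    handle = Entrez.efetch(db="nuccore", id=",".join(accessions),
                           rettype="gb", retmode="text")
    result: Dict[str, dict] = {}
    try:
        for record in SeqIO.parse(handle, "genbank"):
            # Each CDS feature may carry a protein accession, name and sequence.
            cds = [
                (f.qualifiers.get("protein_id", [None])[0],
                 f.qualifiers.get("product", [None])[0],
                 f.qualifiers.get("translation", [None])[0])
                for f in record.features if f.type == "CDS"
            ]
            result[record.id] = {
                "organism": record.annotations.get("organism"),
                "cds": cds,  # empty list means no annotated protein
            }
    finally:
        handle.close()
    return result


def fetch_all(accessions: Iterable[str], batch_size: int = 100,
              min_interval: float = 0.34) -> Iterator[Dict[str, dict]]:
    """Fetch batches sequentially, spacing requests to stay under 3 per second."""
    for batch in batched(accessions, batch_size):
        started = time.monotonic()
        yield proteins_for_batch(batch)
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
```

With a million codes this is about 10,000 requests rather than a million, so the per-second limit matters much less than the size of each response.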
I also have access to a local copy of BLAST+ and the most up-to-date databases. I can use blastdbcmd to get the organism name and nucleotide sequence, but this doesn't give me info about whether the associated protein sequence is known.
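One way the local database could still help is as a pre-filter: look up the organism for every code with blastdbcmd, drop the ones that are not of interest, and only send the rest to Entrez. The sketch below assumes `blastdbcmd` is on the PATH, that the taxonomy files are installed next to the database (otherwise `%S` does not resolve to a name), and that a literal tab in `-outfmt` is passed through unchanged. The helper names are hypothetical.

```python
import subprocess
from typing import Dict, Iterable, List


def blastdbcmd_args(db: str, id_file: str) -> List[str]:
    """Build the command line: %a = accession, %S = scientific name."""
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file,
            "-outfmt", "%a\t%S"]


def parse_organisms(output: str) -> Dict[str, str]:
    """Turn tab-separated 'accession<TAB>organism' lines into a dict."""
    organisms: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        acc, _, name = line.partition("\t")
        organisms[acc] = name
    return organisms


def keep_by_organism(organisms: Dict[str, str],
                     wanted: Iterable[str]) -> List[str]:
    """Return the accessions whose organism is in the wanted set."""
    wanted_set = set(wanted)
    return [acc for acc, name in organisms.items() if name in wanted_set]


def organisms_from_db(db: str, id_file: str) -> Dict[str, str]:
    """Run blastdbcmd on a file with one accession per line."""
    proc = subprocess.run(blastdbcmd_args(db, id_file), check=True,
                          capture_output=True, text=True)
    return parse_organisms(proc.stdout)
```

The accessions returned by `keep_by_organism` are the only ones that would then need a remote lookup for protein information.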
What are my options at this point?
Thanks!
Thanks! I should have blasted my nucleotides against nr, but I have already processed a ton by blasting against nt... It's a learning process I guess :) I will try downloading the database and searching through it locally.