Hi everyone,
I want to build a BLAST database of insect proteins so that I can BLAST my transcriptome assembly locally. I downloaded all the accession numbers associated with insects from the NCBI website. Next, I used this command to retrieve the corresponding FASTA sequences from my locally installed NCBI nr database:
blastdbcmd -db /home/db/ncbi/nr -entry_batch protein_result.txt -out insects_seq.fa
This, however, gives me incomplete output: a lot of accession numbers were not found, e.g. Error: CAB42201.1: OID not found
Moreover, I get a lot of multi-header entries in the output file, e.g.:
>gi|1080121958|gb|AOW70003.1| arginine kinase, partial [Remella rita] >gi|1080122062|gb|AOW70055.1| arginine kinase, partial [Xenophanes tryxus]
EEKVSSTLSGLEGELKGTFYPLTGMSKQTQQQLIDDHFLFKEGDRFLQAANACRFWPTGRGIYHNENKTFLVWCNEEDHL
RLISMQMGGDLKTVYKRLVTAVNDIEKRIPFSHNDRLGFLTFCPTNLGTTVRASVHIKLPKLAADKAKLEEVASKYHLQV
RGTRGEHTEAEGGVYDISNKRRMGLTEYDAVKEMYDG
Is there a way to avoid both issues?
Thanks a lot in advance! Janne
I can reproduce the second example posted above (with BLAST+ v2.5.0) and can recover the same sequence entry using either of those accession numbers independently with blastdbcmd.
Edit: Examining those two individual entries (at NCBI) confirms that their sequences are identical. So NCBI is perhaps saving space by storing a single copy of the sequence together with both headers? That seems to be the only logical explanation.
Edit 2: Having two headers like that in a single entry is going to further mess up FASTA format.
You may want to confirm by emailing BLAST support.
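In the meantime, if the multi-header entries are a problem downstream, one possible workaround is to post-process the blastdbcmd output so that every merged header gets its own FASTA record with a copy of the shared sequence. Below is a minimal Python sketch, not an official BLAST tool. It assumes that merged headers are joined by " >" (as in the example in the question) or by a Ctrl-A character, and that " >" never occurs inside a real description. The function names are made up for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header without the leading '>', sequence) for each record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_merged_headers(header: str) -> List[str]:
    """Split a merged defline into its individual headers.

    Assumption: headers are joined by ' >' or by Ctrl-A (\\x01).
    """
    return [h.strip() for h in re.split(r" >|\x01", header) if h.strip()]


def expand_records(lines: Iterable[str], width: int = 80) -> Iterator[str]:
    """Yield output lines with one FASTA record per individual header."""
    for header, seq in read_fasta(lines):
        for single in split_merged_headers(header):
            yield ">" + single
            # Re-wrap the shared sequence at a fixed line width.
            for i in range(0, len(seq), width):
                yield seq[i:i + width]


def expand_file(in_path: str, out_path: str) -> None:
    """Read a FASTA file and write the expanded version to out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for out_line in expand_records(fin):
            fout.write(out_line + "\n")
```

For example, expand_file("insects_seq.fa", "insects_seq.split.fa") would write the expanded records to a new file (the output file name is just a placeholder). Note that this reintroduces the duplicate sequences that nr had merged, and it also keeps any non-insect headers that happen to share a sequence with an insect entry, so you may still want to filter the headers against your accession list afterwards.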
Hi Janne,
Have you solved this issue?