Hi all!
I'm creating a blast database using:
makeblastdb -in proteins.fasta -dbtype prot -parse_seqids -out my_protein_db
I was trying to extract some sequences from this database using blastdbcmd, but kept getting "Entry not found" errors.
My entries look like this (there is one pipe in each entry): ABC|DEF60375.1 EHL|XP_003887.1
However, if I check the identifiers in my database using:
blastdbcmd -entry all -db my_protein_db -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
I get lines like this:
> OID: 0 GI: N/A ACC: ABC|DEF60375.1 IDENTIFIER: gnl|ABC|DEF60375.1
> OID:0 GI: N/A ACC: EHL|XP_003887.1 IDENTIFIER: lcl|EHL|XP_003887.1
So it seems NCBI has added some text plus a pipe in front of my identifiers. I could just prepend these extra letters to my entries when I use blastdbcmd, but I noticed that they are not always the same: in some cases it is "gnl|" and in others it is "lcl|". Does anyone know how NCBI decides on this naming convention, and what is the best way to get around it?
Thanks very much for any input
What do the FASTA headers in your proteins.fasta look like? For example, what does this print?
grep "^>" proteins.fasta | head -3
Which version of BLAST are you using?
See this page for additional detail.
Those are standard NCBI FASTA identifiers.
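Since the prefix varies between "gnl|" and "lcl|", one way to avoid guessing it is to let the database tell you which full identifier it stored for each accession. Below is a minimal sketch of that idea. The helper name full_id and the file id_map.tsv are made up for illustration, and the usage lines assume that blastdbcmd accepts the full identifier for -entry, as the original post suggests.

```shell
# Hypothetical helper: read "ACC IDENTIFIER" pairs (space-separated) on stdin
# and print the full identifier stored for the accession given as $1.
# Returns a non-zero exit status if the accession is not in the table.
full_id() {
    awk -v acc="$1" '$1 == acc { print $2; found = 1 } END { exit !found }'
}

# Usage sketch (database name taken from the post, not run here):
#   blastdbcmd -db my_protein_db -entry all -outfmt "%a %i" > id_map.tsv
#   blastdbcmd -db my_protein_db -entry "$(full_id 'ABC|DEF60375.1' < id_map.tsv)"
```

The table only needs to be dumped once per database, and a failed lookup tells you the accession is missing before blastdbcmd reports "Entry not found".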
blast+/2.6.0
I will check them out, thanks!
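Another possible workaround, if you are free to rename your sequences, is to remove the pipes from the headers before running makeblastdb, so that -parse_seqids has nothing to interpret as an NCBI-style identifier. This is only a sketch: the function name sanitize_headers and the output file name are made up, and I have not checked how BLAST 2.6.0 stores the resulting identifiers, so inspect them with the blastdbcmd -entry all command from the question afterwards.

```shell
# Hypothetical helper: replace every "|" with "_" on FASTA header lines only.
# Sequence lines (those not starting with ">") pass through unchanged.
sanitize_headers() {
    sed '/^>/ s/|/_/g'
}

# Usage sketch (not run here):
#   sanitize_headers < proteins.fasta > proteins.nopipe.fasta
#   makeblastdb -in proteins.nopipe.fasta -dbtype prot -parse_seqids -out my_protein_db
```

With this approach, ABC|DEF60375.1 becomes ABC_DEF60375.1, so any list of entries you pass to blastdbcmd needs the same substitution.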