extracting sequences from a blast database
5.6 years ago
max_19 ▴ 170

Hi all!

I'm creating a blast database using:

makeblastdb -in proteins.fasta -dbtype prot -parse_seqids -out my_protein_db

I tried to extract some sequences from it with blastdbcmd, but kept getting "Entry not found" errors.

My entries look like this (there is one pipe in each entry): ABC|DEF60375.1 EHL|XP_003887.1

However, if I check the identifiers in my database using:

blastdbcmd -entry all -db my_protein_db -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"

I get lines like this:

> OID: 0 GI: N/A ACC: ABC|DEF60375.1 IDENTIFIER: gnl|ABC|DEF60375.1 

> OID:0 GI: N/A ACC: EHL|XP_003887.1 IDENTIFIER: lcl|EHL|XP_003887.1

So it seems NCBI has added some text plus a pipe in front of my identifiers. I could just prepend these extra letters to my entries when I use blastdbcmd, but I noticed that they are not always the same: in some cases it is "gnl|" and in others "lcl|". Does anyone know how NCBI decides on this naming convention, and what's the best way to get around it?

Thanks very much for any input

sequencing genome protein blast • 2.7k views

What do the FASTA headers in your proteins.fasta look like? What does grep "^>" proteins.fasta | head -3 show?


like this:

>
MKFSTLLKSNKLQGWEDFYIQYDNLIKYLKTDPLKFKNLLIKENTKITTFFNEIEEQANQQKNELLMLVKNNLIYDSSTK
YKNFKDKLYQNELID

Which version of blast are you using?

See this page for additional detail.

Those are NCBI standard fasta identifiers.
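
One possible workaround, as a rough sketch rather than an official recipe: since blastdbcmd already reports both the accession (%a) and the full identifier (%i), you could look up the prefixed identifier instead of guessing whether it is "gnl|" or "lcl|". The listing below is hard-coded from the two lines quoted in the question so that the snippet runs without BLAST+ installed; the helper name lookup_full_id is made up for illustration, and the two-column "%a %i" format is an assumption about how you would dump the database.

```shell
#!/bin/sh
# Stand-in for the output of:
#   blastdbcmd -entry all -db my_protein_db -outfmt "%a %i"
# Column 1 = accession as reported by %a, column 2 = full identifier from %i.
# The two rows are copied from the question; a real database would have more.
listing='ABC|DEF60375.1 gnl|ABC|DEF60375.1
EHL|XP_003887.1 lcl|EHL|XP_003887.1'

# Hypothetical helper: reads the two-column listing on stdin and prints
# the full identifier whose accession exactly matches the first argument.
lookup_full_id() {
    awk -v want="$1" '$1 == want { print $2 }'
}

full_id=$(printf '%s\n' "$listing" | lookup_full_id 'EHL|XP_003887.1')
printf '%s\n' "$full_id"

# With BLAST+ available, the result could then be passed on, for example:
#   blastdbcmd -db my_protein_db -entry "$full_id"
```

For many sequences, the same lookup could write one identifier per line to a file and hand that file to blastdbcmd with -entry_batch.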


blast+/2.6.0

I will check them out, thanks!
