The use of fastacmd and blastdbcmd suggests you are trying to get the UniProtKB sequences from an NCBI BLAST database. Depending on how the database was constructed, look-ups using the various identifiers may or may not work.
Firstly, the NCBI BLAST database needs to have been built with indexing of the sequence identifiers enabled (i.e. with -oT for formatdb or -parse_seqids for makeblastdb). The BLAST databases provided on the NCBI's FTP site should all have this enabled, but for other NCBI BLAST databases this may not have been enabled when the database was created.
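As a rough sketch of building a database with identifier indexing switched on, the makeblastdb call could be wrapped as below. The function name build_indexed_db and the file and database names in the usage comment are made up for illustration; only the makeblastdb options themselves come from the NCBI BLAST+ tools.

```shell
# Hypothetical helper: build a protein BLAST database with
# sequence identifier indexing enabled (-parse_seqids).
# Usage: build_indexed_db INPUT.fasta DBNAME
build_indexed_db() {
    makeblastdb -in "$1" -dbtype prot -parse_seqids -out "$2"
}

# Example (file and database names are placeholders):
#   build_indexed_db my_sequences.fasta mydb
```

Without -parse_seqids the database is still searchable with BLAST, but identifier look-ups with blastdbcmd may not find anything.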
For the 'nr' BLAST database provided by NCBI, look-ups are supported using all the entry identifiers appearing in the fasta header line. So for UniProtKB:WAP_RAT the 'nr' fasta header line is:
>gi|139691|sp|P01174.2|WAP_RAT RecName: Full=Whey acidic protein; Short=WAP; AltName: Full=Whey phosphoprotein; Flags: Precursor >gi|5679681|emb|CAA25600.2| whey acidic protein [Rattus norvegicus]
Which means we can search 'nr' with:
1. NCBI gi number:
blastdbcmd -db nr -dbtype prot -entry '139691' -get_dups
blastdbcmd -db nr -dbtype prot -entry '5679681' -get_dups
2. UniProtKB accession:
blastdbcmd -db nr -dbtype prot -entry 'P01174' -get_dups
3. UniProtKB sequence version accession:
blastdbcmd -db nr -dbtype prot -entry 'P01174.2' -get_dups
4. UniProtKB entry name aka. UniProtKB ID:
blastdbcmd -db nr -dbtype prot -entry 'WAP_RAT' -get_dups
5. INSDC protein_id:
blastdbcmd -db nr -dbtype prot -entry 'CAA25600' -get_dups
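To see which identifiers a given 'nr' header offers for look-ups like those above, something along these lines may help. It is only a sketch: the function name defline_ids is invented here, the list of database tags is a small illustrative subset, and it assumes merged entries are separated by " >" as in the header shown above.

```shell
# Hypothetical helper: list the identifiers found in an NCBI-style
# fasta header line read from standard input, one per line.
defline_ids() {
    awk '{
        sub(/^>/, "")                        # drop the leading ">"
        n = split($0, entry, / >/)           # one element per merged entry
        for (e = 1; e <= n; e++) {
            split(entry[e], word, " ")       # word[1] is the identifier token
            m = split(word[1], field, /[|]/)
            for (i = 1; i <= m; i++) {
                # skip empty fields and database tags (illustrative subset)
                if (field[i] == "" || field[i] ~ /^(gi|sp|tr|emb|gb|dbj|ref|pir|prf|pdb)$/)
                    continue
                print field[i]
                # also print the accession without its sequence version
                if (field[i] ~ /\.[0-9]+$/) {
                    acc = field[i]
                    sub(/\.[0-9]+$/, "", acc)
                    print acc
                }
            }
        }
    }'
}
```

For the WAP_RAT header above this prints the two gi numbers, P01174.2, P01174, WAP_RAT, CAA25600.2 and CAA25600, i.e. the same identifiers used in the commands listed.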
Some BLAST databases are built from fasta format data which uses an alternative header format. For example, a 'uniprotkb' BLAST database generated from the UniProtKB fasta files provided by EMBL-EBI, which use the fasta header format:
>SP:WAP_RAT P01174 Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2
For such headers the support for parsing the identifier in NCBI BLAST can be insufficient, in which case the entries can only be retrieved using the generic fasta identifier (i.e. the first "word" on the header line):
blastdbcmd -db uniprotkb -dbtype prot -entry 'SP:WAP_RAT' -get_dups
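If you are unsure what the generic identifiers look like in your own fasta file, a small filter like the following can list them. The function name fasta_ids and the file name in the usage comment are made up for this sketch.

```shell
# Hypothetical helper: print the generic fasta identifier (the first
# "word" of each header line, without the ">") for fasta data on stdin.
fasta_ids() {
    awk '/^>/ { print substr($1, 2) }'
}

# Example (file name is a placeholder):
#   fasta_ids < uniprotkb.fasta | head
```

The values printed are what you would pass to -entry for a database built from that file.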
The fastacmd program works in exactly the same way, but the command-line syntax is a little different. For example, fetching the example sequence from above using the UniProtKB sequence version accession uses the command-line:
fastacmd -d nr -pT -s 'P01174.2' -aT
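Since the two programs differ only in their command-line syntax, a script that has to run on machines with either the legacy tools or BLAST+ could hide the difference behind one function. This is just a sketch; fetch_entry is a made-up name, and it assumes a protein database as in the examples above.

```shell
# Hypothetical wrapper: fetch an entry from a protein BLAST database
# using whichever of blastdbcmd or fastacmd is available.
# Usage: fetch_entry DBNAME IDENTIFIER
fetch_entry() {
    db=$1
    id=$2
    if command -v blastdbcmd >/dev/null 2>&1; then
        blastdbcmd -db "$db" -dbtype prot -entry "$id" -get_dups
    elif command -v fastacmd >/dev/null 2>&1; then
        fastacmd -d "$db" -pT -s "$id" -aT
    else
        echo "neither blastdbcmd nor fastacmd was found" >&2
        return 127
    fi
}

# Example:
#   fetch_entry nr 'P01174.2'
```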
Note: fastacmd and blastdbcmd support batch retrieval using a comma separated list of identifiers, so when fetching many entries you may want to batch them for efficiency. The queries above use the -get_dups or -aT option to allow for cases where an identifier may correspond to multiple sequences (this shouldn't happen in these databases, but you never know).
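One possible way to do the batching is to turn a file with one identifier per line into comma separated chunks and run one look-up per chunk. The function name batch_ids, the chunk size of 500 and the file name ids.txt are arbitrary choices for this sketch, not anything the BLAST tools prescribe.

```shell
# Hypothetical helper: read identifiers (one per line) from stdin and
# print them as comma separated batches of at most N identifiers,
# one batch per output line. Blank lines are ignored.
# Usage: batch_ids N < ids.txt
batch_ids() {
    awk -v n="$1" '
        NF {
            # "," within a batch, newline between batches, nothing before the first id
            printf "%s%s", (c % n ? "," : (c ? "\n" : "")), $1
            c++
        }
        END { if (c) print "" }'
}

# Example: one blastdbcmd call per batch of 500 identifiers
#   batch_ids 500 < ids.txt | while IFS= read -r batch; do
#       blastdbcmd -db nr -dbtype prot -entry "$batch" -get_dups
#   done
```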
If you do not have an appropriate NCBI BLAST database for these look-ups, then web-based options such as those mentioned in the other answers (e.g. UniProt.org RESTful API, EMBL-EBI dbfetch, NCBI E-utils, etc.) may be more appropriate, depending on how much of the database you need. Otherwise you may want to download the data and appropriate indexing software (e.g. NCBI BLAST, EMBOSS, BioPerl, etc.) to perform the look-ups locally.
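For a handful of entries, the web route can be as simple as fetching the fasta record by accession. The sketch below assumes the URL pattern used by the UniProt REST service at the time of writing (rest.uniprot.org/uniprotkb/ACCESSION.fasta); check the UniProt documentation if it does not respond as expected. The function name uniprot_fasta_url is made up.

```shell
# Hypothetical helper: build the URL for the fasta record of a
# UniProtKB accession. The URL pattern is an assumption about the
# current UniProt REST service and may change.
uniprot_fasta_url() {
    printf 'https://rest.uniprot.org/uniprotkb/%s.fasta\n' "$1"
}

# Example (requires curl and network access):
#   curl -fsS "$(uniprot_fasta_url P01174)"
```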
Hey guys,
I see the thread is a little old, but I want to ask a question about the UniProt fasta file header.
In the header shown below:
which ones are UniProt IDs and which ones are accession numbers?
To get input, move your post as a new, separate question.