How To Get All Proteins Smaller Than 200 Amino Acids Out Of Ncbi Nr Database?
12.7 years ago
Niek De Klein ★ 2.6k

I want to get all proteins from the NCBI nr database that are smaller than 200 amino acids. I want to use them to build a local database to BLAST a small target protein against. I tried downloading nr.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA which is described as:

Sequence databases in FASTA format for use with the stand-alone BLAST programs.
These databases must be formatted using formatdb before they can be used with BLAST.

This was the closest thing I could find to get all the FASTA sequences, but I have never worked with the database FASTA format, and because of the size of the file (10 GB) I can only manage to open it with less. As far as I've seen, it seems to be mostly sequence headers.

So I'm looking for a way to download all nr protein sequences OR a different way to do a BLAST search against all proteins <= 200 a.a.

Thanks,
Niek

ncbi fasta protein • 4.5k views
12.7 years ago

You can run your blast with an entrez query string of:

1:200[slen]

That restricts the subject sequences to those between 1 and 200 amino acids long.


Does this lower the e-values because the database size gets smaller?

12.7 years ago

You can download the preformatted BLAST database and use the following command to build a BLAST database containing only the sequences of at most 200 amino acids:

fastacmd -p T -D 1 | gawk '{if(substr($1,1,1) == ">") {if (NR>1) {printf "\n%s\t", substr($1,1,length($1)-1)} else {printf "%s\t", substr($1,1,length($1)-1)}} else {printf "%s", $0}} END{printf "\n"}' | gawk 'BEGIN{OFS="\n"}length($2) < 201{print $1,$2}' | formatdb -p T -n nr_smaller_than_200aa -i stdin

It first uses fastacmd to dump the BLAST database in FASTA format (if you already have it in FASTA, you can skip this step). The first gawk command converts the FASTA records to tab-delimited (tbl) format. The second gawk filters by length (<201 aa) and writes the result back out in FASTA format. Finally, formatdb builds a new database named "nr_smaller_than_200aa" from the filtered sequences (<=200 aa).
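If you already have the FASTA file (for example the nr.gz download mentioned in the question), the length filter can also be done in a single awk pass that keeps the full header line. The following is only a minimal sketch, not a tested replacement for the pipeline above: the function name filter_fasta_by_length and the output file names are hypothetical, and makeblastdb is assumed to be available from BLAST+ as the counterpart of formatdb.

```shell
# Hypothetical helper: keep FASTA records whose sequence has at most MAX residues.
# Reads FASTA on stdin and writes FASTA on stdout, one line per sequence.
filter_fasta_by_length() {
    max_len="${1:-200}"   # default cutoff of 200 aa, as in the question
    awk -v max="$max_len" '
        # Print the buffered record if it is non-empty and short enough.
        function emit() {
            if (header != "" && length(seq) > 0 && length(seq) <= max) {
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Sequence lines are joined after stripping whitespace.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { emit() }
    '
}

# Example usage (file names are assumptions):
#   zcat nr.gz | filter_fasta_by_length 200 > nr_le200.fasta
#   makeblastdb -in nr_le200.fasta -dbtype prot -out nr_le200
```

Because the file is streamed through zcat, the 10 GB archive never has to be decompressed to disk or opened in an editor.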

12.7 years ago
Malcolm.Cook ★ 1.5k

You can also blast against just NCBI's short nr proteins by providing the entrez query '1:200[slen]' as a filter on the blast web page.

Or, if you prefer to run from the command line and don't want to download any FASTA databases, assuming you've installed BLAST+ from NCBI, you can add these options to your blast command:

   -db nr -remote -entrez_query '1:200[slen]'
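Put together, a complete remote search might look like the sketch below. The wrapper name blast_short_nr and the file names are hypothetical, blastp is assumed as the program because the query is a protein, and tabular output (-outfmt 6) is just one possible choice.

```shell
# Hypothetical wrapper: remote blastp against nr, restricted to subjects of 1-200 aa.
# $1 = query FASTA file, $2 = output file for the tabular hits.
blast_short_nr() {
    query_fasta="$1"
    out_file="$2"
    blastp -query "$query_fasta" -db nr -remote \
        -entrez_query '1:200[slen]' \
        -outfmt 6 -out "$out_file"
}

# Example usage (file names are assumptions):
#   blast_short_nr target_protein.fasta short_nr_hits.tsv
```

The search runs on NCBI's servers, so nothing needs to be downloaded or formatted locally, but it does need network access.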
