How big is the protein database on NCBI and how do I download it?
23 months ago
nyck33 ▴ 10

I want local copies because I am trying to implement parallel BLAST using Apache Spark. I chose bioinformatics as the domain because I took an intro course. I apologize if this question comes across as a bit lazy, but I thought there might be multiple options. I'm looking at https://ftp.ncbi.nlm.nih.gov/ but can't figure out which directory holds the protein database.

My plan is to download SARS-CoV-2 protein sequences and run BLAST.

ncbi blast • 1.9k views
23 months ago
GenoMax 147k

There are many pre-formatted blast protein databases available at NCBI: https://ftp.ncbi.nih.gov/blast/db/ They can be hundreds of GB. See this README file (section 2.2) to identify the protein ones.


Yes, I saw this too: https://ftp.ncbi.nlm.nih.gov/blast/db/

refseq_protein.00.tar.gz 2022-12-06 01:45 8.9G
refseq_protein.00.tar.gz.md5 2022-12-06 01:45 59
refseq_protein.01.tar.gz 2022-12-06 01:45 2.1G
refseq_protein.01.tar.gz.md5 2022-12-06 01:45 59
refseq_protein.02.tar.gz 2022-12-06 01:45 2.1G
refseq_protein.02.tar.gz.md5 2022-12-06 01:45 59
refseq_protein.03.tar.gz 2022-12-06 01:45 2.1G
refseq_protein.03.tar.gz.md5 2022-12-06 01:45 59
refseq_protein.04.tar.gz 2022-12-06 01:45 2.1G
refseq_protein.04.tar.gz.md5 2022-12-06 01:45 59
refseq_protein.05.tar.gz 2022-12-06 01:46 2.1G
refseq_protein.05.tar.gz.md5 2022-12-06 01:46 59
refseq_protein.06.tar.gz 2022-12-06 01:46 2.1G
refseq_protein.06.tar.gz.md5 2022-12-06 01:46 59
refseq_protein.07.tar.gz 2022-12-06 01:46 2.1G
refseq_protein.07.tar.gz.md5 2022-12-06 01:46 59
refseq_protein.08.tar.gz 2022-12-06 01:46 2.1G
refseq_protein.08.tar.gz.md5 2022-12-06 01:46 59
refseq_protein.09.tar.gz 2022-12-06 01:46 2.1G
refseq_protein.09.tar.gz.md5 2022-12-06 01:46 59
refseq_protein.10.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.10.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.11.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.11.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.12.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.12.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.13.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.13.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.14.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.14.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.15.tar.gz 2022-12-06 01:47 2.1G
refseq_protein.15.tar.gz.md5 2022-12-06 01:47 59
refseq_protein.16.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.16.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.17.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.17.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.18.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.18.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.19.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.19.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.20.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.20.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.21.tar.gz 2022-12-06 01:48 2.1G
refseq_protein.21.tar.gz.md5 2022-12-06 01:48 59
refseq_protein.22.tar.gz 2022-12-06 01:49 2.1G
refseq_protein.22.tar.gz.md5 2022-12-06 01:49 59
refseq_protein.23.tar.gz 2022-12-06 01:49 2.1G
refseq_protein.23.tar.gz.md5 2022-12-06 01:49 59
refseq_protein.24.tar.gz 2022-12-06 01:49 2.1G
refseq_protein.24.tar.gz.md5 2022-12-06 01:49 59
refseq_protein.25.tar.gz 2022-12-06 01:49 2.1G
refseq_protein.25.tar.gz.md5 2022-12-06 01:49 59
refseq_protein.26.tar.gz 2022-12-06 01:49 2.1G
refseq_protein.26.tar.gz.md5 2022-12-06 01:49 59
refseq_protein.27.tar.gz 2022-12-06 01:50 2.1G
refseq_protein.27.tar.gz.md5 2022-12-06 01:50 59
refseq_protein.28.tar.gz 2022-12-06 01:50 2.1G
refseq_protein.28.tar.gz.md5 2022-12-06 01:50 59
refseq_protein.29.tar.gz 2022-12-06 01:50 2.1G
refseq_protein.29.tar.gz.md5 2022-12-06 01:50 59
refseq_protein.30.tar.gz 2022-12-06 01:50 1.5G
refseq_protein.30.tar.gz.md5 2022-12-06 01:50 59

That is quite big if I have to download every numbered volume to get the full set.
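For a rough total, the listing can be summed with a short script. This is only a sketch, assuming the listing above has been saved to a hypothetical file named listing.txt and that every size ends in G or M as shown; the function name is made up for illustration.

```shell
# Sum the size column of the .tar.gz lines (the .md5 lines are skipped)
# in a saved directory listing whose last field is a size such as 2.1G.
total_listing_gb() {
  awk '$1 ~ /\.tar\.gz$/ {
         size  = $NF
         unit  = substr(size, length(size))            # trailing G or M
         value = substr(size, 1, length(size) - 1) + 0 # numeric part
         if (unit == "M") value /= 1024                # convert MB to GB
         else if (unit != "G") next                    # ignore other units
         sum += value; n++
       }
       END { printf "%d volumes, %.1f GB compressed\n", n, sum }' "$1"
}
```

Run as `total_listing_gb listing.txt` on the listing above, this should report 31 volumes and about 71.3 GB compressed; the unpacked database is likely to be larger.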

Thanks.


You can either make your own database or, if you want a preformatted one, pataa.tar.gz may be on the smaller end.
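Making your own database could look roughly like the sketch below. It assumes BLAST+ is installed and on your PATH, and that you already have a FASTA file of protein sequences; the function name, file names and database name are placeholders, not anything NCBI provides.

```shell
# Build a protein BLAST database from a FASTA file with makeblastdb.
# Usage: build_protein_db <protein FASTA> <database name>
build_protein_db() {
  fasta=$1     # input FASTA of protein sequences (placeholder name)
  db_name=$2   # prefix for the database files that get written
  # Refuse to run on a missing or empty input file.
  [ -s "$fasta" ] || { echo "missing or empty FASTA: $fasta" >&2; return 1; }
  makeblastdb -in "$fasta" -dbtype prot -out "$db_name" -title "$db_name"
}
```

For example, `build_protein_db my_proteins.fasta my_proteins` followed by `blastp -query query.fasta -db my_proteins -outfmt 6` would search the new database and print tabular hits.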

23 months ago

If your purpose is to get a small dataset of SARS-CoV-2 proteins to use as queries in BLAST, you might first try going to the SARS-CoV-2 genome pages at NCBI, where there is probably a FASTA file containing all proteins from the virus. You didn't say what you wanted to search for using BLAST. The Swiss-Prot/UniProt database is less than 400 MB, and is the best first place to search for known proteins, including viral proteins. If you really need RefSeq, then yes, that involves downloading all the RefSeq files and decompressing them.

If you're using Linux or Mac, the BIRCH system has easy point-and-click tools implemented using BioLegato, including tasks such as doing Entrez queries to find proteins using keywords, as seen in the tutorial "Creating datasets of related sequences". BioLegato gives you a complete point-and-click system for installing, updating and managing BLAST databases on your system; see "Installing local copies of NCBI databases". One of the functions creates a spreadsheet of databases available at NCBI with an estimate of final uncompressed size, and a report of locally installed copies of the databases. Of course, BioLegato can also search these databases using BLAST. If you want to implement Apache Spark for parallel BLAST, BIRCH makes it easy to seamlessly add new programs to the BioLegato menus. The BLAST database management tools can be seen in action in the video "Installing BLAST databases on your own computer".

23 months ago
lennykovac ▴ 110

There is a Perl script in the installation directory of BLAST that downloads and updates your databases: update_blastdb.pl.

From the NCBI documentation

nr.##.tar.gz A collection of protein sequences with entries from GenPept, Swissprot, PDB, PRF, PIR and NCBI Reference Sequence (RefSeq) project.

You can invoke the script and it will download the desired databases!

cd blast/bin
perl update_blastdb.pl --showall                          # will show you all available databases
perl update_blastdb.pl --decompress nr                    # will download the whole protein database
perl update_blastdb.pl --decompress nr.00                 # will download a subset of the protein database

Or if you don't want to invoke the default script, try wget.

wget -b "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.??.tar.gz"        # will download the whole protein database
wget -b "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz"        # will only download the given subset of the protein database (00) 

Both methods will work for refseq_protein as well; you just have to specify the name.
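If you go the wget route, the volumes still have to be checked and unpacked by hand. Here is a minimal sketch of that step, assuming you also downloaded the matching .tar.gz.md5 files into the same directory, that those files are in the format `md5sum -c` expects, and that GNU md5sum is available (on a Mac you would need an equivalent). The function name is made up for illustration.

```shell
# Verify each downloaded volume against its .md5 file, then unpack it.
# Usage: verify_and_extract <download directory> <database name>
verify_and_extract() {
  db_dir=$1     # directory holding the .tar.gz and .tar.gz.md5 files
  db_name=$2    # e.g. nr or refseq_protein
  (
    cd "$db_dir" || exit 1
    for archive in "$db_name".*.tar.gz; do
      # If the glob matched nothing, the literal pattern is left behind.
      [ -e "$archive" ] || { echo "no volumes found for $db_name" >&2; exit 1; }
      md5sum -c "$archive.md5" || exit 1   # stop on a corrupt download
      tar -xzf "$archive" || exit 1        # unpack the database files
    done
  )
}
```

For example, `verify_and_extract ~/blastdb refseq_protein` would check and unpack every refseq_protein volume found in ~/blastdb, stopping at the first one that fails its checksum.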
