I have a set of 100 amino acid sequences and I want to perform a BLASTP search against the refseq_protein
database. Accordingly, I set up the standalone version of BLAST (version 2.11.0+) and downloaded the refseq_protein database from NCBI using the following command:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*.faa.gz
The database gets downloaded as 3027 zipped files containing FASTA sequences. I unzipped all these files and concatenated them into a single file refseq_protein.faa
(which is around 95 GB in size). Now when I run the following Python code
from Bio.Blast.Applications import NcbimakeblastdbCommandline
from Bio.Blast.Applications import NcbiblastpCommandline
# Build the protein database from the concatenated FASTA file
cline = NcbimakeblastdbCommandline(dbtype = "prot", input_file = "D:\\refseq_protein.faa", out = "refseq_protein")
# Search the query sequences against that database
blastp_cline = NcbiblastpCommandline(query = "D:\\DEP_sequences.fasta", db = "refseq_protein", evalue = 0.01, outfmt = "7 sseqid evalue qcovs pident")
cline()
response = blastp_cline()
the NcbimakeblastdbCommandline function keeps creating multiple .phr, .pin, .psq, etc. files, which take up a lot of space (in a demo run it had created ~30 GB of these files and was still running). I'm afraid this will exhaust the entire space available on my internal hard drive. I'm wondering if there is a way to estimate the total size of the files that NcbimakeblastdbCommandline would create. This would help me decide whether or not to switch to external storage to perform the BLASTP search.
I am aware that a pre-formatted refseq_protein database exists, but I'm not sure what value should be passed to the db parameter of the NcbiblastpCommandline function, because it asks for the name of the database against which the BLASTP search is to be performed. In the approach that I chose, I had the liberty to set the name of the database.
Any suggestions on how to solve this issue would be appreciated.
If you download the refseq_protein database files from NCBI, simply use refseq_protein as the base name for the database in your blastp command line.
When your input FASTA is 95 GB, you are going to get several large files when you make the database. That is normal.
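To illustrate the point about the base name, here is a minimal sketch that assembles a blastp command pointing at a pre-formatted database. It assumes the pre-formatted refseq_protein volumes have been extracted into a single directory; the directory `D:\blastdb`, the output file name, and the helper names `build_blastp_command` and `run_blastp` are hypothetical and not from the original post. The blastp options mirror the ones used in the question.

```python
import os
import subprocess


def build_blastp_command(query, db_dir, db_name="refseq_protein",
                         out_file="blastp_results.txt"):
    """Return the blastp argument list for a database stored in db_dir.

    The -db value is the directory plus the base name (no file extension);
    blastp locates the individual volume files from that base name.
    """
    db = os.path.join(db_dir, db_name)
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-evalue", "0.01",
        "-outfmt", "7 sseqid evalue qcovs pident",
        "-out", out_file,
    ]


def run_blastp(cmd):
    """Run the command; requires the blastp executable to be on PATH."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the files actually live.
    cmd = build_blastp_command("D:\\DEP_sequences.fasta", "D:\\blastdb")
    print(" ".join(cmd))
    # run_blastp(cmd)  # uncomment once the database and blastp are in place
```

Writing results to a file with `-out` rather than capturing them in memory is a design choice here, not something the original code did.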
Is there a ratio between the size of the input fasta file and that of the generated files? Would 1TB of storage be sufficient?
The refseq_protein pre-formatted database files appear to be about 178 GB as of today, so 1 TB should be enough.
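Before starting a download or a makeblastdb run, the free space on the target drive can be checked from Python. The sketch below uses the standard library's `shutil.disk_usage`; the 178 GB figure is the size quoted in this thread and will change over time, and the helper names and the `D:\` target are assumptions for illustration only.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def free_space_gb(target_dir):
    """Return the free space, in GiB, on the drive that holds target_dir."""
    return shutil.disk_usage(target_dir).free / GIB


def has_enough_space(target_dir, required_gb):
    """Return True if the drive holding target_dir has at least required_gb GiB free."""
    return free_space_gb(target_dir) >= required_gb


if __name__ == "__main__":
    # Hypothetical target; 178 is the database size reported in this thread.
    target = "D:\\"
    try:
        print(f"Free space: {free_space_gb(target):.1f} GiB")
        print("Enough for the database:", has_enough_space(target, 178))
    except FileNotFoundError:
        print(f"{target} does not exist on this machine")
```

Leaving some headroom above the quoted size is sensible, since the download archives and the extracted files may both be on disk at the same time.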