Question

makeblastdb creating multiple files of unexpectedly large sizes

2.9 years ago

accibio ▴ 20

I have a set of 100 amino acid sequences and I want to perform a BLASTP search against the refseq_protein database. To do this, I set up the standalone version of BLAST (version 2.11.0+) and downloaded the refseq_protein database from NCBI using the following command:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*.faa.gz

The database gets downloaded as 3027 zipped files containing FASTA sequences. I unzipped all these files and concatenated them into a single file refseq_protein.faa (which is around 95 GB in size). Now when I run the following Python code

from Bio.Blast.Applications import NcbimakeblastdbCommandline
from Bio.Blast.Applications import NcbiblastpCommandline

# Build a protein BLAST database from the concatenated FASTA file.
cline = NcbimakeblastdbCommandline(
    dbtype="prot", input_file="D:\\refseq_protein.faa", out="refseq_protein"
)

# Search the query sequences against that database.
blastp_cline = NcbiblastpCommandline(
    query="D:\\DEP_sequences.fasta", db="refseq_protein",
    evalue=0.01, outfmt="7 sseqid evalue qcovs pident"
)

cline()

response = blastp_cline()

the NcbimakeblastdbCommandline call keeps creating multiple .phr, .pin, .psq, etc. files that take up a lot of space (in a demo run it had created ~30 GB of these files and was still running). I'm afraid this will exhaust the space available on my internal hard drive. Is there a way to estimate the total size of the files that NcbimakeblastdbCommandline will create? This would help me decide whether to switch to external storage to perform the BLASTP search.

I am aware that a pre-formatted refseq_protein database exists, but I'm not sure what value to pass in the db parameter of NcbiblastpCommandline, because it asks for the name of the database to search against. In the approach I chose, I was free to set the database name myself.

Any suggestions on how to solve this issue would be appreciated.

biopython BLAST makeblastdb refseq_protein FASTA • 1.4k views

ADD COMMENT • link updated 2.9 years ago by GenoMax 147k • written 2.9 years ago by accibio ▴ 20

refseq_protein database exists but I'm not sure what value is to be passed in the db parameter of the NcbiblastpCommandline function

If you download the pre-formatted refseq_protein database files from NCBI, simply use refseq_protein as the database base name in your blastp command line.
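As a rough sketch of what that looks like in practice: the example below calls blastp through Python's subprocess module and passes refseq_protein as the -db value. It assumes the downloaded database files have been extracted into a single directory and that blastp is on the PATH. The helper names build_blastp_command and run_blastp, and any directory passed to them, are made up for illustration. Setting the BLASTDB environment variable is one way to tell BLAST where to look for the database files.

```python
import os
import subprocess


def build_blastp_command(query, db="refseq_protein", evalue=0.01,
                         outfmt="7 sseqid evalue qcovs pident"):
    """Return the blastp argument list (hypothetical helper).

    db is the database base name, not the name of any single file.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-outfmt", outfmt,
    ]


def run_blastp(query, db_dir, **kwargs):
    """Run blastp with BLASTDB pointing at the directory that holds the database."""
    env = dict(os.environ, BLASTDB=db_dir)
    result = subprocess.run(build_blastp_command(query, **kwargs),
                            env=env, capture_output=True, text=True, check=True)
    return result.stdout
```

A call might look like run_blastp("D:\\DEP_sequences.fasta", "D:\\blastdb"), where D:\blastdb is a placeholder for wherever the database files were extracted.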

makeblastdb creating multiple files of unexpectedly large size

When your input FASTA is 95 GB, you are going to get several large files when you make the database. That is normal.
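For context, makeblastdb splits a large input into numbered volumes, each with its own .phr, .pin and .psq files, which is why many files appear. The sketch below builds the equivalent makeblastdb command for subprocess. The helper name build_makeblastdb_command is hypothetical, and the optional -max_file_sz value is only an illustration: that option caps the size of each individual file, so it changes how many volumes are written, not the total space needed.

```python
import subprocess


def build_makeblastdb_command(fasta_path, out_name, max_file_sz=None):
    """Return the makeblastdb argument list for a protein database (hypothetical helper)."""
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", out_name]
    if max_file_sz is not None:
        # Upper limit per database file, e.g. "4GB"; affects the volume count only.
        cmd += ["-max_file_sz", max_file_sz]
    return cmd


def run_makeblastdb(fasta_path, out_name, max_file_sz=None):
    """Run makeblastdb; assumes the executable is on the PATH."""
    subprocess.run(build_makeblastdb_command(fasta_path, out_name, max_file_sz),
                   check=True)
```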

ADD REPLY • link 2.9 years ago by GenoMax 147k

Is there a ratio between the size of the input fasta file and that of the generated files? Would 1TB of storage be sufficient?

ADD REPLY • link 2.9 years ago by accibio ▴ 20

The pre-formatted refseq_protein database files appear to be about 178 GB as of today, so 1 TB should be enough.
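One way to sanity-check this before committing to a drive is to compare its free space against the expected database size, and to watch how much a partial build has written so far. The sketch below uses only the Python standard library. The helper names blast_db_size_bytes and has_room are hypothetical, and the 178 GB figure is just the number quoted above, so a database built locally with makeblastdb may come out at a different size.

```python
import shutil
from pathlib import Path


def blast_db_size_bytes(directory, base_name,
                        skip_suffixes=(".faa", ".fasta", ".gz")):
    """Sum the sizes of files named base_name.* in directory.

    Files ending in skip_suffixes are ignored so that the source FASTA
    is not counted along with the database files.
    """
    return sum(
        p.stat().st_size
        for p in Path(directory).iterdir()
        if p.is_file()
        and p.name.startswith(base_name + ".")
        and p.suffix not in skip_suffixes
    )


def has_room(path, required_bytes):
    """Return True if the drive holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes
```

For example, has_room("E:\\", 178 * 10**9) would check a placeholder external drive E: against the size quoted above.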

ADD REPLY • link 2.9 years ago by GenoMax 147k