Question

How to Speed Up BLASTp

0

4.0 years ago

twangxxx • 0

Hello,

I have a FASTA file containing 140 protein sequences from distinct viruses, and I would like to identify which protein comes from which virus.

I am using a Linux cluster where BLAST is available as a module, and the virus and NCBI nr databases are stored in my own directory on the cluster (correct me if I am using the wrong terminology).

I set up my blastp as below:

 blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt  -num_threads 10 -max_target_seqs 1

and requested the resources from cluster as:

#PBS -l mem=64gb,nodes=10:ppn=1,walltime=10:00:00

It has been running for around 10 hours and no results have been written to output.txt yet. I am wondering if there is a better way to set the RAM, nodes, or processes per node to speed up the BLASTp run. Thank you so much!

Here is the info about the Linux cluster:

66 compute nodes. Each node has two 14-core Intel processors (2.40GHz) sharing 128 GB of memory.

blastp linux-cluster BLAST nr-database • 6.1k views

ADD COMMENT • link updated 4.0 years ago by Mensur Dlakic ★ 29k • written 4.0 years ago by twangxxx • 0

1

Have you downloaded all files for the nr database from NCBI and uncompressed them in your directory? If you take a single sequence and run a quick search against this database, do you see results in < 30 min? (It will take a while to read the database files.)
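The single-sequence test can be set up as sketched below. The function name and the file names one.fa and one.tsv are illustrative assumptions; the function only pulls the first record out of the FASTA file, and the search itself is shown as a comment.

```shell
#!/usr/bin/env bash
# Print only the first record of a FASTA file
# (the first header line plus the sequence lines that follow it).
first_fasta_record() {
    awk '/^>/ { n++ } n > 1 { exit } n == 1 { print }' "$1"
}

# Hypothetical usage on the cluster:
#   first_fasta_record proteins.fa > one.fa
#   time blastp -db nr -query one.fa -outfmt 6 -out one.tsv -num_threads 10
```

Timing this one query gives a rough idea of how long the database takes to load before any hits are written.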

ADD REPLY • link 4.0 years ago by GenoMax 151k

0

Thank you so much for the reply. I did download and uncompress all nr database files from NCBI in my directory. Following your suggestion and the suggestions below, I am now running blastp with -num_threads 10 to search a single sequence against the whole nr database, using mem=120gb,nodes=1:ppn=14. I hope this will run faster.

Also, can you suggest a way to restrict the search to protein sequences that come only from viruses?

ADD REPLY • link 4.0 years ago by twangxxx • 0

0

You can use the -taxids 10239 option (10239 is the taxID for viruses) in your blastp command to limit your local search to viruses. This will require you to download the taxonomy file from the same location where you downloaded the nr indexes and keep it in the same directory as your BLAST indexes.
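A note on BLAST+ versions: as far as I am aware, releases before 2.15.0 match only sequences assigned directly to the taxID given, so a high-level taxID such as 10239 has to be expanded to species-level taxIDs first. The get_species_taxids.sh script bundled with BLAST+ does this (it needs NCBI EDirect installed). A minimal sketch under those assumptions follows; the file names and the run / virus_search helpers are hypothetical, and commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Dry-run helper: always print the command, execute it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

# Search nr restricted to the taxIDs listed (one per line) in a file.
# $1 = query FASTA, $2 = taxID list file, $3 = output table
virus_search() {
    run blastp -db nr -query "$1" -taxidlist "$2" \
        -outfmt 6 -out "$3" -num_threads "${PBS_NUM_PPN:-1}"
}

# Typical sequence on the cluster:
#   get_species_taxids.sh -t 10239 > viruses.txids   # expand Viruses to species level
#   DRY_RUN=0 virus_search proteins.fa viruses.txids output.txt
```

Check the release notes of the BLAST+ version loaded by your cluster module before relying on either form.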

ADD REPLY • link 4.0 years ago by GenoMax 151k

0

It has been over two hours since I started the single-sequence blastp against the whole nr database, as mentioned in my previous reply, and it still has not completed.

I am running blastp with -num_threads 10 to search a single sequence against the whole nr database, using mem=120gb,nodes=1:ppn=14. I hope this will run faster.

So, I am considering building a local database only including protein sequences from viruses.

How can I download all the virus protein data from NCBI?

I found a website here, but I am not sure how to download all the FASTA files from the command line or with any available tool.

ADD REPLY • link 4.0 years ago by twangxxx • 0

0

considering building a local database only including protein sequences from viruses.

I think you are best off getting the viral proteins from the link Mensur Dlakic had provided below for UniProt.

That said, you can download them using the Download button on the NCBI page you linked above.

ADD REPLY • link 4.0 years ago by GenoMax 151k

0

To load the nr DB into memory (especially with the newest binaries), you will need to request all the memory of a node: 120 GB should be enough to use the DB, while the 64 GB you requested will likely not work.

ADD REPLY • link 4.0 years ago by lieven.sterck 15k

score 1 · Answer 1 · 2021-06-21

1

4.0 years ago

h.mon 35k

You are requesting 10 nodes with 1 processor per node; however, blastp can only use one node. You should use:

#PBS -l mem=128gb,nodes=1:ppn=14,walltime=10:00:00

There are ways of splitting the input fasta file and submitting to several nodes, but with 140 sequences as input, it is not necessary.
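For completeness, a round-robin split can be sketched with awk as below. The function name, the chunk prefix, and the one-job-per-chunk submission are illustrative assumptions, and with 140 sequences none of this is needed.

```shell
#!/usr/bin/env bash
# Split a FASTA file into N chunks, dealing records out round-robin.
# $1 = input FASTA, $2 = number of chunks, $3 = output prefix
# Writes PREFIX.0.fa ... PREFIX.(N-1).fa; keep N small (one open file per chunk).
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ { out = sprintf("%s.%d.fa", prefix, count++ % n) }
        out  { print > out }
    ' "$1"
}

# Hypothetical usage (blast_job.pbs is an assumed job script that reads $QUERY):
#   split_fasta proteins.fa 4 chunk
#   for f in chunk.*.fa; do qsub -v QUERY="$f" blast_job.pbs; done
```

Each chunk then runs as its own single-node job, and the tabular outputs can simply be concatenated afterwards.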

You should contact the cluster administrators for instructions on how to properly use the Torque / PBS resource manager. Before downloading NT / NR, you should also ask whether these databases are already available at a centrally managed location; as they are widely used, this is commonly the case.

ADD COMMENT • link 4.0 years ago by h.mon 35k

0

Thank you so much, I changed my PBS setting as you suggested. I am afraid there is no database available in a shared location in the cluster, so I downloaded and uncompressed the whole NCBI nr database in my directory.

Also, I am wondering how to set -num_threads in the blastp command so that it makes the best use of this PBS request.

ADD REPLY • link 4.0 years ago by twangxxx • 0

0

You can try the variable $PBS_NUM_PPN (number of CPUs per node):

 blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt  \
    -num_threads $PBS_NUM_PPN -max_target_seqs 1

Again, the cluster administrators will probably be better positioned to help you.

If you are asking how much to request, ask for all the processors (and memory) of the node, as NT / NR are really big.
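One caveat with $PBS_NUM_PPN is that it is empty outside a Torque job, for example when testing the command on a login node. A small sketch with a fallback of 1 thread follows; the function name is a hypothetical helper, not part of BLAST or PBS.

```shell
#!/usr/bin/env bash
# Pick a thread count for blastp.
# Inside a Torque/PBS job PBS_NUM_PPN holds the ppn value of the request;
# outside a job it is unset, so fall back to a single thread.
blast_threads() {
    printf '%s\n' "${PBS_NUM_PPN:-1}"
}

# Usage inside the job script:
#   blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt \
#       -num_threads "$(blast_threads)" -max_target_seqs 1
```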

ADD REPLY • link 4.0 years ago by h.mon 35k

score 1 · Answer 2 · 2021-06-21

1

4.0 years ago

Mensur Dlakic ★ 29k

Another thing that may help is searching against a virus-only database, since at least 99.5% of nr entries are non-viral. Files for specific taxonomic divisions can be downloaded from this link:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

There are two files (sprot and trembl) for each group, and you would need the .dat.gz files. Those are in EMBL format, so you will need a program to convert them to FASTA. I know that a little utility called esl-reformat from the HMMER package can do it, and there are likely to be others.
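A minimal sketch of that route, assuming the virus files in that directory are named uniprot_sprot_viruses.dat.gz and uniprot_trembl_viruses.dat.gz, and that esl-reformat and makeblastdb are on the PATH. The helper names and the output database name uniprot_viruses are hypothetical.

```shell
#!/usr/bin/env bash
# Map a UniProt flat-file name to a FASTA name: path/foo.dat.gz -> foo.fa
fasta_name() {
    local base="${1##*/}"   # drop any directory part
    base="${base%.gz}"      # drop .gz if present
    printf '%s\n' "${base%.dat}.fa"
}

# Decompress one .dat.gz file into the current directory and
# convert it to FASTA with esl-reformat.
dat_to_fasta() {
    local fa
    fa="$(fasta_name "$1")"
    gunzip -c "$1" > "${fa%.fa}.dat"
    esl-reformat -o "$fa" fasta "${fa%.fa}.dat"
}

# Typical sequence:
#   dat_to_fasta uniprot_sprot_viruses.dat.gz
#   dat_to_fasta uniprot_trembl_viruses.dat.gz
#   cat uniprot_sprot_viruses.fa uniprot_trembl_viruses.fa > viruses.fa
#   makeblastdb -in viruses.fa -dbtype prot -out uniprot_viruses
#   blastp -db uniprot_viruses -query proteins.fa -outfmt 6 -out output.txt
```

The resulting database is far smaller than nr, so it should need much less memory and time per query.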

ADD COMMENT • link 4.0 years ago by Mensur Dlakic ★ 29k

0

Thank you for the reply. I read the manual of the HMMER package and understood the esl-reformat utility to be for nucleotide sequence format conversion, so it probably won't work for protein sequences. Do you have any other tools to recommend?

ADD REPLY • link 4.0 years ago by twangxxx • 0

1

esl-reformat works for protein sequences. In fact, it will automatically figure out the type of sequence, although it can be specified on the command line if needed. It is easy enough; why don't you give it a try?

ADD REPLY • link 4.0 years ago by Mensur Dlakic ★ 29k