I have a fasta file including 140 protein sequences from distinct viruses and I would like to identify which protein comes from which virus.
I am using a Linux cluster where BLAST is available as a module, and the virus and NCBI nr databases are stored in my own directory on the cluster (correct me if I used the wrong terminology).
The job has been running for around 10 hours and no results have been written to output.txt yet. I am wondering if there is a better way to set up RAM, nodes, or processes per node to speed up the BLASTp run. Thank you so much!
Here is the info about the Linux cluster:
66 compute nodes. Each node has two 14-core Intel
processors (2.40GHz) sharing 128 GB of memory.
Have you downloaded all files for the nr database from NCBI and uncompressed them in your directory? If you take a single sequence and run a quick search against this database, do you see results in < 30 min (it will take a while to read the database files)?
Thank you so much for the reply. I did download and uncompress all nr database files from NCBI in my directory. Following your suggestion and the suggestions below, I am now running blastp with -num_threads 10 to search a single sequence against the whole nr database, using mem=120gb,nodes=1:ppn=14. I hope this will run faster.
Also, do you have a suggested method to limit the protein sequence database to sequences that come only from viruses?
You can use the -taxids 10239 option (10239 is the taxID for viruses) in your blastp command to limit your local search to viruses. This will require you to download the taxonomy file from the same location where you downloaded the nr indexes and keep it in the same directory as your blast indexes.
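As a rough illustration of the suggestion above, here is a minimal Python sketch that assembles such a blastp command line. The file names, the database path, and the helper name `build_blastp_command` are placeholders I made up for the example; only the blastp options themselves (`-query`, `-db`, `-taxids`, `-num_threads`, `-outfmt`, `-out`) are real. Note that `-taxids` needs a reasonably recent BLAST+ and a database with taxonomy information, and on older releases a high-level taxID such as 10239 may have to be expanded into species-level IDs first, so it is worth checking the documentation for the installed version.

```python
import shlex

VIRUSES_TAXID = 10239  # NCBI taxonomy ID for Viruses, as given in the thread


def build_blastp_command(query, db, out, threads=14, taxid=VIRUSES_TAXID):
    """Return the argument list for a taxonomy-restricted blastp search.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    return [
        "blastp",
        "-query", str(query),
        "-db", str(db),
        "-taxids", str(taxid),        # restrict hits to this taxon
        "-num_threads", str(threads),  # note the underscore, not a hyphen
        "-outfmt", "6",                # tabular output
        "-out", str(out),
    ]


if __name__ == "__main__":
    cmd = build_blastp_command("query.fasta", "/path/to/blastdb/nr", "output.txt")
    # Print the command so it can be pasted into a PBS script.
    print(shlex.join(cmd))
    # To run it directly instead: subprocess.run(cmd, check=True)
```

Printing the command rather than running it keeps the sketch safe to try on a login node; the printed line can then go into the job script.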
It has been over two hours since I started the single-sequence blastp against the whole nr database that I mentioned in my previous reply, and it still hasn't completed.
I am running blastp with -num_threads 10 to search a single sequence
against the whole nr database, using mem=120gb,nodes=1:ppn=14. I hope this
will run faster.
So, I am considering building a local database only including protein sequences from viruses.
How can I download all the virus protein data from NCBI?
I found a website here, but I am not sure how to download all the fasta files from the command line or with any available tool.
To load the nr DB in memory (especially with the newest binaries), you will need to request all the memory of a node (120GB should be OK to use the DB; the requested 64gb will likely not work).
There are ways of splitting the input fasta file and submitting to several nodes, but with 140 sequences as input, it is not necessary.
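For completeness, the splitting approach mentioned above could look something like the following Python sketch. It is not needed for 140 sequences, and the function names and output naming scheme are my own choices for illustration rather than any standard tool.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line.strip() and header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n_parts, out_dir="."):
    """Distribute records round-robin into up to n_parts files.

    Returns the list of files written (empty parts are skipped).
    """
    records = list(read_fasta(path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(n_parts):
        part = records[i::n_parts]  # every n_parts-th record, starting at i
        if not part:
            continue
        out_path = out_dir / f"{Path(path).stem}.part{i + 1}.fasta"
        with open(out_path, "w") as out:
            for header, seq in part:
                out.write(f">{header}\n{seq}\n")
        written.append(out_path)
    return written
```

Each resulting file could then be submitted as its own job, one per node, and the tabular outputs concatenated afterwards.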
You should contact the cluster administrators for instructions on how to properly use Torque / PBS resource manager. And before downloading NT / NR, you should also ask if these databases are already available at a centrally managed location - as they are widely used, this is commonly the case.
Thank you so much, I changed my PBS setting as you suggested. I am afraid there is no database available in a shared location in the cluster, so I downloaded and uncompressed the whole NCBI nr database in my directory.
Also, I am wondering how to properly set up -num_threads in blastp command to speed up based on this PBS request.
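One common approach, sketched here in Python, is to make the thread count follow whatever was requested with ppn, so that a ppn=14 request is not paired with only 10 threads. This assumes the job runs under Torque and that it exports the `PBS_NUM_PPN` environment variable; the function name `threads_for_blast` is hypothetical, and the cluster administrators can confirm which variables are actually set.

```python
import os


def threads_for_blast(env=None):
    """Pick a -num_threads value that matches the PBS request.

    Assumes Torque exports PBS_NUM_PPN (cores per node requested via ppn=...).
    Falls back to the number of CPUs visible to Python if it is not set.
    """
    env = os.environ if env is None else env
    ppn = env.get("PBS_NUM_PPN")
    if ppn and ppn.isdigit() and int(ppn) > 0:
        return int(ppn)
    return os.cpu_count() or 1


if __name__ == "__main__":
    # Value to pass to blastp as: -num_threads <value>
    print(threads_for_blast())
```

Since blastp runs on a single node, requesting nodes=1 with ppn equal to the thread count keeps the request and the actual usage consistent.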
Another thing that may help is searching against a virus-only database, since at least 99.5% of the entries in nr are non-viral. Specific taxonomic entries can be downloaded from this link:
There are two files (sprot and trembl) for each group, and you would need the .dat.gz files. Those are in EMBL format, so you will need a program to convert them to FASTA. I know that a little utility called esl-reformat from the HMMer package can do it, and there are likely to be others.
Thank you for the reply. I read the manual of the HMMer package and found that the esl-reformat utility is for nucleotide sequence format conversion, so it probably won't work for protein sequences. Are there any other tools you would recommend?
esl-reformat works for protein sequences. In fact, it will automatically figure out the type of sequence, although it can be specified on the command-line if needed. It is easy enough, why don't you give it a try?
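If esl-reformat is not at hand, the conversion is also simple enough to sketch in plain Python. The example below assumes the usual UniProt flat-file layout (an `ID` line, an `AC` line, an `SQ` line followed by indented sequence lines, and `//` closing each entry). The function names are made up for the example, and the header it writes (accession followed by entry name) is a simplification, not UniProt's own FASTA header format.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def uniprot_dat_to_fasta(dat_path, fasta_path):
    """Convert a UniProt flat file (.dat or .dat.gz) to FASTA.

    Returns the number of records written.
    """
    count = 0
    entry_id, accession, seq, in_seq = None, None, [], False
    with open_maybe_gzip(dat_path) as src, open(fasta_path, "w") as out:
        for line in src:
            if line.startswith("ID   "):
                entry_id = line.split()[1]
            elif line.startswith("AC   ") and accession is None:
                # Keep only the primary (first) accession.
                accession = line[5:].split(";")[0].strip()
            elif line.startswith("SQ   "):
                in_seq = True  # sequence lines follow until "//"
            elif line.startswith("//"):
                if entry_id and seq:
                    sequence = "".join(seq)
                    out.write(f">{accession or entry_id} {entry_id}\n{sequence}\n")
                    count += 1
                entry_id, accession, seq, in_seq = None, None, [], False
            elif in_seq:
                # Sequence lines are split into blocks of 10 residues by spaces.
                seq.append("".join(line.split()))
    return count
```

The resulting FASTA file can then be turned into a BLAST database with makeblastdb using `-dbtype prot`.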
I think you are best off getting the viral proteins from the link Mensur Dlakic had provided below for UniProt.
That said, you can also download them using the Download button on the NCBI page you linked above.