Hello,
I have a fasta file including 140 protein sequences from distinct viruses and I would like to identify which protein comes from which virus.
I am using a Linux cluster, BLAST is available as a cluster module, and the viruses and NCBI nr databases are stored in my own directory (correct me if I used the wrong terminology) on the cluster.
I set up my blastp as below:
blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt -num_threads 10 -max_target_seqs 1
and requested the resources from cluster as:
#PBS -l mem=64gb,nodes=10:ppn=1,walltime=10:00:00
It has been running for around 10 hours and I haven't got any results written to output.txt. I am wondering if there is a better way to set up RAM, nodes, or processes per node to speed up the BLASTp run. Thank you so much!
Here is the info about the Linux cluster:
66 compute nodes. Each node has two 14-core Intel processors (2.40GHz) sharing 128 GB of memory.
Have you downloaded all files for the nr database from NCBI and uncompressed them in your directory? If you take a single sequence and try to run a quick search against this database, do you see results in < 30 min (it will take a while to read the database files)?
Thank you so much for the reply. I did download and uncompress all the nr database files from NCBI in my directory. Following your suggestion and the suggestions below, I am running a -num_threads 10 blastp to search a single sequence against the full nr database, using mem=120gb,nodes=1:ppn=14. Hope this will run faster.
Also, do you have any suggested method to limit the protein sequence database to only the sequences that come from viruses?
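For the single-sequence test mentioned above, here is a minimal sketch of pulling the first record out of the query file. The function name `first_record` and the output name `single.fa` are made up for illustration; `proteins.fa` is the query file from the original post.

```shell
#!/bin/sh
# Sketch: extract the first FASTA record so a quick test search can be run.

# first_record: print only the first record of the FASTA file given as $1.
# awk stops reading as soon as it meets the second header line.
first_record() {
  awk '/^>/ { if (++n > 1) exit } { print }' "$1"
}

# Only write the test file if the query file is actually present.
if [ -f proteins.fa ]; then
  first_record proteins.fa > single.fa   # single.fa is a hypothetical name
fi
```

The resulting file can then replace `proteins.fa` after `-query` in the blastp command from the original post.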
You can use the -taxids 10239 option (10239 is the taxID for viruses) in your blastp command to limit your local search to viruses. This will require you to download the taxonomy file from the same location where you downloaded the nr indexes and keep it in the same directory as your BLAST indexes.
It's been over two hours since I started the single-sequence blastp against the full nr database, as I mentioned in my previous reply, and it hasn't completed.
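As a rough sketch of the `-taxids` suggestion, the command could look like the one below. It assumes a BLAST+ release that supports `-taxids` and a version 5 nr database; the `run` helper, the thread count, and the output name `virus_hits.txt` are illustrative choices, not something from this thread. By default the script only prints the command; set `DRY_RUN=0` to execute it.

```shell
#!/bin/sh
# Sketch: limit a local blastp search of nr to viruses (taxID 10239).

# run: print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

threads=14             # assumption: same value as ppn in the PBS request
out="virus_hits.txt"   # hypothetical output file name

run blastp -db nr -query proteins.fa -taxids 10239 \
    -outfmt 6 -max_target_seqs 1 -num_threads "$threads" -out "$out"
```

Depending on the BLAST+ version, a high-level taxID such as 10239 may first need to be expanded to species-level IDs, for example with `get_species_taxids.sh -t 10239 > viruses.txids` and then `-taxidlist viruses.txids` in place of `-taxids 10239`. I have not checked which versions require this.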
So, I am considering building a local database only including protein sequences from viruses.
How can I download all the virus protein data from NCBI?
I found a website here, but I am not sure how to download all the FASTA files from the command line or with any available tool.
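Whichever source the sequences come from, building and searching a virus-only database could look roughly like this once a single FASTA file is on disk. The names `viral_proteins.fasta`, `viral_prot`, and `viral_hits.txt` are placeholders, and the `run` helper only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: build a local protein database from a viral FASTA file and search it.
set -e   # stop if makeblastdb fails, so blastp never runs on a missing database

# run: print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

fasta="viral_proteins.fasta"   # hypothetical name of the downloaded file
db="viral_prot"                # hypothetical database name

# Format the FASTA file as a protein BLAST database.
run makeblastdb -in "$fasta" -dbtype prot -out "$db"

# Search all 140 queries against the much smaller viral database.
run blastp -db "$db" -query proteins.fa -outfmt 6 \
    -max_target_seqs 1 -num_threads 14 -out viral_hits.txt
```

A database of this size should load far faster than nr, so the memory request matters much less here.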
I think you are best off getting the viral proteins from the UniProt link Mensur Dlakic provided below.
That said, you can download them using the Download button on the NCBI page you linked above.
To load the nr DB in memory (especially with the newest binaries), you will need to request all the memory of a node (120GB should be OK to use the DB; the requested 64gb will likely not work).
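To put the memory advice together with the thread count: blastp threads all run on one node, so a request like nodes=10:ppn=1 leaves `-num_threads 10` with a single core to work with. Here is a small sketch that prints a job script in which `ppn` and `-num_threads` always match. The `make_job` function and the `module load blast` line are assumptions; check `module avail` for the real module name on your cluster.

```shell
#!/bin/sh
# Sketch: generate a single-node PBS job script for blastp.

# make_job: print a job script to stdout.
#   $1 = cores on one node (used for both ppn and -num_threads)
#   $2 = memory request, e.g. 120gb
make_job() {
  cat <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=$1,mem=$2,walltime=10:00:00
cd "\$PBS_O_WORKDIR"
module load blast   # hypothetical module name
blastp -db nr -query proteins.fa -outfmt 6 -max_target_seqs 1 -num_threads $1 -out output.txt
EOF
}

# Print the script for 14 cores and 120 GB on one node.
make_job 14 120gb
```

Something like `make_job 14 120gb > blast.pbs` followed by `qsub blast.pbs` would submit it; the walltime is copied from the original request and may need raising.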