Question

Speeding Up Psipred (Or Finding A Better Locally-Run Alternative)

11.4 years ago

Dave ▴ 80

I know there have already been a few general questions about Secondary Structure Prediction, but hopefully mine is a bit more specific.

I am currently using psipred as part of a pipeline where I need secondary-structure predictions, and I am wondering whether there is any way to speed up the process a little:

Running locally on a reasonably powerful CPU, psipred takes on average 5-10 minutes (mainly for blasting with psiblast+) on medium-sized sequences (fewer than 300 letters), whereas the online version returns a result in 1 or 2 minutes.

Furthermore, a lot of the queries involve known proteins, for which a 100% match exists, so I would expect the match to be returned on the first iteration and the search to be terminated (instead, I suspect the tool keeps running until the end and lesser matches are found).

I know I can lower the number of iterations, but it doesn't seem to bring the execution time down by much (and I would run the risk of not finding any homology, on the off-chance that there is no exact match).

Is there any other way I could modify the parameters given to psiblast in the runpsipred script, to speed things up (even at the expense of some precision)?

For example, is there a way to make it stop immediately if an exact match is found? (I reckon this should be sufficient for the rest of psipred's algorithm.)

For anybody who may be familiar with psiblast, but not psipred, here is the command currently used by the psipred pipeline:

$ncbidir/psiblast -db $dbname -query $tmproot.fasta -inclusion_ethresh 0.001 -out_pssm $tmproot.chk -num_iterations 3 -num_alignments 0 >& $tmproot.blast

The PSSM file (-out_pssm) is the important output for the rest of the algorithm.

Alternative question: can anybody recommend a tool with comparable prediction performance that can be run locally (from the command line) and is faster?

protein-structure • 3.7k views

updated 11.2 years ago by Hamish ★ 3.3k • written 11.4 years ago by Dave ▴ 80

score 2 · Accepted Answer · 2013-11-01

Given that PSI-BLAST command, the obvious first thing to try is to enable multi-threading:

 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote

Typically this is set to the number of available cores in the system, although some folks like to keep it a little lower to ensure other processes don't suffer, and others set it a little higher to compensate for controller/monitoring threads that require little CPU time. So you may want to experiment.
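
As a rough sketch of the threading suggestion, the snippet below picks a thread count one below the number of online cores and appends -num_threads to the command from the question. It is written in POSIX sh rather than the csh-style syntax of the original command (hence "> file 2>&1" instead of ">&"), and the values given to ncbidir, dbname and tmproot are placeholders, not settings taken from any real installation. The command is only printed, so it can be inspected before anything is changed in the script.

```shell
#!/bin/sh
# Placeholder values (assumptions); use the ones from your own runpsipred script.
ncbidir=/usr/local/bin
dbname=uniref90filt
tmproot=psitmp

# Count online cores; fall back to 1 if getconf cannot report them.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave one core free for other processes, but never go below one thread.
if [ "$cores" -gt 1 ]; then
    threads=$((cores - 1))
else
    threads=1
fi

# Same options as in the question, plus -num_threads.
psiblast_cmd="$ncbidir/psiblast -db $dbname -query $tmproot.fasta \
-inclusion_ethresh 0.001 -out_pssm $tmproot.chk \
-num_iterations 3 -num_alignments 0 -num_threads $threads"

# Print the command instead of running it.
echo "$psiblast_cmd > $tmproot.blast 2>&1"
```

Whether "cores minus one" is the best choice depends on what else runs on the machine, so it is worth timing a few values on a representative query.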

As always, the smaller the database, the faster the search. So while the default processed UniRef90 database works well, you may want to consider whether you can use a subset of it. For example, you could exclude all the sequences which are unlikely to occur in your organism (e.g. for searches with prokaryote sequences you may want to remove the UniRef90 sequences which only occur in eukaryotes).
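
One possible way to build such a subset is sketched below, assuming the database starts out as a FASTA file whose headers carry a TaxID= field (as UniRef FASTA headers do) and that you already have a list of the taxon IDs you want to keep. The function name filter_by_taxid, the demo records and the file names are all made up for illustration. Deciding which taxa count as prokaryotic or eukaryotic needs taxonomy information that is not in the headers, so that step is left out.

```shell
#!/bin/sh
# filter_by_taxid "ID ID ..." < in.fasta > out.fasta
# Keeps FASTA records whose header carries one of the listed TaxID= values.
filter_by_taxid() {
    awk -v ids="$1" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) keep[a[i]] = 1 }
        /^>/ {
            wanted = 0
            if (match($0, /TaxID=[0-9]+/)) {
                id = substr($0, RSTART + 6, RLENGTH - 6)   # drop "TaxID="
                if (id in keep) wanted = 1
            }
        }
        wanted { print }
    '
}

# Tiny made-up demo input (identifiers and sequences are invented).
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>UniRef90_DEMO1 Demo protein A n=1 Tax=Escherichia coli TaxID=562 RepID=DEMO1_ECOLX
MKTAYIAKQR
QISFVKSHFS
>UniRef90_DEMO2 Demo protein B n=1 Tax=Homo sapiens TaxID=9606 RepID=DEMO2_HUMAN
MGSSHHHHHH
>UniRef90_DEMO3 Demo protein C n=1 Tax=Bacillus subtilis TaxID=1423 RepID=DEMO3_BACSU
MSEQNNTEMT
EOF

# Keep only the two bacterial TaxIDs in this demo.
filter_by_taxid "562 1423" < "$workdir/demo.fasta" > "$workdir/subset.fasta"
kept=$(grep -c '^>' "$workdir/subset.fasta")
echo "kept $kept records"

# On real data, the subset would then be formatted as a BLAST database, e.g.:
#   makeblastdb -in subset.fasta -dbtype prot -out subset_db
```

The resulting database name would then replace the value of dbname in the psiblast command. Keep in mind that a smaller database also means fewer sequences can be recruited into the PSSM, so it is worth checking prediction quality on a few known proteins.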

Due to the way the PSSM produced by PSI-BLAST is used, you do not want to decrease the number of iterations. The issue is not failure to find homology (PSI-BLAST will exit immediately if an iteration finds no hits which can be included in the PSSM), but that the PSSM is more informative if it includes more sequences (each iteration recruits sequences into the PSSM).