For my project I need to break a genome into parts and run BLAST on each part. Given the size of the genome file, that means at least 4000 BLAST calls, which would take a lot of time. I'm using NCBIQBlastService to run the alignments remotely, and from what I measured each request takes about 20 seconds, so all 4000 would take around a day. Is there any way to do this faster? Any suggestion would be really appreciated.
BTW, this might help too:
http://biojava.org/wiki/BioJava:CookBook3:NCBIQBlastService
There are multiple ways to speed up a BLAST analysis. For a start, running BLAST locally will be faster than sending all the data back and forth to NCBI. Can you run BLAT instead?
I guess that if he runs BLAT he will miss a lot of homologous sequences he might be interested in. BLAST is more sensitive than BLAT because it uses a smaller window size of 3 when it looks for homologous sequences, whereas BLAT uses a longer window size. I usually don't prefer BLAT over BLAST unless I am looking for highly similar sequences or doing mapping. Even BLAT will take quite a long time unless you run it in parallel, which means dividing your sequences into many segments, running BLAT on each, and finally putting the outputs back together.
+1, and I absolutely agree about BLAT. For a lot of what I do, BLAT suffices and saves a little bit of time. When I have to identify millions of environmental sequences against an extremely large database, you're not exactly going to get high levels of confidence anyway.
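The split-and-merge step described above can be sketched as follows. This is only a minimal illustration in plain Python: the function names (`read_fasta`, `split_into_chunks`, `to_fasta`) are made up for this example, and it assumes the input is ordinary multi-record FASTA text. Each chunk would be written to its own file and searched separately, and the per-chunk outputs concatenated afterwards.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def split_into_chunks(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def to_fasta(records):
    """Serialise (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Round-robin assignment is used here only because it is simple; if the sequences differ a lot in length, balancing the chunks by total sequence length would probably spread the work more evenly.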
Perhaps you could send your searches in batches of 25 jobs at a time to EMBL-EBI's NCBI BLAST REST-based service. At a 25:1 ratio, a set of jobs that takes a day would take a little less than an hour (all other things being equal).
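A rough sketch of the "25 jobs at one time" idea, assuming a thread pool is an acceptable way to keep the jobs in flight. The `submit_job` argument is a placeholder for whatever function actually submits a query to the remote service and waits for its result; `fake_submit` below is a stand-in for illustration only and does not call any real web service.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(queries, submit_job, max_concurrent=25):
    """Call submit_job(query) for every query, at most max_concurrent at once.

    Results are returned in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(submit_job, queries))


def fake_submit(query):
    # Hypothetical stand-in: a real version would submit the query to the
    # remote BLAST service, poll until the job finishes, and return the result.
    return f"result for {query}"


if __name__ == "__main__":
    print(run_in_parallel(["seq1", "seq2", "seq3"], fake_submit))
```

Threads fit this case because each job spends nearly all of its time waiting on the network. The value 25 is taken from the suggestion above; check the service's current usage policy before relying on it.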
NCBI's BLAST web services (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo) have different usage restrictions, one of which limits the frequency of requests. Given an average runtime of 20 s per job, a limit of one request per 3 s suggests that about 6 jobs could be running in parallel. If the query sequences can be batched so that each job performs 10 searches, then the average job time would increase to about 200 s, and since it is the request frequency that is limited, that translates into significant parallelism.
All that said, the databases available at EMBL-EBI are not the same as those available from NCBI, so the database choice may force the use of one particular service.
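The rate-limit arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a prediction: it assumes that job time grows linearly with the number of searches per job, that the server really runs every accepted job concurrently, and that one request every 3 s is the only limit. The function name and the dictionary keys are made up for this example.

```python
def estimate(n_queries, seconds_per_search, seconds_between_requests,
             searches_per_job=1):
    """Estimate throughput under a fixed request-frequency limit."""
    # Number of jobs, rounding up so that leftover queries get their own job.
    n_jobs = -(-n_queries // searches_per_job)
    # Assumption: a job with k searches takes k times as long as one search.
    job_seconds = seconds_per_search * searches_per_job
    return {
        "n_jobs": n_jobs,
        # Running every search one after another.
        "serial_hours": n_queries * seconds_per_search / 3600,
        # Jobs that can be running at once if a new one starts every interval.
        "jobs_in_flight": job_seconds / seconds_between_requests,
        # Time to submit all jobs at the allowed rate, plus the last job's runtime.
        "wall_seconds": (n_jobs - 1) * seconds_between_requests + job_seconds,
    }


if __name__ == "__main__":
    print(estimate(4000, 20, 3))                       # one search per job
    print(estimate(4000, 20, 3, searches_per_job=10))  # ten searches per job
```

With the numbers from this thread, one search per job gives roughly 6 to 7 jobs in flight, and ten searches per job gives roughly 66, which is where the "significant parallelism" comes from.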
As suggested by Josh Herr, you can run BLAST on your own computer to perform large numbers of searches faster.
From my perspective, the easiest way to do that would be to use a computer running Linux (or Mac OS X), install BLAST and the desired databases, and launch the searches there.
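For illustration, a local run with the NCBI BLAST+ command-line tools might be driven from Python roughly like this. It is a sketch under assumptions: BLAST+ is installed and on the PATH, the search is nucleotide against nucleotide (`blastn`), and the file and database names in the usage note are hypothetical. The helper functions only build the argument lists; `run` is the part that actually launches a program.

```python
import subprocess


def makeblastdb_cmd(fasta_path, db_name, dbtype="nucl"):
    """Build the command that formats a FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blast_cmd(query_path, db_name, out_path, program="blastn", threads=4):
    """Build a BLAST+ search command with tabular output (-outfmt 6)."""
    return [
        program,
        "-query", query_path,
        "-db", db_name,
        "-out", out_path,
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


def run(cmd):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(cmd, check=True)
```

Usage would look like `run(makeblastdb_cmd("reference.fasta", "refdb"))` followed by `run(blast_cmd("parts.fasta", "refdb", "hits.tsv"))`. A single multi-FASTA query file holding all the genome parts can be searched in one call, which avoids starting 4000 separate processes.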
If you have no experience with UNIX-like systems, then you would probably need help from a person that is knowledgeable about this.
If you tell us a bit more about your experience, the computer you use or could use in the lab (installed systems, number of CPUs), we may be able to help you some more.
This thread has been around for a long time, but I would like to add a tip for those who run ncbi-blast+ on their own computers: if you place the database on a fast storage device (e.g., an SSD), you will get a *dramatic* gain in speed!
I didn't do a serious benchmark, but I estimated a 3-10 fold increase in speed. I also think memory would make a key contribution if it is large enough (I guess 128 GB is necessary for the whole nr) and if I could somehow load the whole database into memory.