I am running a web service where users can query a database of 10M+ proteins by sequence similarity. However, BLAST is too slow for this (several minutes per query).
Can you recommend some faster alternatives? BLAT is much faster, but loading all the proteins for every query is inefficient...
Or maybe some blastp tweaking?
I can sacrifice sensitivity, as I'm looking for very similar matches (>90% identity). It would be great if I could retrieve protein sequences from the db easily, so I don't have to store each sequence twice (like fastacmd in BLAST).
Note, I'm bound to 1 CPU. Surprisingly, increasing the word size (-W 7) didn't improve blastp performance.
UPDATE
In the end, I came up with my own solution based on k-mers stored in MySQL and BLATing only a subset of proteins. It's able to find similar sequences (I haven't tested that rigorously, but those >50% similar are captured easily) in a database of 13M sequences for a single query in seconds. In contrast, BLASTp would take several minutes (12-15 min), and other solutions like LAST or Vmatch didn't go below 1 min.
Let me know if anyone is interested in that. It's still quite simplistic, but someone may benefit :)
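A minimal sketch of the k-mer prefilter idea described above, not the actual implementation. It assumes Python's built-in sqlite3 as a stand-in for MySQL, and the k-mer length `K`, the `min_shared` cutoff, the table name `kmer_index` and the toy sequences are all hypothetical choices made for illustration:

```python
import sqlite3
from collections import Counter

K = 5  # k-mer length; an assumed value, not taken from the post


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(proteins, k=K):
    """Store (kmer, protein_id) pairs in an in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE kmer_index (kmer TEXT, protein_id TEXT)")
    rows = [(km, pid) for pid, seq in proteins.items() for km in kmers(seq, k)]
    con.executemany("INSERT INTO kmer_index VALUES (?, ?)", rows)
    con.execute("CREATE INDEX idx_kmer ON kmer_index (kmer)")
    return con


def candidates(con, query, k=K, min_shared=0.5):
    """Return ids of proteins sharing at least min_shared of the query's k-mers."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []
    hits = Counter()
    for km in query_kmers:
        sql = "SELECT protein_id FROM kmer_index WHERE kmer = ?"
        for (pid,) in con.execute(sql, (km,)):
            hits[pid] += 1
    threshold = min_shared * len(query_kmers)
    # Best candidates first; only these would be passed on to the aligner.
    return [pid for pid, n in hits.most_common() if n >= threshold]


# Toy data, made up for the example.
proteins = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRA",
    "P3": "GSSGSSGLLPPWWYYHHEEDDK",
}
con = build_index(proteins)
print(candidates(con, "MKTAYIAKQRQISFVKSHFSRQ"))
```

Only the ids returned by `candidates` would then be aligned with BLAT, so the expensive alignment step sees a small subset of the database instead of all 13M sequences. A higher `min_shared` makes the filter faster but less sensitive.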
Why do you need to load all proteins every time with BLAT? Since you're running a webservice, why not run a BLAT server?
Is it possible to run a BLAT server for proteins?
I think so. Isn't that how the UCSC blat works? http://genome.ucsc.edu/FAQ/FAQblat.html#blat5
Yep, but that's for DNA, not for protein... I cannot run a server for amino acids :/
You can blat amino acid sequences with the same ease on UCSC. Hence my belief that it is possible to run a protein blat server.
Then I'd appreciate it if you could suggest how to do it. I have tried gfServer (BLAT34) but cannot make it work with proteins, as it requires .2bit (handles only DNA) or .nib (one sequence per file).