As the other contributors mentioned, you can get quite a long way with high-memory, multi-core hardware: roughly 5 to 10 concurrent searches, depending on the range and size of the sequence databases. Adding more memory or cores will help (vertical scaling), but you'll see diminishing returns. For spiky or higher levels of usage you're going to need to start distributing the load across multiple boxes (horizontal scaling).
For a more scalable system, consider provisioning a collection of servers as a processing cluster. Best practices for batch processing apply here: the nodes should be 'share nothing', with tasks distributed via message queues (see the sketch below).
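To make that concrete, here is a minimal sketch of a share-nothing worker node, assuming a Redis list as the message queue and BLAST+ installed locally on each node. The queue name, job format and host are illustrative, not part of any particular setup.

```python
# Minimal share-nothing worker: each node runs this loop independently,
# pulling BLAST jobs from a shared queue. Queue name, job format and
# Redis location are assumptions for illustration.
import json
import subprocess

import redis

QUEUE = "blast_jobs"
r = redis.Redis(host="queue-host", port=6379)

while True:
    item = r.blpop(QUEUE, timeout=30)   # block until a job arrives
    if item is None:
        continue                        # idle; loop and wait again
    job = json.loads(item[1])           # e.g. {"query": ..., "db": ..., "out": ...}
    subprocess.run(
        ["blastp",
         "-query", job["query"],
         "-db", job["db"],
         "-out", job["out"],
         "-num_threads", "4"],
        check=True,
    )
```

Each node only knows about the queue, so adding capacity is just a matter of starting another worker.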
For BLAST, it can be more cost effective to run a larger number of less powerful servers.
A few other topics to consider to optimise such a system for throughput and cost:
Shard by database and usage
You can provide different queues to route searches to specific groups of servers. High-use or large datasets can occupy their own dedicated, heavyweight infrastructure, whilst lower-usage or smaller datasets can happily coexist on smaller, cheaper hardware. Monitor usage, response times and latency to gauge the best bang for your buck (a routing sketch follows below).
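As an illustration of the routing idea, the submission side might pick a queue based on the target database; the queue names and the set of 'heavy' databases below are assumptions made for the sake of the example.

```python
# Sketch of routing a search to a queue based on the target database,
# so heavyweight databases land on dedicated hardware. Queue names and
# groupings are invented for illustration.
import json

import redis

r = redis.Redis(host="queue-host", port=6379)

# Big, busy databases get their own queue (and their own server group);
# everything else shares the "small" queue.
HEAVY_DBS = {"nr", "nt", "refseq_protein"}

def submit(query_path, db, out_path):
    queue = "blast_jobs_heavy" if db in HEAVY_DBS else "blast_jobs_small"
    job = {"query": query_path, "db": db, "out": out_path}
    r.rpush(queue, json.dumps(job))
```

Workers on each server group then consume only their own queue.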
Queue-aware compute
You could investigate the possibility of running the searches against elastic compute with services such as EC2 (*). With message queues and horizontal scaling, running on utility computing allows you to increase your capacity under increased demand, and reduce it as demand subsides (evenings, weekends, etc.).
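A rough sketch of what queue-aware scaling could look like on EC2, assuming an SQS queue and an Auto Scaling group; the queue URL, group name and jobs-per-instance ratio are placeholders, not a prescribed configuration.

```python
# Poll the queue depth and resize an EC2 Auto Scaling group to match.
# Run on a schedule (e.g. every few minutes).
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/blast-jobs"
GROUP = "blast-workers"
JOBS_PER_INSTANCE = 10
MAX_INSTANCES = 20

def rescale():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    # Ceiling division: enough instances to cover the backlog, capped.
    desired = min(MAX_INSTANCES, max(1, -(-backlog // JOBS_PER_INSTANCE)))
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=GROUP,
        DesiredCapacity=desired,
    )
```

Run something like this periodically and capacity will roughly track the backlog, shrinking again during quiet evenings and weekends.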
Caching
Reduce the overhead of repeat submissions (very common with BLAST!) by caching the input parameters and search results in a database. If a user repeats a search, just return the cached result immediately.
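One way to key such a cache is to hash the query sequence together with the search parameters; SQLite and the schema here are just for illustration, and any database or key-value store would do.

```python
# Results cache keyed on the query sequence plus parameters.
import hashlib
import json
import sqlite3

db = sqlite3.connect("blast_cache.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, result TEXT)")

def cache_key(sequence, params):
    # Sort the parameters so logically identical searches hash the same.
    payload = sequence + json.dumps(params, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_cached(sequence, params):
    row = db.execute("SELECT result FROM cache WHERE key = ?",
                     (cache_key(sequence, params),)).fetchone()
    return row[0] if row else None

def store(sequence, params, result):
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
               (cache_key(sequence, params), result))
    db.commit()
```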
Friendly wrappers
A bit OT, but important for the uptake of a distributed system: make it easy for your users to submit their searches. Depending on the technical knowledge of your users, grid-engine-style tools can help. However, for short tasks that are submitted often (such as BLAST, format exchange, Radar, Needle, etc.), some users may find them heavyweight. Instead, you can hide a lot of this complexity by providing a thin wrapper that submits their task to a queue and polls or awaits notification that the task has completed before returning the results locally (see the sketch below).
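A thin wrapper along those lines might look like this: submit the job to the queue, then poll until the result appears, so the call behaves like a local BLAST run. The queue and result-key conventions are assumptions to keep the sketch self-contained.

```python
# Client-side wrapper: submit to the queue, poll for the result.
import json
import time
import uuid

import redis

r = redis.Redis(host="queue-host", port=6379)

def blast(query_path, db, poll_interval=5, timeout=3600):
    job_id = str(uuid.uuid4())
    r.rpush("blast_jobs", json.dumps(
        {"id": job_id, "query": query_path, "db": db}))
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = r.get("blast_result:" + job_id)   # worker writes here when done
        if result is not None:
            return result.decode()
        time.sleep(poll_interval)
    raise TimeoutError("BLAST job %s did not finish in time" % job_id)
```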
So, in answer to your question: the horsepower of the physical hardware is only one factor in determining throughput and concurrency. There are a number of architectural factors that can help you scale too.
(*) Heads up: I work at Amazon.
@Alastair, thanks for this advice. Not being a hardware person, I am a bit puzzled by your '24 threads on a dual-processor machine'. Does it make sense to run BLAST with more threads than there are processor cores? Or do you have 12-core processors?
Sounds like a Westmere system with 24 execution threads. I agree that 50 is not that many users. How many concurrent processes do you tend to have?
Deepak, could you please offer some explanation for a biologist-bioinformatician with limited hardware expertise? I looked up Westmere on Wikipedia but only found references to 2-8 core processors. Can you run more than one thread per core, and does it make sense?
Each CPU has 6 physical cores, which appear as 12 logical cores per CPU. The technology behind this is Intel's "hyper-threading", and AMD has something similar (found on its latest chips). The operating system sees 24 CPUs and runs jobs on each. It is much faster than an equivalent 12-node cluster from 6 years ago.
The most important question really is the type of data that is going to be run through BLAST. People running a handful of queries sporadically are very different from people running large datasets.