Don't go anywhere near 126 thousand. The operating system's maximum thread count covers everything running on the machine, including the threads that handle the hard drives, networking, and everything else. Run "ps aux" to see how many things are already active!
Choosing a thread count for a program depends on a dozen factors. Usually it is worth adding threads when you have CPU cores sitting idle while the single-threaded version of psiblast is running one core at 100%. Try two threads and see whether you can get two cores to 100%. At most, you'd benefit from number of threads = number of CPU cores, unless there are other non-CPU constraints.
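As a rough starting point for that upper bound, here is a minimal sketch that asks Python how many CPUs the OS reports. The function name suggested_threads and the reserve parameter are my own illustration, not anything psiblast provides. Note that os.cpu_count() counts logical CPUs, so on a machine with hyperthreading it may be higher than the number of physical cores.

```python
import os

def suggested_threads(reserve=0):
    """Return a starting thread count: reported CPUs minus any held back."""
    cores = os.cpu_count() or 1  # cpu_count() can return None on some systems
    return max(1, cores - reserve)  # never suggest fewer than one thread

# Example: leave one CPU free for the rest of the system.
print(suggested_threads(reserve=1))
```

Treat the result as a ceiling to test against, not as the answer.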
It depends on the system architecture and the software architecture. In my experience, a dual-processor, quad-core machine (8 threads max) can do something like compute pi in eight threads efficiently. If, instead of computing pi, you were reading big files from disk and had only the one disk, you'd see any more than 1-2 threads slowing each other down as they wait for data. Six threads accessing the hard drive will run slower than three. The ratio depends on how much of each resource each thread needs.
Something like sequence alignment (BWA, Bowtie, etc.) needs to read a little data and then crunch a lot of CPU, so spinning up all 8 threads is fine. They'll wait their turn for data, drift out of sync with each other, and end up at 100% disk utilization and almost 8x100% CPU.
If your processes are, for example, requesting data from a web server with unknown delays, then you could run a dozen or a hundred threads, and they'll wait and go when they can.
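To illustrate the waiting case, here is a small sketch in which time.sleep stands in for a slow web server. The names fake_request and run_batch and the 0.05 second delay are made up for the example; no real network call is made.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(delay):
    """Pretend to fetch something: the thread just waits, using no CPU."""
    time.sleep(delay)
    return delay

def run_batch(n_requests, n_threads, delay=0.05):
    """Run n_requests simulated requests on n_threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_request, [delay] * n_requests))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    for n in (1, 4, 16):
        elapsed, _ = run_batch(16, n)
        print(f"{n:2d} threads: {elapsed:.2f} s")
```

Because the threads spend nearly all their time waiting, adding more of them keeps shortening the wall-clock time well past the number of CPU cores, which is the opposite of the disk-bound case above.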
It also depends on the motherboard architecture. You probably have only 2 or 4 channels to RAM, so more than 4 threads that each need high-volume access to RAM will also slow each other down.
The key word is contention. The usual solution is trial and error: measure the speed at various thread settings and choose the optimum for your task. I don't know about psiblast specifically, but it probably needs fast access to RAM, and you'll see it get slower per thread after 5 threads. Maybe the optimum is 6-8, but I guarantee that trying 1,000 simultaneously will not be faster than 10.
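The trial and error can be scripted. Below is a minimal sketch of a timing harness; time_thread_settings and fastest are names I made up, and run is any function you supply that does the job with n threads. The commented-out psiblast call is only a guess at how you might wire it in: BLAST+ tools take a -num_threads option, but the file and database names are placeholders, so check psiblast -help for your version.

```python
import time

def time_thread_settings(run, thread_counts, repeats=1):
    """Time run(n) for each n and return {n: best wall-clock seconds}."""
    timings = {}
    for n in thread_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(n)
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    return timings

def fastest(timings):
    """Return the thread count with the shortest measured time."""
    return min(timings, key=timings.get)

# Hypothetical usage (placeholder paths, not from the original post):
# import subprocess
# def run(n):
#     subprocess.run(["psiblast", "-query", "query.fa", "-db", "mydb",
#                     "-num_threads", str(n), "-out", "/dev/null"], check=True)
# print(time_thread_settings(run, [1, 2, 4, 6, 8]))
```

Use a query that is representative of your real workload, since the best setting can shift with the input.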
Finally, of course, there is the level of parallelism available to the algorithm. Sometimes BLAST has to work sequentially and will ignore your thread setting for some parts of the job, so the optimal setting can vary with the reference genome and query sequences.
Thank you for the very comprehensive answer. I've tried using 4 threads, and looking at CPU usage in htop, each core is at around 90% (the top command shows this as 360%, which confused me at first!).
There are a lot of things to consider, and I hope your answer helps others, not just me.
Yeah, I think it's fun to think about. What does 90% CPU mean? It would be 100% if it weren't waiting for something like disk or memory. Each additional thread will cause a little more contention and reduce your 90% a little more. If you have more than 4 cores available, don't stop at 360%; try to push it as high as possible. 6 threads at 70% is more throughput than 4 threads at 90%, but not by much. At some point, if that's not enough, other things have to change. For example, if you have two hard disks and can read from one while writing to the other, that will reduce contention and boost your CPU utilization.
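The arithmetic behind that comparison, as a tiny sketch. The function name total_cpu_percent is mine, and it assumes throughput scales with total CPU time, which is a simplification.

```python
def total_cpu_percent(threads, percent_per_thread):
    """Summed CPU usage across threads, the figure top reports for the process."""
    return threads * percent_per_thread

four = total_cpu_percent(4, 90)  # 360
six = total_cpu_percent(6, 70)   # 420
print(four, six, f"gain: {six / four - 1:.0%}")
```

So six threads at 70% works out to roughly one sixth more CPU time than four threads at 90%, at the cost of two more cores.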