Hi,
I'm wondering:
If I want to blast a large number of protein sequences against the ncbi-nr database (say for example in order to analyse the species and function composition with MEGAN), which of these options would be more sensible:
A.) to split the queries into subsets and run more jobs in parallel (using fewer threads each)
or
B.) to blast all queries in one job but using more threads
or doesn't it matter at all which of the two I choose?
I was under the impression that simply using twice as many threads should have almost exactly the same effect as splitting the query data in two subsets and running them in parallel. Is this assumption wrong?
Typically, threaded programs can share memory space, but the threads may contend for resources that only one thread can use at a time (for example, when updating shared values), and there may be other overheads associated with threading.
When running separate processes, memory is not shared and there is no contention between the programs (other than competition for the overall computational resources), but each process may load a separate copy of the same information.
The exact overheads are typically not easy to estimate, but, for example, running ten independent blast processes will almost certainly use far more memory than running one blast process with ten threads.
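To make the memory point concrete, here is a rough sketch of the two options. The commented-out lines assume the BLAST+ `blastp` command with its standard flags (`-query`, `-db`, `-num_threads`, `-out`); the database size and job count below are made-up illustrative figures, not measurements:

```shell
# Option B: one job, many threads -- all threads share one copy of nr:
#   blastp -query all_queries.fasta -db nr -num_threads 12 -out all.blastout
#
# Option A: several jobs, fewer threads each -- nr is loaded once per process:
#   blastp -query chunk_0.fasta -db nr -num_threads 6 -out chunk_0.blastout &
#   blastp -query chunk_1.fasta -db nr -num_threads 6 -out chunk_1.blastout &
#   wait

# Back-of-the-envelope memory comparison (db_gb is a made-up example size):
db_gb=60      # hypothetical size of the formatted nr database in GB
jobs=2        # number of parallel blast processes in option A
echo "option B: ~${db_gb} GB resident"            # one copy of the database
echo "option A: ~$((db_gb * jobs)) GB resident"   # one copy per process
```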
Yes, the thought that splitting the data and running multiple processes would mean loading the complete ncbi-nr database into memory multiple times is what made me prefer running one process with more threads over running more processes.
It DOES seem that my blast is in fact consistently using all 12 threads that I assigned to it. It usually appears as "sleeping" in "top" (even though it is using 1200% CPU), but it nonetheless produces output, so it is running.
However, it is taking so impossibly long (it took over a week to blast 5000 sequences against nr) that I will have to consider splitting the input data and accepting the extra memory use.
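If it helps anyone else, splitting a multi-FASTA query file into roughly equal chunks can be done with a one-line awk command that starts a new output file at every header line, cycling through n files (filenames and the example sequences here are just for illustration):

```shell
# Make a tiny example query file (4 records):
printf '>seq1\nMKT\n>seq2\nGGA\n>seq3\nLLV\n>seq4\nAAA\n' > queries.fasta

# Distribute records round-robin over n chunk files:
awk -v n=2 '/^>/ { f = sprintf("chunk_%d.fasta", c++ % n) } { print > f }' queries.fasta

# Each chunk could then be run as a separate job, e.g.:
#   for f in chunk_*.fasta; do
#     blastp -query "$f" -db nr -num_threads 6 -out "$f.blastout" &
#   done
#   wait
```

Records are assigned round-robin (seq1 and seq3 to chunk_0, seq2 and seq4 to chunk_1), which keeps the chunks close to equal in record count even when the input isn't sorted.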