We're trying to use bowtie2 to find exact matches to short DNA sequences in a complete genome. We may search for hundreds of thousands of short sequences at a time. At first, it seemed the way to do this was to spawn a bunch of threads and run lots of separate queries in parallel. However, we're finding that on a 30-core machine hitting a single index on a local disk, using more than 3 threads results in a significant slowdown.
We are using the '--mm' option which, according to the manual, tells bowtie2 to use memory-mapped I/O so that many bowtie2 processes can share the index. Used interactively for a single query, --mm resulted in a noticeable speedup. However, I'm wondering if we're running into a situation where the shared, memory-mapped I/O requires some mutex coordination that causes things to bog down when hit by multiple threads. In that case, could we increase throughput by taking a hit on each individual query but utilizing our full 30 cores?
Yes. In our standard scheme this doesn't help, because we handle each sequence with a separate query (and therefore a heavyweight process). Adding threads that way doesn't seem to help because there isn't enough work in processing a single query to justify them. We could try handling multiple queries in a single process, in which case '-p' might help, but that works best for batch rather than on-line processing, and we need to be able to do both quickly.
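For the batch case, a rough sketch of what "multiple queries in a single process" could look like (file names and the index name `genome_index` are placeholders, and the bowtie2 line assumes bowtie2 is installed): wrap the raw sequences as FASTA records, then let one bowtie2 process with `-p` do the internal threading instead of spawning one process per query.

```shell
# Assumes queries.txt holds one raw DNA sequence per line (placeholder file).
# Wrap each sequence as a FASTA record so bowtie2 can take them in one batch.
awk '{ printf(">q%d\n%s\n", NR, $0) }' queries.txt > queries.fa
cat queries.fa

# One multithreaded process instead of thousands of single-query runs
# (commented out here since it needs an actual index; names are placeholders):
# bowtie2 -p 30 --mm -x genome_index -f -U queries.fa -S hits.sam
```

This only addresses the batch side, not the on-line side, so it's a partial fix at best.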
It's not documented anywhere, but the bowtie2 source code suggests that you should be able to compile it easily enough as a library, so perhaps you can integrate it directly into your current pipeline that way.
BTW, the other possibility would just be to use a FIFO. Whether that works will depend on the details of your pipeline.
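A minimal sketch of the FIFO idea (all names are placeholders, and the bowtie2 line assumes bowtie2 is on the PATH): a long-running bowtie2 reads from a named pipe, so on-line queries can be streamed in without restarting the process or re-mapping the index.

```shell
# Create a named pipe to act as a persistent read stream.
mkfifo reads.fifo

# A single long-running bowtie2 would consume records as they arrive
# (commented out here since it needs an actual index; names are placeholders):
# bowtie2 --mm -x genome_index -f -U reads.fifo -S hits.sam &

# Each incoming query is written to the pipe as a FASTA record.
printf '>q1\nACGTACGTACGTACGT\n' > reads.fifo &

# cat stands in for the bowtie2 reader end of the pipe in this sketch.
cat reads.fifo
rm reads.fifo
```

The caveat is that bowtie2 treats end-of-input as "done," so the writer side would need to keep the pipe open for as long as queries keep arriving.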