Hi All,
The team I work for is considering some new hardware that we think has the potential to show good performance on multithreaded bioinformatics/genomics software. To test this, I want to put together some performance benchmarks and am considering which codes to include. My inclination is to test some of the standard alignment tools like:
- bwa mem
- minimap2
- bowtie2
- ncbi blast+
- diamond
- STAR
All of these are multithreaded, so they would let us see how well they scale to several or many cores. However, they all involve a non-trivial amount of I/O, which could muddy a CPU-scaling measurement.
If you were performance benchmarking a new system using bioinformatics tools, what codes would you test?
Thanks! Dave
So you are not planning to configure a corresponding upgrade for storage side?
I am curious as to what this performance benchmark is supposed to demonstrate? Satisfaction that something now runs in 2 mins that used to take 20? Justification and/or bragging rights for having acquired a speedy system?
Seconded. Are you looking for fast performance of a single task/pipeline, or non-degrading performance when 5-15 people in the lab are all running different pipelines on the server, possibly targeting the same filesystem?
These are good questions. I imagine we will be getting a storage upgrade as well, but the initial tests will run in a vendor-controlled instance, which won't use the storage system the hardware would eventually run on (if we purchase it). The goal is not bragging rights, since we have not actually purchased anything yet; we want to assess the performance of some of the applications that might eventually run on the hardware.
One of the claimed selling points of the system is that it seems to show unusually good scaling for threaded, CPU-bound codes. I was hoping to test some bioinformatics codes to see if they show the good scaling that some others (reportedly) have shown, but since the storage system we'll eventually use is an open question, I didn't want to have to worry about too much I/O during the tests, if I can help it. That may not be a sensible idea, though.
I think we're mostly interested in testing the scaling of a single multi-threaded task right now. Finding a popular bioinformatics application to test that would be useful for us. I can always launch a bunch of nf-core pipelines if we want to test the throughput performance of lots of small applications running at the same time.
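For the single-task scaling test, a minimal sketch of the bookkeeping might look like the following. It assumes you time the same command at several thread counts and then compare each run against the smallest thread count; the function names (time_command, scaling_table) are made up for illustration and are not part of any tool mentioned in this thread.

```python
import subprocess
import time


def time_command(argv):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # Discard the tool's own output so terminal writes do not affect timing.
    subprocess.run(
        argv,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def scaling_table(wall_times):
    """Turn {thread_count: wall_seconds} into scaling rows.

    Each row is (threads, seconds, speedup, efficiency), where speedup is
    relative to the run with the fewest threads and efficiency is the
    speedup divided by the increase in thread count (1.0 = ideal scaling).
    """
    base_threads = min(wall_times)
    base_time = wall_times[base_threads]
    rows = []
    for threads in sorted(wall_times):
        seconds = wall_times[threads]
        speedup = base_time / seconds
        efficiency = speedup * base_threads / threads
        rows.append((threads, seconds, speedup, efficiency))
    return rows
```

In practice you would probably repeat each thread count a few times and keep the median, and do one untimed warm-up run first so the input files are already in the page cache. That keeps the storage system mostly out of the measurement.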
You can try pseudoalignment (like the kallisto quant command in kallisto) because it doesn't write anything to disk (no BAM files, no temporary files, etc.) while processing data.
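As a rough sketch of how a thread sweep with kallisto quant could be set up, the helper below only builds the command lines, one per thread count, each with its own output directory. The -i (index), -o (output directory), and -t (threads) options are kallisto quant options; the helper name and all paths are hypothetical, and the sketch assumes paired-end reads (single-end input needs additional options).

```python
from pathlib import Path


def kallisto_quant_commands(index, reads, out_root, thread_counts):
    """Build one `kallisto quant` argv list per thread count.

    index         -- path to a prebuilt kallisto index (hypothetical path)
    reads         -- list of FASTQ paths, assumed paired-end here
    out_root      -- parent directory; each run gets its own subdirectory
    thread_counts -- iterable of thread counts to sweep over
    """
    commands = {}
    for threads in thread_counts:
        # Separate output directories so runs do not overwrite each other.
        out_dir = Path(out_root) / f"threads_{threads}"
        commands[threads] = [
            "kallisto", "quant",
            "-i", str(index),
            "-o", str(out_dir),
            "-t", str(threads),
            *[str(r) for r in reads],
        ]
    return commands
```

Each argv list can then be passed to subprocess.run and timed. Note that kallisto still writes its abundance files to the output directory when the run finishes, so the run is not entirely free of I/O, but that should be small compared with writing BAM output.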