Disk I/O Bound Genome Analysis Application
13.2 years ago by User 2724 ▴ 30

My institute has just invested in a new cluster equipped with a Lustre file system. We would like to test disk I/O performance with some popular bioinformatics applications such as genome alignment/mapping, file format conversion (SAM to BAM), SNP calling, etc. However, I am a computer science person without much knowledge of bioinformatics applications. Could someone suggest some genome analysis applications that generate a lot of disk I/O in a short time? Thanks a lot.

13.2 years ago

Assuming that you have a batch system (SGE, PBS, Torque, etc.), simply submitting a bunch of read/write-heavy jobs, such as SAM-to-BAM conversion, is probably a useful test. On our cluster, concurrent writes are the major performance bottleneck.
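A minimal sketch of that idea in Python, assuming an SGE-style scheduler with qsub, samtools on the PATH, and a directory of test SAM files already staged on the Lustre mount (the paths and job names below are placeholders):

```python
#!/usr/bin/env python
"""Submit many concurrent SAM-to-BAM conversion jobs to stress shared-filesystem writes.

Assumes an SGE-style scheduler (qsub) and samtools on the PATH; adjust the paths
and scheduler flags to your site.
"""
import glob
import os
import subprocess
import sys

SAM_DIR = sys.argv[1] if len(sys.argv) > 1 else "/lustre/benchmark/sam"  # placeholder
OUT_DIR = os.path.join(SAM_DIR, "bam_out")
os.makedirs(OUT_DIR, exist_ok=True)

for sam in glob.glob(os.path.join(SAM_DIR, "*.sam")):
    name = os.path.splitext(os.path.basename(sam))[0]
    bam = os.path.join(OUT_DIR, name + ".bam")
    # Each job reads a SAM file and writes a compressed BAM back to the shared
    # filesystem, so many jobs running at once exercise concurrent writes.
    script = "#!/bin/sh\nsamtools view -bS {sam} > {bam}\n".format(sam=sam, bam=bam)
    job_path = os.path.join(OUT_DIR, name + ".sh")
    with open(job_path, "w") as fh:
        fh.write(script)
    subprocess.check_call(["qsub", "-cwd", "-N", "sam2bam_" + name, job_path])
```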

Thanks. File format conversion will be one of our targets.

13.2 years ago by User 59 13k

Is there something wrong with using Bonnie++ or Iozone for this?
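As a rough sketch, the following assumes bonnie++ is installed and that its -d (test directory) and -s (file size in MB) options behave as in a stock build; the Lustre path is a placeholder. Running several instances at once gets closer to the concurrent-access pattern a cluster actually sees:

```python
#!/usr/bin/env python
"""Launch several bonnie++ instances against the Lustre mount at once, so the
benchmark exercises concurrent access rather than a single stream. Assumes
bonnie++ is on the PATH; check -d and -s against your installed version."""
import os
import subprocess

LUSTRE_DIR = "/lustre/benchmark/bonnie"  # placeholder test directory
FILE_SIZE_MB = 16384                     # should exceed node RAM so the page cache cannot hide disk I/O
N_CONCURRENT = 4

procs = []
for i in range(N_CONCURRENT):
    workdir = os.path.join(LUSTRE_DIR, "run%d" % i)
    os.makedirs(workdir, exist_ok=True)
    # One bonnie++ per subdirectory keeps the instances from touching each other's files.
    procs.append(subprocess.Popen(["bonnie++", "-d", workdir, "-s", str(FILE_SIZE_MB)]))

for p in procs:
    p.wait()
```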

I would second this suggestion. It is also a much more complete benchmark than just trying some I/O-heavy bioinformatics application.

Thanks a lot for your suggestion. The reason I am asking for help here is that our cluster is meant to provide services to our biology department, so we would like to demonstrate its capacity to the biologists in friendly, easily understandable terms. We think an example based on a popular application is the best way to show them the difference. Thanks again.

I wouldn't necessarily worry about demonstrating how great your disk I/O is to your biologists. They're probably far more concerned about how fast you can return their results, not how fast you can write to the filesystem.

You usually want standard benchmarks like SPEC, Bonnie and the like, but it's always good to run a synthetic, almost-real-world test to see whether the hardware supports what you really do day in, day out.

13.2 years ago

I had to write a similar benchmark.

I wrote a random FASTQ generator that follows the Illumina HiSeq error profile (kind of). Then I ran the first step of any pipeline: adapter clipping plus read quality trimming.

It seems to do the job of stressing I/O pretty nicely. The rest of the pipeline (alignment, realignment, SNP calling, etc.) is mostly CPU intensive and much less I/O bound (roughly 80/20).

The trimming applications were fastx_clipper and fastx_quality_trimmer, with EMBOSS for quality conversion and gzip for obvious reasons.

The rebalancing of read pairs (sorting paired vs. single reads after these steps) in this pipeline is a custom-made script.
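A minimal sketch of such a random FASTQ generator (not the poster's actual script): qualities start high and drift downward toward the 3' end, which is roughly the HiSeq pattern, and N_READS can be scaled until the output is large enough to stress the filesystem rather than the page cache:

```python
#!/usr/bin/env python
"""Generate random FASTQ records with a crude HiSeq-like 3' quality drop-off."""
import random
import sys

READ_LEN = 100
N_READS = 1000000   # scale up until the output no longer fits in the page cache
BASES = "ACGT"

def phred_to_char(q):
    # Sanger / Illumina 1.8+ encoding: Phred quality + 33 as an ASCII character.
    return chr(q + 33)

def random_read(i):
    seq = "".join(random.choice(BASES) for _ in range(READ_LEN))
    # Start near Q38 and let quality drift downward along the read, floored at Q2,
    # as a rough stand-in for the 3' quality decay seen in HiSeq data.
    quals = []
    q = 38
    for _ in range(READ_LEN):
        q = max(2, q - random.choice((0, 0, 0, 1)))
        quals.append(phred_to_char(q))
    return "@read_%d\n%s\n+\n%s\n" % (i, seq, "".join(quals))

def main(out=sys.stdout):
    for i in range(N_READS):
        out.write(random_read(i))

if __name__ == "__main__":
    main()
```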

Thanks a lot. I will look into the trimming applications you suggested.
