Read/Write Throughput Of Bioinformatics Storage
14.0 years ago

The growing amount of data in Bioinformatics makes read and write throughput an important detail of a computing infrastructure.

What read / write throughput do you currently have?

Which types of Bioinformatics computations need high read / write throughput?

What storage technology do you recommend?

Measurements using Perl

To estimate your read and write throughput, I offer the following one-liners. They write and then read back 1GB of random data. The measurement is meant to be I/O bound, and it tries to minimize the effects of storage acceleration techniques (caching, compression, and deduplication).

Write measurement

# Generate 1GB of random numbers and write them to a file
perl -e 'print STDERR "Generating data...\n"; while (1) { print rand() }'  | \
dd count=2097152 bs=512 | perl -e '
# Accumulate all the data in memory first, then write it out and measure with dd
@all = ();

binmode(STDIN);

while (read(STDIN, $b, 1048576)) { # Read in 1 MB chunks
   push(@all, $b);
}
close(STDIN);

print STDERR "\nWriting data...\n";
open(DD, "|dd bs=512 of=bigfile-1gb");
binmode(DD);
foreach (@all) {
  print DD "$_";
}
close(DD);'
  • 1GB of free RAM is required. Otherwise, swapping will ruin the measurement.
  • If you don't have 1GB free, change count=2097152 to count=1048576 (512MB).
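
As a quick sanity check on the sizes involved, the arithmetic behind the two count values can be reproduced in the shell. This is only a sketch, and the variable names are made up for illustration:

```shell
#!/bin/sh
# dd transfers `count` blocks of `bs` bytes each.
bs=512
full_count=2097152     # count used in the write measurement
half_count=1048576     # fallback count for machines with less free RAM

full_bytes=$((full_count * bs))
half_bytes=$((half_count * bs))

echo "count=$full_count -> $full_bytes bytes ($((full_bytes / 1024 / 1024)) MiB)"
echo "count=$half_count -> $half_bytes bytes ($((half_bytes / 1024 / 1024)) MiB)"
```

The first line reports 1024 MiB and the second 512 MiB, so the fallback halves both the file size and the RAM needed.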

Cache confuser

# 200 times generate and delete 1GB files
for i in {1..200}; do
  echo "$i of 200"
  dd count=2097152 bs=512 \
     if=<(perl -e 'while (1) { $t=pack("N", int(rand(10**15))); for my $i(1..1000) {print $t}; print "\n"}') \
     of=bigfile-1gb-decoy-$i 2> /dev/null
  rm bigfile-1gb-decoy-$i
done
  • This step is optional
  • This step will take a very long time (2 hours in my case)
  • This step keeps the read measurement from being served out of a cache, which would make it overly optimistic

Read measurement

# Read the generated file back
dd bs=512 if=bigfile-1gb of=/dev/null

Cleanup

rm bigfile-1gb

Results

We have an LSI 4600 SAN with 4 Gbps Fibre Channel 15K RPM drives. These are attached to a dedicated 2.4GHz quad-core server running OpenSolaris and ZFS. The file system is served to 32 Linux compute nodes (256 cores total) via NFSv3. The cluster is under constant load. I made the following measurements on one of the compute nodes:

Write: 49 MB/s

Read: 169 MB/s (647 MB/s if cached)

data hardware • 5.6k views

I don't see the relation to bioinformatics at all! Just because there has been large data from sequencing? This is generally the case in many IT applications. I suggest you ask on serverfault.com how to run a proper benchmark. Your method seems pretty much flawed to me.


If you want a sophisticated benchmark then you can use Bonnie++. Not everyone will take the time to run it. What I wrote here is a sequential read/write measurement that aims to be very simple to use and interpret. I worked for 3 years with various DAS, NAS, and SAN systems but have not found the theoretical estimates to have anything to do with real storage throughput.


Now I am getting a bit confused. What do you mean by "take the time to run [bonnie++]"? It is just one line on the command prompt, for example: http://www.linux.com/archive/feature/139742 At the end of that article, there are also some references to other benchmark tools, and the users weigh their pros and cons.


Actually, why not simply ask people about the throughput they see on their systems under everyday workloads?

Every benchmark will be biased towards certain circumstances and you will most likely not come up with a super-smart idea that is more reliable than the standard i/o benchmarks out there. In fact, you should be able to estimate the maximum i/o throughput of your system before you buy it, depending on disks and RAID-level used.
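
A back-of-the-envelope version of such an estimate might look like the sketch below. Every number in it is an illustrative assumption, not a measurement or a vendor figure, and the result is only an upper bound for sequential transfers; throughput on a loaded system can be far lower.

```shell
#!/bin/sh
# All values are assumptions chosen for illustration.
disks=6            # drives in the array
parity=1           # drives' worth of parity (for example 1 for RAID 5)
per_disk_mbs=100   # assumed sequential rate of a single drive, in MB/s
link_mbs=400       # assumed limit of the interconnect, in MB/s

# Only the data drives contribute to streaming throughput.
array_mbs=$(( (disks - parity) * per_disk_mbs ))

# A client cannot see more than the slower of the array and the link.
if [ "$array_mbs" -lt "$link_mbs" ]; then
    limit=$array_mbs
else
    limit=$link_mbs
fi

echo "array: $array_mbs MB/s, link: $link_mbs MB/s, upper bound: $limit MB/s"
```

With these made-up inputs the link, not the disks, sets the ceiling.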


Could someone edit the question title? It should be throughput, not throughout, shouldn't it?


With Bonnie++, you have to install it. You must be root. The installation command is different for different distros. My commands run almost anywhere (even on my Mac laptop). All you have to do is copy and paste.


@Dave, Thanks! I fixed it - please remove your comment.


My code (above) provides only 2 numbers: write throughput and read throughput. Bonnie++ provides 16 numbers for the sequential category.


You mean bonnie++ is hard to install because you have to know how to use apt or yum? Would non-root users not first use iostat / sar to check the performance of the system at hand?


@Michael Dondrup, I have 3 years of experience working with HPC storage in a Bioinformatics setting. I'm interested in what the current and expected storage throughput is for this particular community. I know that other areas of science have large storage demands (Physics, Radiology, and others) but I am not interested in those. Other fields handle their data differently. Bonnie++ is a good benchmark but I wanted something easy that any member of this community can launch in a minute. My code provides a reasonable measurement. It generates random data in RAM and dumps it to disk with dd.

14.0 years ago

I'm afraid that your attempt at benchmarking disk speed does not work as intended.

When I test the write speed on my RAID, it delivers 30.8MB/s. If I run the same test on /dev/shm (a RAM disk!) the speed is 31.2 MB/s. In both cases, the Perl process takes 100% CPU. In other words, the test is not I/O bound, but is in fact limited by CPU speed and hence does not test the speed of the disk.

If I instead use this command to test the write-speed, I get much more sensible results:

dd count=2097152 bs=512 if=/dev/zero of=bigfile-1g

On my RAID system consisting of six 7200rpm SATA disks, I get 246 MB/s whereas the RAM disk delivers 658 MB/s.
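
One way to check for this kind of bottleneck is to time the data generator on its own, with no disk involved, by sending its output to /dev/null. If the rate dd reports here is close to the "write" figure, the measurement is limited by the generator and not by storage. A minimal sketch, assuming a POSIX shell and Perl; the function name and the sizes are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: feed N blocks of 512 bytes from the Perl generator
# straight into /dev/null, so only the generator's speed is measured.
generator_only() {
    blocks=${1:-204800}   # 204800 x 512 bytes = 100 MiB by default (assumed size)
    # Resetting SIGPIPE makes sure the generator stops once dd has read enough,
    # even if the calling environment ignores that signal.
    perl -e '$SIG{PIPE} = "DEFAULT"; while (1) { print rand() }' |
        dd count="$blocks" bs=512 of=/dev/null
}

generator_only 20480      # 10 MiB is enough for a quick look
```

dd prints its transfer rate on stderr when it finishes. Running the same pipeline with of= pointing at the storage under test and comparing the two rates shows how much of the result is really due to the disk.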


@Lars Juhl Jensen, 172 MB/s write is probably more reasonable. It's actually not a bad throughput compared to most of my systems. But 1.1 GB/s read is surprising. I don't think that would be possible even with SSD. I'm pretty sure the 1GB file somehow got back into the cache. Can you try running just the write and read measurements, without the cache confuser, but with the file size set to a large percentage of your free RAM? For example (10GB -> https://gist.github.com/791799), (20GB -> https://gist.github.com/791802), or (30GB -> https://gist.github.com/791803).


You are right about my code being CPU bound. But with your command, the zeros being written are subject to compression. We have compression enabled via ZFS, so using your dd command I'm getting 126 MB/s, while the RAM disk delivered 354 MB/s. I'm currently rewriting my command to first accumulate the 1GB in memory and then write it all to disk.


Ok, I'm running an uncompressed filesystem, so performance-wise it really shouldn't matter if I write random numbers or zeros.


Lars, I updated my code. Could you please see if you get similar write results with my new code as compared to your if=/dev/zero method?


With your new code, I get an estimate of 172 MB/s write. However, that is still at the low end. If I generate a 1GB file with random numbers on /dev/shm and do a dd of the data from the RAM disk to the RAID disk, I get 220 MB/s write.
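
The comparison described here can be sketched roughly as follows: stage random data on a RAM-backed filesystem first, then time only the copy to the storage under test, so that data generation is off the clock. This assumes Linux with GNU dd. The function name, the use of /dev/urandom, and the conv=fsync flag are my own choices for the sketch, not necessarily what was used for the numbers above.

```shell
#!/bin/sh
# Hypothetical helper: stage_and_copy RAMDIR TARGETDIR [SIZE_MIB]
stage_and_copy() {
    ramdir=$1; target=$2; mib=${3:-1024}
    src="$ramdir/random-src.$$"
    dst="$target/bigfile-test.$$"

    # 1. Generate incompressible data on the RAM disk (not part of the timing).
    dd if=/dev/urandom of="$src" bs=1048576 count="$mib" 2>/dev/null || return 1

    # 2. Copy to the target. conv=fsync (GNU dd) makes dd flush the file to
    #    storage before it reports its rate.
    dd if="$src" of="$dst" bs=1048576 conv=fsync
    status=$?

    # 3. Clean up both files.
    rm -f "$src" "$dst"
    return $status
}

# Example call (the target path is a placeholder):
#   stage_and_copy /dev/shm /path/to/raid 1024
```

The rate printed by the second dd is the write figure. Without conv=fsync, part of the data may still sit in the page cache when dd reports, which flatters the result.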


The read performance that I get is 1.1 GB/s. However, this number is constant irrespective of whether I run your cache confuser or not, so I am sure it is due to caching.

13.8 years ago
Farhat ★ 2.9k

For measuring your disk I/O throughput, it would be better to use one of the more professionally developed tools. One good open source option is iozone (http://iozone.org/).

13.1 years ago
Kevin ▴ 640

In Ubuntu there's a GUI under System -> Administration -> Disk Utility that is installed by default. You need an empty disk to test write speeds, though.
