Measurements using Perl

Question

Read/Write Throughput Of Bioinformatics Storage

5

14.3 years ago

Aleksandr Levchuk 3.2k

The growing amount of data in Bioinformatics makes read and write throughput an important detail of a computing infrastructure.

What read / write throughput do you currently have?

Which types of Bioinformatics computations need high read / write throughput?

What storage technology do you recommend?

Measurements using Perl

For estimating your read and write throughput, I offer the following one-liners. They write and read 1GB of random data. The measurement is I/O bound and tries to minimize the effects of storage acceleration techniques (caching, compression, and deduplication).

Write measurement

# Generate 1GB of random numbers and write them to a file
perl -e 'print STDERR "Generating data...\n"; while (1) { print rand() }' | \
dd count=2097152 bs=512 | perl -e '
# Accumulate the data in RAM first, then time the write with dd
@all = ();

binmode(STDIN);

while (read(STDIN, $b, 1048576)) { # Read in 1 MB chunks
   push(@all, $b);
}
close(STDIN);

print STDERR "\nWriting data...\n";
open(DD, "|dd bs=512 of=bigfile-1gb");
binmode(DD);
foreach (@all) {
  print DD "$_";
}
close(DD);'

1GB of free RAM is required. Otherwise, swapping will ruin the measurement.
If you don't have 1GB free, change count=2097152 to count=1048576 (a 512MB file instead).
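
The two count values above follow from dd's 512-byte block size. A minimal sketch of that arithmetic, in case another file size is needed; `blocks_for_mib` is a hypothetical helper name and is not part of the original one-liners:

```shell
# blocks_for_mib is a hypothetical helper, not part of the original post.
# It converts a file size in MiB into the number of 512-byte dd blocks.
blocks_for_mib() {
  echo $(( $1 * 1024 * 1024 / 512 ))
}

blocks_for_mib 1024   # prints 2097152 (the 1GB run)
blocks_for_mib 512    # prints 1048576 (the reduced run)
```

The result can be passed to dd as `count=$(blocks_for_mib 1024)` together with `bs=512`.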

Cache confuser

# Generate and delete a 1GB decoy file, 200 times
for i in {1..200}; do
  echo "$i of 200"
  dd count=2097152 bs=512 \
     if=<(perl -e 'while (1) { $t=pack("N", int(rand(10**15))); for my $i(1..1000) {print $t}; print "\n"}') \
     of=bigfile-1gb-decoy-$i 2> /dev/null
  rm bigfile-1gb-decoy-$i
done

This step is optional.
This step will take a very long time (2 hours in my case).
This step keeps caching from making the read measurement overly optimistic.

Read measurement

# Read the generated file back
dd bs=512 if=bigfile-1gb of=/dev/null

Cleanup

rm bigfile-1gb

Results

We have an LSI 4600 SAN with 4 Gbps Fibre Channel 15K RPM drives. These are attached to a dedicated 2.4GHz quad-core server running OpenSolaris and ZFS. The file system is served to 32 Linux compute nodes (256 cores total) via NFSv3. The cluster is under constant load. I made the following measurements on one of the compute nodes:

Write: 49 MB/s

Read: 169 MB/s (647 MB/s if cached)
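
To put these rates in perspective, they can be converted into the time a 1GB transfer takes. This is a rough sketch that assumes dd's "MB" means 10^6 bytes (as GNU dd reports it); `seconds_for_1gb` is a hypothetical helper name:

```shell
# seconds_for_1gb is a hypothetical helper, not part of the original post.
# Assumption: 1 MB = 1000000 bytes, the unit GNU dd uses in its summary line.
# It prints how many seconds the 2097152 x 512-byte test file takes at a
# given rate in MB/s.
seconds_for_1gb() {
  LC_ALL=C awk -v rate="$1" 'BEGIN { printf "%.1f\n", (2097152 * 512) / (rate * 1000000) }'
}

seconds_for_1gb 49    # write:       prints 21.9
seconds_for_1gb 169   # read:        prints 6.4
seconds_for_1gb 647   # cached read: prints 1.7
```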

data hardware • 6.2k views

ADD COMMENT • link updated 14.1 years ago by Farhat ★ 2.9k • written 14.3 years ago by Aleksandr Levchuk 3.2k

3

I don't see the relation to bioinformatics at all! Just because sequencing has produced large data? That is generally the case in many IT applications. So I suggest you ask on serverfault.com how to run a proper benchmark. Your method seems pretty much flawed to me.

ADD REPLY • link 14.3 years ago by Michael 55k

1

If you want a sophisticated benchmark then you can use Bonnie++. Not everyone will take the time to run it. What I wrote here is a sequential read/write measurement that aims to be very simple to use and interpret. I worked for 3 years with various DAS, NAS, and SAN systems but have not found the theoretical estimates to have anything to do with real storage throughput.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

1

Now I am getting a bit confused. What do you mean by "take the time to run [bonnie++]"? It is just one line on the command prompt, for example: http://www.linux.com/archive/feature/139742 At the end of that article, there are also some references to other benchmark tools, and the users weigh their pros and cons.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

Actually, why not simply ask people about the throughput they see on their systems under everyday workloads?

Every benchmark will be biased towards certain circumstances and you will most likely not come up with a super-smart idea that is more reliable than the standard i/o benchmarks out there. In fact, you should be able to estimate the maximum i/o throughput of your system before you buy it, depending on disks and RAID-level used.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

Could someone edit the question title? It should be throughput, not throughout, shouldn't it?

ADD REPLY • link 14.3 years ago by Dave Lunt ★ 2.0k

0

Bonnie++: you have to install it. You must be root. The installation command is different for different distros. My one-liners run almost anywhere (even on my Mac laptop). All you have to do is copy and paste.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

0

@Dave, Thanks! I fixed it - please remove your comment.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

0

My code (above) provides only 2 numbers: write throughput and read throughput. Bonnie++ provides 16 numbers for the sequential category.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

0

You mean bonnie++ is hard to install because you have to know how to use apt or yum? Would non-root users not first use iostat / sar to check the performance of the system at hand?

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

@Michael Dondrup, I have 3 years of experience working with HPC storage in a Bioinformatics setting. I'm interested in what the current and expected storage throughput is for this particular community. I know that other areas of science have large storage demands (Physics, Radiology, and others) but I am not interested in those. Other fields handle their data differently. Bonnie++ is a good benchmark but I wanted something easy that any member of this community can launch in a minute. My code provides a reasonable measurement. It generates random data in RAM and dumps it to disk with dd.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

Ram · Answer 1 · 2011-01-21

6

14.3 years ago

Lars Juhl Jensen 11k

I'm afraid that your attempt at benchmarking disk speed does not work as intended.

When I test the write speed on my RAID, it delivers 30.8 MB/s. If I run the same test on /dev/shm (a RAM disk!), the speed is 31.2 MB/s. In both cases, the Perl process takes 100% CPU. In other words, the test is not I/O bound, but is in fact limited by CPU speed and hence does not test the speed of the disk.

If I instead use this command to test the write-speed, I get much more sensible results:

dd count=2097152 bs=512 if=/dev/zero of=bigfile-1g

On my RAID system consisting of six 7200rpm SATA disks, I get 246 MB/s, whereas the RAM disk delivers 658 MB/s.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.3 years ago by Lars Juhl Jensen 11k
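
The CPU-bound diagnosis in the answer above can be cross-checked by running the data generator with no storage involved at all. The following is only a sketch: `generator_rate` is a hypothetical helper, and the Perl loop is a bounded variant of the `print rand()` generator from the question, not the original one-liner.

```shell
# generator_rate is a hypothetical helper, not part of the thread.
# It produces at least the requested number of bytes of rand() text and
# sends them to /dev/null, so dd's summary line (on stderr) shows how fast
# the generator alone can run.
generator_rate() {
  bytes="$1"
  perl -e '
    my ($want, $sent) = (shift, 0);
    while ($sent < $want) {
      my $s = rand();        # same data source as the question
      print $s;
      $sent += length $s;
    }' "$bytes" | dd bs=512 of=/dev/null
}

# Usage: generator_rate 1073741824
```

If the rate reported here is close to the "write" rate measured earlier, the generator, not the disk, is the bottleneck.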

1

@Lars Juhl Jensen, 172 MB/s write is probably more reasonable. It's actually not a bad throughput compared to most of my systems. But 1.1 GB/s read is surprising. I don't think that would be possible even with SSD. I'm pretty sure the 1GB somehow got back into the cache. Can you try to run just the write and the read measurements without the cache confuser, but make the file size a large percentage of your free RAM? For example (10GB -> https://gist.github.com/791799), (20GB -> https://gist.github.com/791802), or (30 GB -> https://gist.github.com/791803).

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

0

You are right about my code being CPU bound. But for your command, writing 0s is subject to compression. We have compression enabled via ZFS, so using your dd command I get 126 MB/s, and the RAM disk delivers 354 MB/s. I'm currently rewriting my command to first accumulate the 1GB in memory and then to write it all to disk.

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k
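
The compression point in the comment above is easy to see on a small scale. This sketch uses gzip purely as a stand-in compressor (the thread does not say which algorithm the ZFS setup uses), and `compressed_size` is a hypothetical helper name:

```shell
# compressed_size is a hypothetical helper, not part of the thread.
# Assumption: gzip stands in for whatever compression the file system applies.
# It prints the size in bytes of 1 MiB read from the given source after gzip.
compressed_size() {
  head -c 1048576 "$1" | gzip -c | wc -c | tr -d ' '
}

zeros=$(compressed_size /dev/zero)
random=$(compressed_size /dev/urandom)
echo "1 MiB of zeros after gzip:       $zeros bytes"
echo "1 MiB of random data after gzip: $random bytes"
```

The zeros shrink to a tiny fraction of their size while the random data does not shrink at all, which is why a /dev/zero benchmark can look faster than the disk really is on a compressing file system.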

0

Ok, I'm running an uncompressed filesystem, so performance-wise it really shouldn't matter if I write random numbers or zeros.

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k

0

Lars, I updated my code. Could you please see if you get similar write results with my new code as compared to your if=/dev/zero method?

ADD REPLY • link 14.3 years ago by Aleksandr Levchuk 3.2k

0

With your new code, I get an estimate of 172 MB/s write. However, that is still in the low end. If I generate a 1GB file with random numbers on /dev/shm and do a dd of the data from the RAM disk to the RAID disk, I get 220 MB/s write.

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k
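
The RAM-disk staging approach described in the comment above might look roughly like this. It is a sketch under assumptions: the comment does not say how the random file was generated, so /dev/urandom is a guess, and `staged_write` and the paths in the usage line are hypothetical.

```shell
# staged_write is a hypothetical helper, not part of the thread.
# Assumption: /dev/urandom is used as the source of random data.
# Step 1 stages the data on a RAM-backed path (not timed).
# Step 2 copies it to the storage under test; dd's summary line on stderr
# reports the rate of that copy only.
staged_write() {
  stage="$1"    # file on the RAM disk
  target="$2"   # file on the storage under test
  mib="$3"      # amount of data in MiB
  dd if=/dev/urandom of="$stage" bs=1048576 count="$mib" 2>/dev/null || return 1
  dd if="$stage" of="$target" bs=1048576
  status=$?
  rm -f "$stage"
  return "$status"
}

# Usage: staged_write /dev/shm/random-1gb bigfile-1gb 1024
```

Because the random data is already in memory when the timed copy starts, the generator's CPU cost no longer affects the reported write rate.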

0

The read performance that I get is 1.1 GB/s. However, this number is constant irrespective of whether I run your cache confuser or not, so I am sure it is due to caching.

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k

score 5 · Answer 2 · 2011-03-28

5

14.1 years ago

Farhat ★ 2.9k

For measuring your disk I/O throughput, it would be better to use one of the professionally developed tools. One open source option is iozone (http://iozone.org/), which is really good.

ADD COMMENT • link 13.4 years ago by Farhat ★ 2.9k

score 1 · Answer 3 · 2011-12-09

1

13.4 years ago

Kevin ▴ 640

In Ubuntu there's a GUI under System -> Administration -> Disk Utility that is installed by default. You need an empty disk to test write speeds, though.

ADD COMMENT • link 13.4 years ago by Kevin ▴ 640