The growing amount of data in Bioinformatics makes read and write throughput an important detail of a computing infrastructure.
What read / write throughput do you currently have?
Which types of Bioinformatics computations need high read / write throughput?
What storage technology do you recommend?
Measurements using Perl
For estimating your read and write throughput, I offer the following one-liners. They write and then read back 1GB of random data. The test is I/O bound, and it tries to minimize the effects of storage acceleration techniques (caching, compression, and deduplication).
Write measurement
# Generate 1GB of random numbers and write them to a file
perl -e 'print STDERR "Generating data...\n"; while (1) { print rand() }' | \
dd count=2097152 bs=512 | perl -e '
# Accumulate all the data in RAM first, then write it out with dd,
# so that data generation does not slow down the write being measured
@all = ();
binmode(STDIN);
while (read(STDIN, $b, 1048576)) { # Read in 1 MB chunks
push(@all, $b);
}
close(STDIN);
print STDERR "\nWriting data...\n";
open(DD, "|dd bs=512 of=bigfile-1gb") or die "Cannot start dd: $!";
binmode(DD);
foreach (@all) {
print DD "$_";
}
close(DD);'
- 1GB of free RAM is required. Otherwise, swapping will kill the measurement.
- If you don't have 1GB free, then please change
count=2097152
to
count=1048576
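As a quick sanity check of the sizes involved: dd transfers count * bs bytes, so the two count values above correspond to 1GB and half of that. A minimal sketch of the arithmetic (the variable names are mine, for illustration only):

```shell
# dd transfers count * bs bytes; variable names are illustrative only
bs=512
full_count=2097152
half_count=1048576

full_bytes=$((full_count * bs))
half_bytes=$((half_count * bs))

echo "count=$full_count -> $full_bytes bytes ($((full_bytes / 1048576)) MiB)"
echo "count=$half_count -> $half_bytes bytes ($((half_bytes / 1048576)) MiB)"
```

Note that with the smaller count the output file is still named bigfile-1gb, even though it holds half the data.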
Cache confuser
# 200 times generate and delete 1GB files
for i in {1..200}; do
  echo "$i of 200"
  dd count=2097152 bs=512 \
    if=<(perl -e 'while (1) { $t=pack("N", int(rand(10**15))); for my $i(1..1000) {print $t}; print "\n"}') \
    of=bigfile-1gb-decoy-$i 2> /dev/null
  rm bigfile-1gb-decoy-$i
done
- This step is optional.
- This step will take a very long time (2 hours in my case).
- This step keeps caching from affecting the read measurement and making it overly optimistic.
Read measurement
# Read the generated file back
dd bs=512 if=bigfile-1gb of=/dev/null
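
dd prints the number of bytes transferred and the elapsed time when it finishes, but the exact format (and whether a rate is shown) differs between systems. If you only get bytes and seconds, the conversion is simple. A minimal sketch, assuming 1 MB = 1,000,000 bytes; throughput_mbs is a hypothetical helper name, and the 10 seconds in the example is a made-up value, not a measurement:

```shell
# Hypothetical helper: convert a byte count and elapsed seconds to MB/s.
# Assumes 1 MB = 1,000,000 bytes; adjust the divisor if you prefer MiB.
throughput_mbs() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# Example with a made-up elapsed time: 1073741824 bytes in 10 seconds
result=$(throughput_mbs 1073741824 10)
echo "$result MB/s"
```

Pass in the byte count and the seconds that dd reports on your system.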
Cleanup
rm bigfile-1gb
Results
We have an LSI 4600 SAN with 4 Gbps Fibre Channel, 15K RPM drives. These are attached to a dedicated 2.4GHz quad-core server running OpenSolaris and ZFS. The file system is served to 32 Linux compute nodes (256 cores total) via NFSv3. The cluster is under constant load. I made the following measurements on one of the compute nodes:
Write: 49 MB/s
Read: 169 MB/s (647 MB/s if cached)
I don't see the relation to bioinformatics at all! Just because there has been large data from sequencing? This is generally the case in many IT applications. So I suggest you ask on serverfault.com how to run a proper benchmark. Your method seems pretty much flawed to me.
If you want a sophisticated benchmark then you can use Bonnie++. Not everyone will take the time to run it. What I wrote here is a sequential read/write measurement that aims to be very simple to use and interpret. I worked for 3 years with various DAS, NAS, and SAN systems but have not found the theoretical estimates to have anything to do with real storage throughput.
Now I am getting a bit confused. What do you mean by "take the time to run [bonnie++]"? It is just one line on the command prompt, for example: http://www.linux.com/archive/feature/139742 At the end of that article, there are also some references to other benchmark tools, and the users weigh their pros/cons.
Actually, why not simply ask people about the throughput they see on their systems on everyday workload?
Every benchmark will be biased towards certain circumstances and you will most likely not come up with a super-smart idea that is more reliable than the standard i/o benchmarks out there. In fact, you should be able to estimate the maximum i/o throughput of your system before you buy it, depending on disks and RAID-level used.
Could someone edit the question title? It should be "throughput", not "throughout", shouldn't it?
Bonnie++ - you have to install it. You must be root. The installation command is different for different distros. Mine runs almost anywhere (even on my Mac laptop). All you have to do is copy and paste.
My code (above) provides only 2 numbers: write throughput and read throughput. Bonnie++ provides 16 numbers for the sequential category.
@Dave, Thanks! I fixed it - please remove your comment.
You mean bonnie++ is hard to install because you have to know how to use apt or yum? Would non-root users not first use iostat / sar to check the performance of the system at hand?
@Michael Dondrup, I have 3 years of experience working with HPC storage in a Bioinformatics setting. I'm interested in what the current and expected storage throughput is for this particular community. I know that other areas of science have large storage demands (Physics, Radiology, and others) but I am not interested in those. Other fields handle their data differently. Bonnie++ is a good benchmark, but I wanted something easy that any member of this community can launch in a minute. My code provides a reasonable measurement. It generates random data in RAM and dumps it to disk with dd.