Hi everybody,
I would like to open a discussion about the storage solutions being used in different genomic research centers. I'll start with our case and why I'd like to know what solutions other people around the world are using.
We have 6 Illumina HiSeq NGS machines and 2 MiSeqs. As you well know, these machines generate large amounts of data per day. The technological challenge is not only the amount of data (around a TB per day) but also its structure: generally, the result of a sequencing experiment consists of thousands of small files (images, control files, stats files, etc.). Until a few months ago, we were using a traditional file system to store our sequencing data, and an lsyncd daemon continuously transferred the data to the analysis machines for processing. The problems with this approach are that we need to be continuously transferring the data, we don't have a central location for the data, and there is no data reliability.
In order to solve these problems, we started using a distributed file system, MooseFS. Unfortunately, we're experiencing unacceptable transfer rates (around 1GB/hour) when transferring sequencing data. I've been doing some research, and indeed this kind of file system is not optimised for large numbers of small files. By contrast, transferring a single 8GB tarball takes only about 2 mins (roughly 240GB/hour).
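To make the tarball comparison concrete, here is a minimal sketch of how a run directory full of small files could be packed into one archive before it is copied, so the file system sees one large write instead of thousands of small ones. This is only an illustration, not what we run in production: the function name `bundle_run` and the directory layout are hypothetical, and it assumes the archive is written outside the run directory.

```python
import tarfile
from pathlib import Path


def bundle_run(run_dir, archive_path):
    """Pack every regular file under run_dir into one plain tar archive.

    Hypothetical helper: archive_path is assumed to live outside run_dir,
    otherwise the archive would try to include itself.
    Returns the number of files added.
    """
    run_dir = Path(run_dir).resolve()
    count = 0
    # Mode "w" writes an uncompressed tar; the goal is one big file, not a smaller one.
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(run_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the run folder name is kept.
                tar.add(path, arcname=str(path.relative_to(run_dir.parent)))
                count += 1
    return count
```

The resulting single file could then be copied to the distributed file system and unpacked on the analysis side. Whether that is practical while a run is still being written is a separate question.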
So, I'd like to know what you are using in your centers:
- What solutions are you using, especially in terms of file systems?
- Have you experienced the same problems with any other or the same parallel file system?
- Are you using parallel file systems at all?
- How do you transfer your data and keep it in sync between machines?
You're very welcome to add or ask for any other information. Let's discuss!
Thanks everybody in advance!
P.S: Some related posts that didn't clarify much for me:
We have ~15PB of storage servicing 30+ Illumina HiSeqs (and a variety of other platforms). We use GPFS mostly.
Hi Malachi! 30+ Illumina HiSeqs? That sounds terrifying in data generation terms! How does GPFS perform? Do you have your machines configured to write directly there, or do you use it as a backup solution? Could you please provide some numbers and a bit of information about your infrastructure? I'd really appreciate that. Thanks!