Dear Biostars-Community,
Our beautiful SLURM cluster is attached to a NAS system (TrueNAS SCALE, which uses ZFS under the hood). To keep scientific life as clean as possible, we use a shared file system to provide project-related data as well as home directories (all SLURM nodes run Ubuntu 22.04 LTS). This is where the fun begins.

Currently, we mount the remote file systems via the SMB protocol, which is horribly slow when it comes to reading and writing large numbers of small files (installing conda took an overwhelming 12 min!). We have already tuned it to the best of our knowledge, but it is still not as performant as we would like, especially given that our internal network runs at 10 Gbit/s or more. We also tried NFS, with the same result.

To take the distributed nature of a cluster into account, we also experimented with Ceph and GlusterFS. Since these systems are distributed, our overall storage capacity will be diminished (a solvable problem!), but, and this is what surprised me, neither Ceph nor GlusterFS was faster by any outstanding margin.

Since file systems are not my forte, I would not be surprised to have overlooked something. Any suggestions on this topic from your side?
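To make the comparison between protocols reproducible, a rough micro-benchmark along the following lines could be timed on each candidate mount; the path, file count, and file size below are placeholders rather than our actual layout. It mimics the many-tiny-files pattern that makes a conda install so painful:

```python
#!/usr/bin/env python3
"""Minimal sketch: time the creation of many small files on a mount point.

Assumptions (adjust for your setup): TARGET is a hypothetical directory on
the shared mount; 5000 files of 4 KiB each roughly mimics the
metadata-heavy I/O pattern of a conda install.
"""
import os
import shutil
import time

TARGET = "/mnt/shared/io_benchmark"   # placeholder mount point, change as needed
N_FILES = 5000
PAYLOAD = b"x" * 4096                 # 4 KiB per file

os.makedirs(TARGET, exist_ok=True)

start = time.perf_counter()
for i in range(N_FILES):
    with open(os.path.join(TARGET, f"file_{i:05d}.bin"), "wb") as fh:
        fh.write(PAYLOAD)
elapsed = time.perf_counter() - start

print(f"Wrote {N_FILES} small files in {elapsed:.1f} s "
      f"({N_FILES / elapsed:.0f} files/s)")

shutil.rmtree(TARGET)                 # clean up the benchmark directory
```

Running the same script against local disk first gives a baseline to compare the shared mounts against.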
Thanks in advance and cheers!
Sounds to me like there is some kind of hardware bottleneck. Is 10G Ethernet being used end to end? Are the file systems in the same VLAN, and/or is the traffic being scanned by a deep packet inspection device? If it is being scanned, getting it exempted from the scans should improve performance.
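To rule the network out quickly, iperf3 between a compute node and the NAS is the standard check. If installing tools is awkward, a dependency-free sketch like the one below gives a rough end-to-end throughput number; the port, transfer size, and invocation are placeholders:

```python
#!/usr/bin/env python3
"""Minimal sketch: raw TCP throughput between two cluster nodes.

Run `python3 net_check.py server` on one node and
`python3 net_check.py client <server-ip>` on another (file name and port
are placeholders). A result far below ~10 Gbit/s suggests the network
path, not the filesystem protocol, is the limiting factor.
"""
import socket
import sys
import time

PORT = 5201                     # arbitrary placeholder port
CHUNK = b"\0" * (1 << 20)       # 1 MiB send buffer
TOTAL_BYTES = 2 * (1 << 30)     # 2 GiB per test

def server():
    # Accept one connection and count the bytes received.
    with socket.create_server(("0.0.0.0", PORT)) as srv:
        conn, addr = srv.accept()
        with conn:
            received = 0
            while True:
                data = conn.recv(1 << 20)
                if not data:
                    break
                received += len(data)
        print(f"received {received / 1e9:.1f} GB from {addr[0]}")

def client(host):
    # Stream TOTAL_BYTES to the server and report the achieved rate.
    with socket.create_connection((host, PORT)) as sock:
        sent = 0
        start = time.perf_counter()
        while sent < TOTAL_BYTES:
            sock.sendall(CHUNK)
            sent += len(CHUNK)
        elapsed = time.perf_counter() - start
    print(f"{sent * 8 / elapsed / 1e9:.2f} Gbit/s over {elapsed:.1f} s")

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```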
Ultimately it may be your NAS hardware that is the issue. If the storage head nodes are tapped out in terms of performance, getting larger nodes may be the only solution.
I think these issues are best addressed by local people who have already done similar things. It is something to know and plan ahead of time rather than hoping for advice from strangers on the internet.