Question

Medium Sized Data Backup Strategies

7

13.8 years ago

Niallhaslam 2.3k

Hi all,

I'm interested to hear people's perspectives on backing up the contents of their group's workstations, laptops etc. in some kind of co-ordinated way, ideally with something that could be applied to a medium-sized group. My group is a mix of bioinformatics and wet lab scientists. The work in the lab, i.e. the raw data, is backed up on servers. The big analysis work is all done on clusters and HPC, so that is well backed up as well. The gap that exists is for the likes of group presentations, small analyses and smaller projects. Code is backed up using version control.

I know there is a lot of chat about NGS and the storage requirements there, but this is a different problem. It has probably been solved before, but I feel it's worth revisiting to see if anyone has found an easier solution. In previous jobs I've just rsync'd /home/ to an offsite computer and forgotten about it. In a heterogeneous work environment, however, it's not possible to do this for the whole group, which is what I would like to be able to do. As I say, the mission-critical stuff is pretty recoverable, but I would like to increase the recoverability of the rest.

First off, the specs: 10-15 users, each with say 100 GB of random datasets, code, analysis, presentations, papers, manuscripts etc. that may lurk on a mix of Windows, Mac and Linux laptops and desktops. Does anyone have experience setting up a Network Attached Storage (NAS) system, for example, where all users could read/write to a central NAS server? Any pitfalls?

What backup systems have others in place for disaster recovery of their workstations/laptops?

Has anyone successfully implemented a group-based strategy for a heterogeneous work environment in a uni setting (i.e. without paying through the nose)?

Currently looking at buying a NAS and placing it in a building at the other end of campus and filling it with traditional HDs. Synology, Drobo would be examples of what I mean.

I should say I back up my own stuff daily - but this is about having something stable for the group, who aren't as paranoid about data plans. I went to Uni in Southampton when one of the Comp Sci buildings went down: [http://www.ecs.soton.ac.uk/podcasts/video.php?id=46][BBC]

data • 5.8k views

ADD COMMENT • link updated 13.7 years ago by Giovanni M Dall'Olio 28k • written 13.8 years ago by Niallhaslam 2.3k

0

When you've had a failure is not the time you want to find out your backups weren't functioning correctly.

No matter what you choose, test it periodically.

Backups are easy, restores are hard.
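One way to act on the "test it periodically" advice is to restore a sample into a scratch directory and compare checksums against the originals. A minimal sketch, assuming both trees are reachable as local paths; the function names `tree_digests` and `verify_restore` are made up for illustration and do not belong to any backup tool:

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def verify_restore(source, restored):
    """Return (missing, changed): files absent from, or differing in, the restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    missing = sorted(set(src) - set(dst))
    changed = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, changed
```

Files that were edited after the backup ran will show up as changed, so this is most useful on data that is not being actively worked on.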

ADD REPLY • link 13.8 years ago by Gareth Palidwor ★ 1.6k

0

I just wanted to add that I haven't chosen a right answer yet as all of the suggested solutions solve different aspects of the problem very well.

ADD REPLY • link 13.7 years ago by Niallhaslam 2.3k

score 8 · Answer 1 · 2011-02-23

8

13.8 years ago

Istvan Albert 102k

I have found Dropbox to be an ideal and non-intrusive way to back up relatively small, fragmented datasets distributed over a wide variety of platforms.

Each user sets up their own Dropbox instance and makes sure to save the data that needs to be backed up into a folder that is monitored by it. For larger datasets a dedicated solution is needed, but those never work well for lots of small, fragmented pieces of information.

One important (but often forgotten) aspect of centralized large-scale backups is privacy: who can see and recover information that you may not want to be accessible to others?

ADD COMMENT • link 13.8 years ago by Istvan Albert 102k

3

Dropbox is not limited to 2 GB per user; that is just all you will get for free. Storage in any quantity tends not to be free.

ADD REPLY • link 13.8 years ago by User 59 13k

1

Dropbox seems to be limited to 2 GB per user. I use it for some stuff already and like it.

The point about privacy (permissions) is very important though, and difficult to solve for heterogeneous computing labs. Makes me nostalgic for the pure Linux days!

ADD REPLY • link 13.8 years ago by Niallhaslam 2.3k

score 6 · Answer 2 · 2011-02-23

Centralized

Having everyone log in to one central Biocluster with a dedicated NAS has many benefits:

One common environment to learn
More software ready to use
One issue = one troubleshooting
Getting a new user started takes minutes
All data is in one place

This last point makes it easy to back up data securely and efficiently. A simple rsync to a heavily firewalled off-site computer with cheap DAS storage shelves will be good enough. Here is an old but well-tested Howto: Easy Automated Snapshot-Style Backups with Linux and Rsync
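The snapshot-style idea is that files unchanged since the previous run are hard-linked rather than copied again. A minimal sketch that builds such a command using rsync's `--link-dest` option; the paths, the date-named directory layout and the function name `snapshot_command` are assumptions for illustration and not necessarily how the Howto does it:

```python
import datetime
from pathlib import Path


def snapshot_command(source, backup_root, day, previous=None):
    """Build an rsync command that writes a dated snapshot under backup_root.

    Files unchanged since `previous` are hard-linked instead of copied,
    so each snapshot looks complete but only changed files use new space.
    """
    target = Path(backup_root) / day.isoformat()
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # --link-dest points at the directory holding the earlier snapshot.
        cmd.append(f"--link-dest={Path(backup_root) / previous.isoformat()}")
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [f"{str(source).rstrip('/')}/", str(target)]
    return cmd


if __name__ == "__main__":
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    # Hypothetical paths; print the command and inspect it before running it.
    print(" ".join(snapshot_command("/home", "/mnt/backup/snapshots",
                                    today, previous=yesterday)))
```

Once the printed command looks right, the list can be handed to `subprocess.run(cmd, check=True)` from a nightly cron job.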

The rsync approach may not be the most efficient because it needs to walk through your entire dataset every night to figure out the differences. Related operations on snapshots also take a long time. The alternative is a copy-on-write transactional filesystem like ZFS (see ZFS for NGS data analysis). ZFS already knows what changed throughout the day, so to make a backup it just needs to replay the log. Things are a little more complicated with ZFS because it does not run on Linux yet. I'm looking into switching to ZFS backups anyway, because the hardware that will send and receive the backups will not be running anything else. For those parts of the infrastructure I can choose any operating system (OpenSolaris, OpenIndiana, or FreeBSD).

Two pitfalls that I encountered and solved when using a NAS:

Adding workstations to the compute infrastructure is a bad idea. It's OK if the compute nodes freeze for an hour when incidents occur (e.g. while the NFS server is fixed and rebooted); nobody will even notice. On the other hand, if you have workstations on the same NAS then the lab will be completely paralyzed (even all web browsers will freeze). There are several other reasons to keep the workstations off the NAS. Get everyone used to keeping the data on the central compute server.
Running the NFS server on the head node is also a bad idea. You should have dedicated hardware for NFS. The performance will go up. The number of freeze-ups will go down. Fixing things will be easier. You will have more flexibility. You can run the backup scripts from this server.

Decentralized

If you cannot avoid an environment where the data is spread across many workstations and laptops then I don't know of any good solutions.

HashBackup would have been a good solution if it were usable. Unfortunately it's in a beta which expires every 4 months.

HashBackup encrypts the data, so it would have gone well with what Istvan said about privacy. Unfortunately, it's not open source. For that reason I didn't start using it when I first learned about it a year ago. I don't trust single-person projects that are not open source. In the end it was a good choice not to trust it, because the developer started doing the 4-month expiration trick.

It's a good approach. Perhaps there are similar commercial solutions that do this (maybe the one that @Brad uses).

score 4 · Answer 3 · 2011-02-23

4

13.8 years ago

Brad Chapman 9.7k

Our department uses Crashplan:

http://www.crashplan.com/

Setup and installation are simple and there are clients for Windows, Mac and Linux. It's not free, but pricing is reasonable.

ADD COMMENT • link 13.8 years ago by Brad Chapman 9.7k

0

It's not free for the cloud backups, but the site seems to suggest that mirroring computers to each other or to a NAS should be free. Is that right? If so, sounds great!

ADD REPLY • link 13.8 years ago by Niallhaslam 2.3k

0

I think it is free to get started backing up locally. We run CrashPlanPro so I don't have a lot of experience with all the functionality of the free version, but if it is anything like Pro, it all runs smoothly and easily. It sounds like you could get started with the free version and see if it works for you; then you'll have a better idea of whether Pro makes sense. Glad this helps.

ADD REPLY • link 13.8 years ago by Brad Chapman 9.7k

score 3 · Answer 4 · 2011-02-23

3

13.8 years ago

User 59 13k

After getting a bit cheesed off with flaky RAID systems (in the 3-10 TB range) and frustratingly complex tape backup systems, we went out and bought 2 x 20 TB Viglen RAID systems, with plenty of room for expansion.

RAID is not backup, but effectively the second RAID unit is the disk equivalent of a tape library. We learned our lessons with RAID5 systems, so these are RAID6 with hot spares.

backup2l provides an incredibly simple backup solution between the two machines. Currently working a treat and a weight off my mind.

ADD COMMENT • link 13.8 years ago by User 59 13k

0

+1 for RAID6 with hot spares

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

0

+1 for mentioning that RAID is not necessarily a backup solution. Explanation: if you delete a file, it gets deleted from all mirrored disks.

ADD REPLY • link 13.8 years ago by Michael Schubert ★ 7.1k

score 2 · Answer 5 · 2011-02-23

For backup of ~10 Linux and Mac laptops and workstations in our group (our central admin backs up our cluster and anything on it to tape), we use a Synology DS1010 base + DX510 expansion unit, each with 5 x 2 TB SATA HDDs. We have the base and expansion unit configured as separate volumes, since if they are configured as one single volume, failure of one device crashes the other. Configured as RAID5, each volume has 7.15 TB capacity, so the whole unit has 14.3 TB for about £1800 total from microdirect.co.uk.
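The capacity figures can be roughly checked with a little arithmetic. A sketch assuming the drives are sold in decimal terabytes (10^12 bytes) while the NAS reports binary units (2^40 bytes); the small remaining gap to the 7.15 TB quoted is presumably filesystem overhead, which is a guess rather than something measured:

```python
def usable_capacity_tib(num_disks, disk_tb, parity_disks=1):
    """Usable capacity in TiB for a parity RAID array.

    parity_disks is 1 for RAID5 and 2 for RAID6; disk_tb is the
    marketed drive size in decimal terabytes (10**12 bytes).
    """
    if num_disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    usable_bytes = (num_disks - parity_disks) * disk_tb * 10**12
    return usable_bytes / 2**40


raid5 = usable_capacity_tib(5, 2)                   # one volume of 5 x 2 TB
raid6 = usable_capacity_tib(5, 2, parity_disks=2)   # same disks as RAID6
print(f"RAID5: {raid5:.2f} TiB, RAID6: {raid6:.2f} TiB")
```

This gives about 7.28 TiB per RAID5 volume, in line with the reported figure once some overhead is taken off. The same five disks as RAID6 would come to roughly 5.46 TiB per volume.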

In terms of setup and use, the Synology is extremely easy. We had it up and running in an hour out of the box, and it can be mounted easily using NFS as well as AFP from Macs (it has support for Windows too, but I haven't tried this yet). Backups are done via rsync. No troubles with this device since Aug 2010. So in terms of price, stability and ease of use, I can certainly recommend this unit for a cheap NAS backup system on the scale you are interested in.

score 1 · Answer 6 · 2011-02-23

We keep a copy of all the code in Bitbucket repositories; it is good because there is no disk space limit, but I don't recommend it for very big files.

The data, sad but true, is backed up on external hard drives (I make incremental backups every day) and on a cluster.

Since we do not work with huge data, as long as the scripts are stored in a remote repository and we can access them from anywhere, we can re-download or recreate any result starting from there.