An Alternative To Rsync (Or Useful Parameters)
13.2 years ago

Hi

Like many who work with NGS data, I have a large storage server and a cluster for computation. However, I am asked not to leave all the files on the computation cluster, as it is shared and its storage space is large but limited. The link between storage and computation runs at up to 100MB/s. I sometimes have to copy terabytes of data around. I am using rsync, but with files this large it has some limitations.

Mainly, when I want to "sync" two directories, even if many files are already present in both source and destination, it takes considerable time to compare the two. If the connection drops just before the end of a transfer, re-syncing might take up to half as long as the original transfer.

What do you use to move very large files around? Is there a way to tell rsync to store some sort of label locally for successfully transferred files (rather than checking that the two files are the same bit by bit or block by block) and skip them straight away?

rsync is otherwise a very nice tool. It works over ssh, it lets me cap the bandwidth (to avoid saturation), and it has other nice options.

time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam $mac:/tmp
building file list ...
1 file to consider
source.bam 
  1017742131 100%   67.02MB/s    0:00:14 (xfer#1, to-check=0/1)

Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 1017742131 bytes
Matched data: 0 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1017866474
Total bytes received: 42   

sent 1017866474 bytes  received 42 bytes  61688879.76 bytes/sec
total size is 1017742131  speedup is 1.00
real    0m15.624s
user    0m9.822s
sys     0m4.545s

time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam  destination:/tmp
building file list ... 
1 file to consider
source.bam
  1017742131 100%  171.02MB/s    0:00:05 (xfer#1, to-check=0/1)

Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 0 bytes
Matched data: 1017742131 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 127739
Total bytes received: 223405

sent 127739 bytes  received 223405 bytes  16332.28 bytes/sec
total size is 1017742131  speedup is 2898.36

real    0m21.620s
user    0m4.780s
sys     0m0.846s

Thanks

next-gen-sequencing • 11k views

Sorry, I read the help as "skip [check]" instead of "skip [files]". How about: --size-only "skip files that match in size"?


Have you tried the "-c" option: "skip based on checksum, not mod-time & size"


Thanks, but from the man page: "This forces the sender to checksum every regular file using a 128-bit MD4 checksum." So I definitely DO NOT want this. I think rsync does something more than just checking mod-time & size...


What about mounting a folder from the storage server on the cluster?


@Giovanni. Not really an option, and out of my "power". There is also a technical reason: the storage attached to the computing cluster has very high performance (3.2GB/s) to avoid being the bottleneck. Files have to be moved there first.

13.2 years ago

Rsync normally does a quick check, based on mod-time and size. If these don't match, it re-transfers the file, which means checksumming all of it block by block; with the '-c' option it checksums every file up front instead. Either way, the full checksum is painfully slow.
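As a rough cross-check against the second log in the question, the "speedup" rsync reports is simply the total file size divided by the bytes that actually crossed the wire. The sketch below recomputes it with awk from the figures in the two logs; `rsync_speedup` is a made-up helper name for illustration, not part of rsync.

```shell
# Recompute rsync's "speedup" figure: total size / (bytes sent + bytes received).
# rsync_speedup is a hypothetical helper name, not an rsync feature.
rsync_speedup() {
    awk -v total="$1" -v sent="$2" -v recv="$3" \
        'BEGIN { printf "%.2f\n", total / (sent + recv) }'
}

# Second run from the question: the file was already on the destination.
rsync_speedup 1017742131 127739 223405
# First run from the question: a full copy, so nothing is saved.
rsync_speedup 1017742131 1017866474 42
```

The second run moved only about 350 kB ("Matched data" equals the whole file), yet its wall time was still around 21 seconds, which fits the picture of both ends reading and checksumming the full 1 GB file.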

First, you aren't syncing timestamps, so the mod-time check will always fail, forcing a full checksum. You must specify "--times". I usually use "-a/--archive", which means "-rlptgoD".
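A minimal way to see the effect of "--times" is to try it on throwaway local directories. The sketch below assumes bash and a local rsync; the function name `demo_resync`, the file name `sample.txt`, and the temporary directory layout are all invented for the demonstration. It copies one file, then asks rsync (with a dry run) how many files a second pass would send again.

```shell
# Sketch only: demo_resync and sample.txt are hypothetical names.
demo_resync() {
    # $1 = extra rsync options for both passes (e.g. "--times", or "" for none).
    # Prints how many files a second pass with the same options would re-send.
    local opts="$1" work
    work=$(mktemp -d)
    mkdir "$work/src" "$work/dst"
    printf 'reads\n' > "$work/src/sample.txt"
    # Give the source an old mtime so a copy stamped "now" clearly differs.
    touch -t 201001010000 "$work/src/sample.txt"
    # First pass: the real copy.
    rsync --recursive $opts "$work/src/" "$work/dst/"
    # Second pass: dry run; itemized lines starting with ">f" are files to send.
    rsync --recursive $opts --dry-run --itemize-changes \
        "$work/src/" "$work/dst/" | grep -c '^>f' || true
    rm -rf "$work"
}
```

Calling `demo_resync ""` mirrors the options in the question and should report one file to re-send, while `demo_resync "--times"` should report none, because the destination mtime now matches the source and the quick check passes.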

If either endpoint uses a Windows FAT filesystem, which represents times with a 2-second resolution, the timestamp check will often fail, forcing the full checksum. Use "--modify-window=1" to resolve this.
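The timestamp tolerance can be tried out locally as well. This sketch assumes bash and a local rsync, and fakes the coarse-timestamp situation by setting two mtimes one second apart with touch; `count_resends` and `sample.txt` are hypothetical names chosen for the example.

```shell
# Sketch only: count_resends and sample.txt are hypothetical names.
count_resends() {
    # $1 = extra rsync options; prints how many files a dry run would re-send.
    local opts="$1" work
    work=$(mktemp -d)
    mkdir "$work/src" "$work/dst"
    printf 'reads\n' > "$work/src/sample.txt"
    cp "$work/src/sample.txt" "$work/dst/sample.txt"
    # Same content and size, but mtimes one second apart, as timestamp
    # rounding on one side might produce.
    touch -t 201001010000.00 "$work/src/sample.txt"
    touch -t 201001010000.01 "$work/dst/sample.txt"
    # Itemized lines starting with ">f" are files rsync would send again.
    rsync --archive $opts --dry-run --itemize-changes \
        "$work/src/" "$work/dst/" | grep -c '^>f' || true
    rm -rf "$work"
}
```

With no extra options (`count_resends ""`) the one-second difference should be enough to make rsync re-send the file; with `count_resends "--modify-window=1"` the two timestamps should be treated as equal and the file skipped.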

If syncing timestamps is not possible, or is undesirable for other reasons, try "--size-only", though this can miss changes that leave the file size the same. I wouldn't recommend it.


Thanks a lot! The --times flag seems to do the trick.

13.2 years ago

For very large files there is bbcp, and for two-way sync I use unison. Whilst neither completely solves your specific problem, they might offer some ideas about how to solve it differently.

13.2 years ago

We don't use it in our group, but if the network is an issue, you might consider a UDP-based transfer tool like Aspera. I have no idea how expensive such a solution is, but Aspera is used by NCBI and EBI. I have used it for sending files to SRA and found that it behaves similarly to rsync, but my intuition is that it is faster. I'd be really curious to hear from others using Aspera or similar tools to deal with the data transfer issues that arise in these situations.


+1 for Aspera, an excellent tool for big-data transfer.

13.2 years ago
Chris ★ 1.6k

I once encountered the same problem with rsync. I switched to a different setup for sharing data between compute nodes: I store the data in a MySQL database, which each node uses for reading and writing. Thus, I'm able to analyse data 'live' while nodes are still busy computing. Administrative care has to be taken, however, when many nodes access the MySQL server simultaneously.


By "data" do you mean the actual file content? Or just metadata about file size/checksum/modification time? If the former, I doubt I can apply it to my terabytes of data...
