Hi
Like many who work with NGS data, I have a large storage server and a cluster for computation. However, I am asked not to leave all the files on the computation cluster, as it is shared and its storage space is large but limited. The link between storage and computation runs at up to 100MB/s. I sometimes have to copy terabytes of data. I am using rsync, but with files of this size it has some limitations.
Mainly, when I want to "sync" two directories, even if many files are already present in both source and destination, it takes considerable time to compare the two. If the connection drops just before the end of a transfer, re-syncing might take up to half as long as the original transfer.
What do you use to move very large files around? Is there a way to tell rsync to store locally some sort of label for successfully transferred files, so that it skips them straight away instead of checking whether the two files are the same bit by bit or block by block?
rsync is otherwise a very nice tool. It works over ssh, it lets me cap the bandwidth (to avoid saturation), and it has other nice options.
time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam $mac:/tmp
building file list ...
1 file to consider
source.bam
1017742131 100% 67.02MB/s 0:00:14 (xfer#1, to-check=0/1)
Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 1017742131 bytes
Matched data: 0 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1017866474
Total bytes received: 42
sent 1017866474 bytes received 42 bytes 61688879.76 bytes/sec
total size is 1017742131 speedup is 1.00
real 0m15.624s
user 0m9.822s
sys 0m4.545s
time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam destination:/tmp
building file list ...
1 file to consider
source.bam
1017742131 100% 171.02MB/s 0:00:05 (xfer#1, to-check=0/1)
Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 0 bytes
Matched data: 1017742131 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 127739
Total bytes received: 223405
sent 127739 bytes received 223405 bytes 16332.28 bytes/sec
total size is 1017742131 speedup is 2898.36
real 0m21.620s
user 0m4.780s
sys 0m0.846s
Thanks
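The "label" idea from the question could be approximated outside rsync with a small local manifest. The following is only a sketch under stated assumptions, not an rsync feature: the manifest file, its JSON layout, and the function names `load_manifest`, `pending_files`, and `mark_transferred` are all hypothetical. It records the size and modification time of each file after a successful transfer and lists only the files that still need to go.

```python
import json
import os
from pathlib import Path


def load_manifest(manifest_path):
    """Return the saved {relative_path: [size, mtime_ns]} mapping, or {} if none exists."""
    try:
        with open(manifest_path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def pending_files(source_dir, manifest):
    """List files whose size or mtime differs from what the manifest recorded."""
    pending = []
    for path in sorted(Path(source_dir).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source_dir).as_posix()
        info = path.stat()
        if manifest.get(rel) != [info.st_size, info.st_mtime_ns]:
            pending.append(rel)
    return pending


def mark_transferred(source_dir, rel_paths, manifest, manifest_path):
    """Record files as transferred; call this only after the copy has succeeded."""
    for rel in rel_paths:
        info = (Path(source_dir) / rel).stat()
        manifest[rel] = [info.st_size, info.st_mtime_ns]
    tmp = f"{manifest_path}.tmp"
    with open(tmp, "w") as handle:
        json.dump(manifest, handle)
    # Rename over the old manifest so a crash cannot leave it half-written.
    os.replace(tmp, manifest_path)
```

One possible use: write the output of `pending_files` to a text file, one path per line, pass that file to rsync with its `--files-from` option, and call `mark_transferred` only if rsync exits with status 0. The manifest should live outside the source directory, otherwise it would be listed as a file to transfer. The trade-off is that the manifest trusts the destination: if a file is deleted or altered there, this sketch will not notice.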
Sorry, I read the help as "skip [check]" instead of "skip [files]". How about --size-only: "skip files that match in size"?
Have you tried the "-c" option: "skip based on checksum, not mod-time & size"?
Thanks, but from the man page: "This forces the sender to checksum every regular file using a 128-bit MD4 checksum." So I definitely DO NOT want this. I think rsync does something more than just checking mod-time & size...
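For reference, the rsync man page describes the default behaviour as a "quick check" that skips files whose size and last-modified time are unchanged; the block-by-block comparison only starts for files that fail that check. The sketch below mimics that decision for two locally visible paths. It is an illustration, not rsync's actual code: the function name `quick_check_skips` is hypothetical, and comparing modification times in whole seconds is an assumption of this sketch.

```python
import os


def quick_check_skips(source_path, dest_path, size_only=False):
    """Return True if a size/mtime comparison would treat the two files as identical."""
    try:
        src = os.stat(source_path)
        dst = os.stat(dest_path)
    except FileNotFoundError:
        return False  # a missing file always has to be transferred
    if src.st_size != dst.st_size:
        return False
    if size_only:
        # Mirrors the idea behind --size-only: ignore modification times.
        return True
    # Assumption of this sketch: compare modification times in whole seconds.
    return int(src.st_mtime) == int(dst.st_mtime)
```

Note that the commands shown in the question use neither `--times` nor `--archive`, so the destination files would not keep the source modification time. According to the man page entry for `--times`, the next transfer then behaves as if every file had changed, which would be consistent with the second run reporting the whole file as "Matched data" rather than skipping it. Adding `--times`, or using `--size-only` as suggested above, may be enough to make already-copied files skip immediately.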
What about mounting a folder from the storage server on the cluster?
@Giovanni. Not really an option, and out of my "power". There is also a technical reason: the storage attached to the computing cluster has very high performance (3.2GB/s) to avoid being the bottleneck. Files have to be moved there first.