MD5 Hashing for Many Files
8.1 years ago
jiwpark00 ▴ 230

I was wondering: is there a good way to use MD5 hashing to check many files downloaded from NCBI (genomes, sequences, etc.)?

By "many", I mean that, say, you want to download 1,000 microarray datasets. Is there a good way to use MD5 hashing to confirm that the downloaded files are identical to the originals?

I know you can write a for loop for this, but I wasn't sure if there was some cleverer way...

GNU parallel is a good friend for a bioinformatician. You can fork multiple jobs very easily:

parallel "md5sum {}" ::: *

Even on a desktop with multiple cores you can save a lot of time, and the syntax is convenient and avoids explicit looping.

I see GNU parallel recommended many times here, but how does that scale with disk I/O? OP has 1,000 files, so this surely has the potential to saturate the I/O if not invoked properly.

There are options to limit the parallel processing to a certain number of jobs and/or amount of memory, depending on the limiting factor for your system.

Yes genomax2, it might saturate the I/O if invoked improperly. But, like make, parallel has a -j n option that limits the number of simultaneous jobs to n. I usually use -j 2 or -j 4 on an 8-core system.
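
A minimal sketch of the job limit described above, assuming md5sum and GNU parallel are installed. The helper name md5_limited is hypothetical, and the xargs -P fallback is an addition for systems without GNU parallel, not something suggested in this thread.

```shell
# md5_limited JOBS FILE... : hash the given files, at most JOBS at a time.
# (Hypothetical helper name, for illustration only.)
md5_limited() {
    max_jobs=$1
    shift
    if parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
        # -j caps the number of simultaneous md5sum processes;
        # -k keeps the output in the same order as the input files.
        parallel -k -j "$max_jobs" md5sum {} ::: "$@"
    else
        # Fallback (assumption, not from the thread): xargs -P also caps
        # the number of concurrent processes, without ordering the output.
        printf '%s\0' "$@" | xargs -0 -n 1 -P "$max_jobs" md5sum
    fi
}
```

For example, `md5_limited 2 *.fastq.gz > checksums.md5` would hash two files at a time and collect the results in one file.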

The best advice is to try and measure. For details see: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/
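
One rough way to "try and measure" is to time the same hashing run at several job counts. The sketch below is an assumption-laden illustration: the helper name time_md5_jobs is hypothetical, it uses xargs -P as a stand-in so it runs without GNU parallel (swap in parallel -j to measure that instead), and it only has whole-second resolution.

```shell
# time_md5_jobs DIR JOBS... : for each job count, hash every file under DIR
# and print "jobs<TAB>elapsed seconds". (Hypothetical helper name.)
time_md5_jobs() {
    dir=$1
    shift
    for n in "$@"; do
        start=$(date +%s)
        # Hash all regular files with at most $n md5sum processes at once;
        # the hashes themselves are discarded, only the time matters here.
        find "$dir" -type f -print0 | xargs -0 -n 1 -P "$n" md5sum > /dev/null
        end=$(date +%s)
        printf '%s\t%s\n' "$n" "$((end - start))"
    done
}
```

A call such as `time_md5_jobs ./downloads 1 2 4 8` prints one line per job count. Files read in an earlier pass may be served from the operating system's cache in later passes, so the first measurement is not directly comparable with the rest unless the data set is much larger than RAM.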

I apologize for my naivety, but how do I put files from multiple directories inside {}?

In other words:

parallel "md5sum directory1/file1 directory1/file2" ::: *

Like that? Sorry, I'm not familiar with parallel commands.

Another power tool is find, e.g.:

find . -name '*.bam' | parallel -j 3 "md5sum {}"

This example targets BAM files, but you can easily customize it. It is recursive, so it also searches subdirectories for files matching '*.bam'. You can easily check its output by letting find print to your terminal or by redirecting it to a file:

find . -name '*.bam' > temp.txt
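
Since find accepts several starting directories, the same idea covers files spread over multiple directories. The following is a sketch under assumptions: the helper name md5_find is hypothetical, the job count of 3 is carried over from the example above, and the xargs branch is a fallback added for systems without GNU parallel.

```shell
# md5_find PATTERN DIR... : recursively hash files whose names match PATTERN
# under each DIR, three at a time, and print the md5sum lines sorted by path.
# (Hypothetical helper name, for illustration only.)
md5_find() {
    pattern=$1
    shift
    # -print0 together with -0 keeps file names containing spaces intact.
    find "$@" -type f -name "$pattern" -print0 |
        if parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
            parallel -0 -j 3 md5sum {}
        else
            # Fallback (assumption): -r makes xargs do nothing on empty input.
            xargs -0 -r -n 1 -P 3 md5sum
        fi |
        sort -k 2
}
```

For instance, `md5_find '*.bam' directory1 directory2 > bam.md5` writes one checksum line per BAM file found under either directory.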

You could calculate the md5sum while downloading the file, e.g.:

wget -O - http://www.address.org/file.ext | tee file.ext | md5sum > file.ext.md5

Then you can compare the md5sum computed during the download with the one published at the site of origin.
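
When the site of origin publishes a checksum list in md5sum format (one "hash, two spaces, relative path" line per file), the comparison can be left to md5sum -c. This is a minimal sketch, assuming such a list exists for your downloads; the helper name verify_downloads is hypothetical.

```shell
# verify_downloads DIR MANIFEST : check the files in DIR against MANIFEST,
# a checksum list in md5sum format with paths relative to DIR.
# Prints only the files that fail and returns non-zero if any do.
# (Hypothetical helper name, for illustration only.)
verify_downloads() {
    # The manifest is opened before the cd, so a relative MANIFEST path works;
    # the subshell leaves the caller's working directory unchanged.
    ( cd "$1" && md5sum -c --quiet ) < "$2"
}
```

A call such as `verify_downloads ./genomes published.md5` stays silent when every file matches. Note that the file.ext.md5 written by the wget pipeline above records the file name as "-", because md5sum read from standard input, so compare its first field with the published hash rather than passing that file to md5sum -c.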

BTW, this is just playing around, and I have not tested it extensively or on very large files. But if the downloads are done via rsync (not always possible), rsync can be made to report an md5sum in the log file.

rsync  -a --log-file=x --out-format="%C" ../study.table .
eab8552ece4d7ed56082192e81bf0dd1

md5sum ../study.table 
eab8552ece4d7ed56082192e81bf0dd1

The %C will report md5sums for files in the rsync output. The study.table file was ~60M in size. More testing is needed to see whether this works consistently or is of any use.

If you had access to a cluster you could start those jobs in parallel, but otherwise you are on the right track.

Thank you everyone! I appreciate the help greatly!
