jiwpark00 · 8.1 years ago
I was wondering: is there a good way to do MD5 hashing to check many files downloaded from NCBI (genomes, sequences, etc.)?
By "many", I mean, say, you want to download 1,000 microarray data files - is there a good way to run MD5 hashing to check the integrity of the files?
I know you can write a for loop for this, but I wasn't sure if there was some clever way...
gnu-parallel is a good friend for a bioinformatician. You can fork multiple jobs very easily:
parallel "md5sum {}" ::: *
Even on a desktop with multiple cores you can save a lot of time, and the syntax is very convenient and avoids explicit looping.
I see gnu parallel recommended many times here, but how does that scale with disk I/O? OP has 1,000 files, so surely this has the potential of saturating the I/O if not invoked properly.
There are options to limit the parallel processing to a certain number of jobs and/or memory usage, depending on the limiting factor for your system.
Yes, genomax2, it might saturate if invoked improperly. But you can use the -j n option to limit the number of simultaneous jobs to n. I usually use -j 2 or -j 4 on an 8-core system.
The best advice is to try and measure. For details see: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/
I apologize for my naivety, but how do I put multiple file directories inside {}?
In other words:
parallel "md5sum directory1/file1 directory1/file2" ::: * - like that? Sorry, I'm not familiar with parallel commands.
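parallel substitutes each argument listed after ::: into {}, one per job, so you can cover several directories just by listing globs for each of them (the directory names below are illustrative):

```shell
# each file matched by the globs becomes one md5sum job
parallel "md5sum {}" ::: directory1/* directory2/*

# serial equivalent, for comparison
md5sum directory1/* directory2/*
```

Both produce the same checksum lines; parallel simply spreads the md5sum invocations across cores.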
Another power tool is find, e.g.:
find . -name '*.bam' | parallel -j 3 "md5sum {}"
In this case for BAM files, but you can easily customize this. And it's recursive, so it will also search subdirectories for files matching '*.bam'. You can easily check its output by redirecting the output of find to your terminal or a file:
find . -name '*.bam' > temp.txt
You could calculate the md5sum while downloading the file, e.g.:
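One way to do this (a sketch; the URL and filename below are placeholders, not a real NCBI path) is to pipe the download through tee, which writes the file to disk while md5sum hashes the same stream:

```shell
# hypothetical URL, for illustration only
url="https://ftp.ncbi.nlm.nih.gov/geo/series/GSE0000/GSE0000_RAW.tar"
# tee writes the downloaded bytes to disk while md5sum hashes the same stream,
# so the checksum is computed during the download rather than in a second pass
curl -s "$url" | tee GSE0000_RAW.tar | md5sum
```
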
Then you could just compare the computed md5sum with the ones available at the site of origin.
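If the site of origin publishes its checksums in the standard md5sum format (hash, two spaces, filename), the comparison itself can be automated with md5sum -c (the checksums.md5 filename is illustrative):

```shell
# verify every file listed in checksums.md5;
# prints "name: OK" per file, "name: FAILED" on a mismatch
md5sum -c checksums.md5
```
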
BTW, this is just to play around, and I have not tested it for very large files or extensively. But if the downloads are done via rsync (not always possible), rsync can be made to report an md5sum in the logfile.
The %C format specifier will report md5sums for files in rsync output. The study.table file was ~60M in size. More testing is needed to see if this can work consistently or is of any use.
If you had access to a cluster, you could start those jobs in parallel, but otherwise you are on the right track.
Thank you everyone! I appreciate the help greatly!