How to run MD5 check for many fastq files in subdirectories?
2.8 years ago

Hello,

I have received Illumina sequencing reads for 100 samples. Each sample has its own subfolder containing 8 R1.fastq.gz and 8 R2.fastq.gz files. I want to run an MD5 check on all the fastq files in each subfolder.

My folder structure looks like:

/mydata/sequencing/clean
/mydata/sequencing/clean/Sample1
/mydata/sequencing/clean/Sample2
......
/mydata/sequencing/clean/Sample100

I am running the following command from the /mydata/sequencing/clean directory.

find . -type f -exec md5sum-lite {} \;

This prints the MD5 results to the terminal.

How do I get these results printed to a .txt file? Besides, is there a more elegant way to do this MD5 check? Thank you.

sequencing MD5 genome Hiseq Illumina • 5.7k views

There are three ways to do this, depending on where the checksums should be stored and how the output files should be named:

  1. To store each file's MD5 sum next to the file itself (--dry-run only prints the commands without running them; remove it to actually compute the checksums):

    $ find . -type f -name "*.pdf" | parallel --dry-run md5sum {} ">" {.}.md5sum

  2. To store one .md5sum file per input file, but collect them all in a single location (the current directory in the example below) regardless of where the input files are. Note that input files with the same name in different folders will overwrite each other's output:

    $ find . -type f -name "*.pdf" | parallel --plus --dry-run md5sum {} ">" {/.}.md5sum

  3. To write all the MD5 sums to a single file, all.md5, in the current directory:

    $ find . -type f -name "*.pdf" | parallel --plus md5sum {} > all.md5
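If GNU parallel is not installed, the first and third variants can be approximated with find alone. A rough sketch, assuming GNU md5sum and the .fastq.gz naming from the question rather than the *.pdf pattern used above; the demo directory and file names are invented for illustration:

```shell
# Demo tree standing in for the real data (hypothetical file names).
demo=$(mktemp -d)
mkdir -p "$demo/Sample1" "$demo/Sample2"
printf 'ACGT\n' | gzip > "$demo/Sample1/a_R1.fastq.gz"
printf 'TTGA\n' | gzip > "$demo/Sample2/b_R1.fastq.gz"
cd "$demo" || exit 1

# Variant 1: one checksum file next to each input.
# Unlike parallel's {.}, this appends to the full name (a_R1.fastq.gz.md5sum).
find . -type f -name '*.fastq.gz' -exec sh -c '
  for f in "$@"; do
    md5sum "$f" > "$f.md5sum"
  done
' sh {} +

# Variant 3: all checksums in one file; "{} +" passes many files per md5sum call.
find . -type f -name '*.fastq.gz' -exec md5sum {} + > all.md5

cat all.md5
```

This runs serially, so it gives up the speed-up that parallel offers on many large files.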


Thanks a lot for the helpful comment.

2.8 years ago
sklages ▴ 170

Just re-direct the output:

find . -type f -exec md5sum-lite {} \; | tee -a md5sums.txt

This way is straightforward, thus "elegant". :-)
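One caveat when the output file lives in the directory being scanned: on the next run, find . -type f will pick up md5sums.txt itself. A minimal sketch that limits the search to the FASTQ files, assuming GNU md5sum is available in place of md5sum-lite and that the reads end in .fastq.gz; the demo directory and file names are made up:

```shell
# Demo tree standing in for /mydata/sequencing/clean (hypothetical file names).
workdir=$(mktemp -d)
mkdir -p "$workdir/Sample1" "$workdir/Sample2"
printf 'read-A\n' | gzip > "$workdir/Sample1/S1_R1.fastq.gz"
printf 'read-B\n' | gzip > "$workdir/Sample2/S2_R1.fastq.gz"

cd "$workdir" || exit 1

# Hash only the FASTQ files, so the output file is never hashed itself.
# -print0 / -0 keep paths with spaces intact; sort -z gives a stable order.
# ">" overwrites on each run, whereas "tee -a" appends to earlier results.
find . -type f -name '*.fastq.gz' -print0 | sort -z | xargs -0 md5sum > md5sums.txt

cat md5sums.txt
```

The -print0, sort -z and xargs -0 options are GNU extensions (also found on most BSD systems), so this is not strictly portable.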


Thank you very much. I have another question. The sequencing company gave me a .txt file containing the MD5 checksums (32 hexadecimal digits each) for all the files. Is it possible to verify each file against the checksums in that file?


Yes:

md5sum -c my_checksums.txt

Each line of the txt file must contain the checksum followed by the actual path to the file to be checked, relative to the directory you run the command from.
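A small self-contained sketch of what a passing and a failing check look like, assuming GNU md5sum; the file names are made up, and --quiet only suppresses the per-file OK lines:

```shell
# Demo data (hypothetical names); in practice the list comes from the provider.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir Sample1
printf 'ACGT\n' > Sample1/reads_R1.fastq
md5sum Sample1/reads_R1.fastq > my_checksums.txt

# Run from the directory the listed paths are relative to.
# The exit status is 0 only if every listed file matches its checksum.
ok_status=0
md5sum -c --quiet my_checksums.txt || ok_status=$?

# Simulate a damaged transfer and check again: the file is reported as FAILED.
printf 'ACGA\n' > Sample1/reads_R1.fastq
bad_status=0
md5sum -c --quiet my_checksums.txt || bad_status=$?

echo "intact: $ok_status, damaged: $bad_status"
```

Checking the exit status rather than reading the output by eye is convenient when there are hundreds of files to verify.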


Great. Thanks. It worked.

2.8 years ago
supertech ▴ 180
$ md5sum *.fastq > hashList.txt   # write the MD5 sum of each fastq file to a list
$ md5sum -c hashList.txt          # check each file listed in hashList.txt against its recorded sum

The output is printed to the screen, but you can simply redirect it to a file.


But will it cover all the .fastq files in the different subfolders if I run it from the parent directory?


No, it won't run recursively. Your find approach works well, so there is no need to look for another solution :-)


Ok cool. Thanks a lot.


sklages answered this above.

2.8 years ago
yhoogstrate ▴ 150

I would also recommend running gzip --test *.gz. A matching MD5 sum only confirms that the gzip files were transferred successfully; it does not guarantee that the files were not already corrupt before the transfer. This also works for BAM files, which are effectively valid gzip files.
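Since *.gz only matches files in the current directory, the same test can be combined with find to cover the per-sample subfolders. A minimal sketch, with an invented demo tree containing one valid archive and one file that is deliberately not a valid gzip stream:

```shell
# Demo tree (hypothetical names): one good archive, one invalid one.
demo=$(mktemp -d)
mkdir -p "$demo/Sample1" "$demo/Sample2"
printf 'ACGT\n' | gzip > "$demo/Sample1/good_R1.fastq.gz"
printf 'this is not gzip data' > "$demo/Sample2/bad_R1.fastq.gz"

# Test every .gz file below the directory and list the ones that fail.
# gzip --test exits non-zero for a bad archive, so "! -exec ... \;" is true
# exactly for the failures, and -print writes their paths.
# gzip's own error messages go to stderr and are discarded here.
find "$demo" -type f -name '*.gz' ! -exec gzip --test {} \; -print \
  > "$demo/corrupt.txt" 2>/dev/null

cat "$demo/corrupt.txt"
```

An empty corrupt.txt means every archive decompressed cleanly.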
