Question

Comparing Two Fastq Files Linux

0

Entering edit mode

3.7 years ago

joe_genome ▴ 50

Hello community,

I've read different methods on using linux commands such as diff to compare the content between text files, was trying to apply this methods to fastq files. I'm looking to compare the fastq.gz created by using two separate bcl conversion methods, see if they produce the same fastq files.

Steps

Obtain BCL files from Illumina Sequencer
Convert BCL files to FASTQ (fastq.gz) format with: a. Method Conversion 1 (bcl2fastq) b. Method Conversion 2 (bcl-convert)
Determine if files are valid with validfastq
Compare fastq file content

Objective: Determine if sequence and quality score lengths match between files and that there are no mismatched labels found between files.

Is there any current methods other than diff that con provide this? Should I sort the files first and then compare via diff?

Thank you in advanced.

illumina genomics • 7.1k views

ADD COMMENT • link updated 3.5 years ago by Ram 44k • written 3.7 years ago by joe_genome ▴ 50

1

Entering edit mode

It would be good to use Comparing sequence files generated by bcl-convert and bcl2fastq as title instead of Comparing two files Linux. Consider editing the title.

ADD REPLY • link 3.7 years ago by GenoMax 148k

2

Entering edit mode

3.7 years ago

GenoMax 148k

joe_genome : My guess is pending differences in filtering options implemented in the two packages the fastq files generated should be very similar if not identical. The order of reads may even be the same (try out a FC with both). bcl-convert is meant to be a replacement for bcl2fastq so I would be surprised if the two packages generate very different results.

I found this nice comparison chart for bcl-convert and bcl2fastq on Illumina's site.

ADD COMMENT • link 3.7 years ago by GenoMax 148k

score 3 · Accepted Answer · 2021-04-07

3

Entering edit mode

3.7 years ago

5heikki 11k

For exact matching you just hash the files:

$ cat file1
aaaaa
aaaaa
11111
aaaaa

$ cat file2
aaaaa
aaaaa
11111
aaaab

$ md5sum file*
d10d7704b2f469251f21258d828e4c2c  file1
8645da65516e4b632f2acc572dd7f616  file2

ADD COMMENT • link 3.7 years ago by 5heikki 11k

0

Entering edit mode

Thanks for the clarification, I wanted to go more in depth and view the content of the files themselves, see where labels and other lines are differing.

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50

0

Entering edit mode

Well, then it's either diff or comm..

ADD REPLY • link 3.7 years ago by 5heikki 11k

0

Entering edit mode

Thanks :)

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50

0

Entering edit mode

5heikki OP wants to compare fastq records generated by two methods so a simple md5sum may not work since we are not assured that the order of reads would be identical in two files.

ADD REPLY • link 3.7 years ago by GenoMax 148k

1

Entering edit mode

Well the order part is easy to deal with..

md5sum <(sort file1)
md5sum <(sort file2)

ADD REPLY • link 3.7 years ago by 5heikki 11k

0

Entering edit mode

Can you sort fastq files like this? I have never tried it.

ADD REPLY • link 3.7 years ago by GenoMax 148k

1

Entering edit mode

You can sort any txt file. Could you recover a valid fastq file post this sorting? Of course not. In case there were some weird linebreaks this of course wouldn't work (but given that it's fastq format, I think the chances of having something odd are rather low..)

$ cat file1
@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC66

$ cat file2
@SEQ_ID2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC66
@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

$ md5sum file1
cd5813859537cd1b44f4d77b81770fca  file1

$ md5sum file2
ff19aaf0c31e0f2473db7c98f3a5788e  file2

$ md5sum <(sort file1)
724584b21d1fd044a77ab3b69c2a1505  /dev/fd/63

$ md5sum <(sort file2)
724584b21d1fd044a77ab3b69c2a1505  /dev/fd/63

ADD REPLY • link 3.7 years ago by 5heikki 11k

1

Entering edit mode

Good to know! This should satisfy OP's requirement then, if the files are identical in all aspects except the order. You should add this as example in your original answer.

ADD REPLY • link 3.7 years ago by GenoMax 148k

0

Entering edit mode

Wasn't aware of this capability either, will gunzip the original fastq.gz files and apply this method, appreciate it a lot :)

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50

0

Entering edit mode

You can do this:

md5sum <(zcat file.fq.gz | sort)

ADD REPLY • link 3.7 years ago by 5heikki 11k

0

Entering edit mode

The following error is presented @

sort: write failed: /tmp/sortLwtZzI: No space left on device
d41d8cd98f00b204e9800998ecf8427e  /dev/fd/63

Is this because of the tmp file?

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50

0

Entering edit mode

I was afraid of that. /tmp is generally mapped to RAM on many servers.

ADD REPLY • link 3.7 years ago by GenoMax 148k

0

Entering edit mode

oh darn, I have a server that is quite large, so I believe I could allocate here (possibly)?

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50

1

Entering edit mode

Either export TMPDIR=some_local_directory with enough space or use sort -T /dir_you_want_to_use. sort then should use that space for its work.

ADD REPLY • link 3.7 years ago by GenoMax 148k

1

Entering edit mode

Also given that it's a large task it pays off to define:

-S, --buffer-size=SIZE
    use SIZE for main memory buffer

and

--parallel=N
    change the number of sorts run concurrently to N

e.g. -S 90% --parallel=16

ADD REPLY • link 3.7 years ago by 5heikki 11k

0

Entering edit mode

Thanks, I changed the TMP file to larger local directory, seems to be doing the trick for the moment. I'll also take into consideration these flags as well. Appreciate all the help :)

ADD REPLY • link 3.7 years ago by joe_genome ▴ 50