Question

Why do R1 and R2 compressed files have different sizes?

5

6.2 years ago

MAPK ★ 2.1k

I have transcriptome data as a pair of files, R1.fastq and R2.fastq, each 10.8 GB. I compressed them with gzip R1.fastq and gzip R2.fastq, and the compressed files are now 2.2 GB and 2.4 GB. Is it possible for two compressed files to differ in size when the uncompressed files are the same size?

fastq gzip • 7.1k views

updated 6.2 years ago by dariober 15k • written 6.2 years ago by MAPK ★ 2.1k

5

File sizes should never be used as a quantitative measure of anything. Count the number of reads in both files if you want to be certain.
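A minimal sketch of such a read count, assuming standard four-line FASTQ records (no wrapped sequence or quality lines). The function name `count_fastq_reads` is made up for illustration; it reads either plain or gzip-compressed files.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    # Pick the gzip reader for .gz files, the plain reader otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    # A line count that is not a multiple of 4 suggests truncation or corruption.
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For the files in the question, one would compare `count_fastq_reads("R1.fastq.gz")` with `count_fastq_reads("R2.fastq.gz")`; the two numbers should be equal for properly paired data.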

6.2 years ago by GenoMax 147k

0

Thanks! I was submitting these pairs to NCBI SRA and wanted to make sure this won't cause any problems.

6.2 years ago by MAPK ★ 2.1k

0

As you know, I had this problem last time with the SRA submission, where the two files were asymmetric. I just wanted to submit the compressed files this time. Yes, wc -l reports the same number of lines for both files.

6.2 years ago by MAPK ★ 2.1k

1

Upload from a fast, wired connection to reduce the chance of corruption or interruption during the upload.

6.2 years ago by GenoMax 147k

score 8 · Accepted Answer · 2018-09-13

8

6.2 years ago

swbarnes2 14k

Yes. It's perfectly possible, even if the reads are the same length. One might have sequences that are a little more repetitive, and therefore more compressible. If they have the same number of lines, that's all that matters.

It is of course also possible to run gzip with different levels of compression, but you don't seem to have done that in this case.
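Both points can be illustrated with a small sketch using Python's `gzip` module. The data here are toy strings, not real reads, and the helper `gzip_size` is a hypothetical name; the default level of 6 is chosen to match the gzip command-line default.

```python
import gzip
import random


def gzip_size(data, level=6):
    """Return the size in bytes of `data` after gzip compression."""
    return len(gzip.compress(data, compresslevel=level))


rng = random.Random(0)  # fixed seed so the toy data are reproducible
n = 200_000

# Two toy "sequence" buffers with identical uncompressed length.
random_seq = "".join(rng.choice("ACGT") for _ in range(n)).encode()
repetitive_seq = ("ACGTTGCA" * (n // 8)).encode()
assert len(random_seq) == len(repetitive_seq)

# Same input size, very different compressed size.
print("random:    ", gzip_size(random_seq))
print("repetitive:", gzip_size(repetitive_seq))

# The same data compressed at the fastest and the strongest level.
print("random, level 1:", gzip_size(random_seq, level=1))
print("random, level 9:", gzip_size(random_seq, level=9))
```

The repetitive buffer is an extreme case chosen to make the effect obvious; real read files would differ far less.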

0

One might have sequences that are a little more repetitive

Mmm... The difference the OP observes is quite noticeable. If the sequence is the cause, it may indicate a problem, since read 1 and read 2 sequences should be pretty random with respect to genomic position. See my answer below for an alternative explanation. (Unless by "sequence" you also include the quality string, in which case my answer is similar to yours.)

6.2 years ago by dariober 15k

score 8 · Accepted Answer · 2018-09-14

8

6.2 years ago

dariober 15k

A wild guess... Second-in-pair reads usually have base qualities that drop faster along the read than first-in-pair reads. This makes the quality line of each FASTQ record more variable (i.e. more random and less compressible) in R2 than in R1.
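A rough simulation of this guess, under invented assumptions: the function `simulate_quality_lines` and its parameters (`drop_per_base`, `noise`) are hypothetical and not fitted to any real sequencer. It builds two sets of Phred+33 quality lines of identical total length, one with a slow, quiet quality decay (standing in for R1) and one with a faster, noisier decay (standing in for R2), then compares their compressed sizes.

```python
import gzip
import random


def simulate_quality_lines(n_reads, read_len, drop_per_base, noise, rng):
    """Build toy Phred+33 quality lines whose mean quality falls along the read."""
    lines = []
    for _ in range(n_reads):
        chars = []
        for pos in range(read_len):
            mean_q = 40 - drop_per_base * pos
            # Spread grows towards the end of the read (assumed, not measured).
            q = int(round(rng.gauss(mean_q, noise * pos / read_len)))
            q = max(2, min(40, q))  # clamp to an assumed Q2..Q40 range
            chars.append(chr(q + 33))
        lines.append("".join(chars))
    return ("\n".join(lines) + "\n").encode()


rng = random.Random(0)  # fixed seed for reproducibility
r1_quals = simulate_quality_lines(2000, 100, drop_per_base=0.02, noise=1, rng=rng)
r2_quals = simulate_quality_lines(2000, 100, drop_per_base=0.15, noise=6, rng=rng)

assert len(r1_quals) == len(r2_quals)  # same uncompressed size
print("R1-like qualities, gzipped:", len(gzip.compress(r1_quals)))
print("R2-like qualities, gzipped:", len(gzip.compress(r2_quals)))
```

In this toy setup the noisier, faster-dropping quality lines compress to a larger size even though both inputs have the same number of bytes, which is the pattern the OP reports.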
