Why do R1 and R2 compressed files have different sizes?
6.1 years ago
MAPK ★ 2.1k

I have transcriptome data in two files, R1.fastq and R2.fastq, of 10.8 GB each. I compressed both with gzip R1.fastq and gzip R2.fastq, and the compressed files are now 2.2 GB and 2.4 GB. Is it possible for two compressed files to differ in size when the uncompressed files are the same size?

fastq gzip • 7.0k views

File sizes should never be used as a quantitative measure of anything. Count the number of reads in both files if you want to be certain.
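
For what it's worth, here is a minimal Python sketch of that read count, which works on the compressed files directly. The function name `count_fastq_reads` and the file names in the comment are made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip

def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    # Read .gz files through gzip; treat anything else as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

# Hypothetical usage with the file names from the question:
# count_fastq_reads("R1.fastq.gz") == count_fastq_reads("R2.fastq.gz")
```

If the two counts match, the pair is consistent regardless of what the compressed sizes look like.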


Thanks! I was submitting these pairs to NCBI SRA and wanted to make sure this won't cause any problems.


As you know, I had this problem last time with the SRA file, where the two files were asymmetric. I just wanted to submit the compressed files this time. Yes, wc -l reports the same number of lines for both files.


Upload from a wired fast connection so there is no chance of corruption/interruption when doing the uploads.
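
If you want to double-check a file before and after a transfer, something like the Python sketch below might help. It computes an MD5 checksum and tests whether a gzip file decompresses cleanly from start to finish. The function names `md5sum` and `gzip_is_intact` are made up for illustration; how you obtain a checksum from the receiving side is outside what this sketch covers.

```python
import gzip
import hashlib
import zlib

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so truncation or CRC errors surface.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Matching checksums on both ends, plus a clean decompression, are a reasonable sign that the file was not truncated or corrupted on the way.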

6.1 years ago

Yes. It's perfectly possible, even if the reads are the same length. One might have sequences that are a little more repetitive, and therefore more compressible. If they have the same number of lines, that's all that matters.

It is of course also possible to run gzip with different compression levels, but you don't seem to have done that in this case.
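
A toy illustration of both points, as a rough Python sketch: two inputs of identical uncompressed size compress to very different sizes depending on how repetitive they are, and the same input can compress differently at different levels. It uses `zlib`, which implements the same DEFLATE algorithm as gzip but without the gzip header, so the byte counts are only approximately what gzip would produce. The data is synthetic, not real reads.

```python
import random
import zlib

def compressed_size(data, level=6):
    """Return the size in bytes of `data` after DEFLATE compression."""
    return len(zlib.compress(data, level))

rng = random.Random(0)  # fixed seed so the toy data is reproducible
n = 200_000

# Two toy inputs with the same uncompressed size.
repetitive = b"ACGT" * (n // 4)                        # one motif repeated
varied = bytes(rng.choice(b"ACGT") for _ in range(n))  # random bases

assert len(repetitive) == len(varied)
print("repetitive:", compressed_size(repetitive))
print("varied:    ", compressed_size(varied))

# Same input at two compression levels (analogous to gzip -1 and gzip -9).
print("level 1:", compressed_size(varied, 1))
print("level 9:", compressed_size(varied, 9))
```

Real reads sit somewhere between these two extremes, so a modest difference between R1 and R2 is not surprising.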


"One might have sequences that are a little more repetitive"

Mmm... The difference the OP observes is quite noticeable. If the sequence is the cause, it may indicate a problem, as read 1s and read 2s should be fairly random with respect to genomic position. See my answer below for an alternative explanation. (Unless by "sequence" you also include the quality string, in which case my answer is similar to yours.)

6.1 years ago

A wild guess... Second-in-pair reads usually have base qualities that drop faster along the read than first-in-pair reads. This makes the quality line of each FASTQ record more variable (i.e. more random and less compressible) in R2 than in R1.
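
One way to test this guess would be to compress the header, sequence and quality lines of each file separately and see which part accounts for the difference. Below is a rough Python sketch. The function name `compressed_size_by_field` and the `max_records` cap are my own choices for illustration, it assumes four-line FASTQ records, and it uses `zlib` (DEFLATE without the gzip header), so the numbers are only indicative.

```python
import gzip
import zlib

def compressed_size_by_field(path, max_records=100_000):
    """Compress headers, bases and qualities separately; return sizes in bytes."""
    opener = gzip.open if path.endswith(".gz") else open
    fields = {"header": [], "sequence": [], "quality": []}
    with opener(path, "rb") as handle:
        for i, line in enumerate(handle):
            if i >= 4 * max_records:
                break  # only sample the first max_records records
            position = i % 4
            if position == 0:
                fields["header"].append(line)
            elif position == 1:
                fields["sequence"].append(line)
            elif position == 3:
                fields["quality"].append(line)
            # position 2 is the '+' separator line and is ignored
    return {name: len(zlib.compress(b"".join(lines)))
            for name, lines in fields.items()}

# Hypothetical usage with the file names from the question:
# print(compressed_size_by_field("R1.fastq.gz"))
# print(compressed_size_by_field("R2.fastq.gz"))
```

If the "quality" entry is clearly larger for R2 while "sequence" is about the same in both files, that would be consistent with this explanation.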
