Size difference using gunzip-cat-gzip and zcat
3.8 years ago
dazhudou1122 ▴ 140

Hi Everyone,

I have some fastq data downloaded from the Illumina sequence hub. In these data, the sequences from one sample were split into four .gz files (I don't know why Illumina does that). All the files together are about 18 GB. I first tried to zcat the four files of each sample into one, but the file size inflated significantly, from 18 GB to 78 GB:

for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq); 
do zcat ${i}_L001_R1_001.fastq.gz ${i}_L002_R1_001.fastq.gz ${i}_L003_R1_001.fastq.gz ${i}_L004_R1_001.fastq.gz > ./zcat_fastq/${i}.fastq.gz ;
done

I then did it the dumb way: gunzip all the files, cat them together, and then gzip them again, but now the file size is about 15 GB.

gunzip *.gz

for i in $(ls *.fastq | rev | cut -c 19- | rev | uniq); 
do cat ${i}_L001_R1_001.fastq ${i}_L002_R1_001.fastq ${i}_L003_R1_001.fastq ${i}_L004_R1_001.fastq > ./cat_fastq/${i}.fastq ;
done

gzip *.fastq

Can anyone please advise what is going on and which method is correct? Thank you!

Best,

Wenhan

RNA-Seq sequencing

Using zcat + gzip is the slowest solution. Just use cat. See: A: How To Merge Two Fastq.Gz Files?


Your command would not gzip the concatenated FASTQ inside the cat_fastq directory. Are you sure that file is gzipped and its size is 15 GB (lower than any one component lane's uncompressed FASTQ)? That cannot be right.

Remember that zcat is gzip -dc, so in your first case the output is not gzipped: decompressing the 18 GB of .gz files can easily come to 78 GB of plain text. You're not compressing the concatenated output.
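A quick way to check what actually got written is to ask file (sample below is just a placeholder for one of your prefixes):

file ./zcat_fastq/sample.fastq.gz   # reports something like "ASCII text" for the zcat-only output, despite the .gz name
file ./cat_fastq/sample.fastq.gz    # reports "gzip compressed data" once the file has really been gzipped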

You don't need to decompress, concatenate, and recompress; you can simply concatenate the .gz files:

cat ${i}_L00{1,2,3,4}_R1_001.fastq.gz > ${i}_L000_R1_001.fastq.gz
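If you want that as a loop over all samples, here is a sketch reusing the prefix extraction from your own command (the merged_fastq output directory is just an example name and must already exist):

for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq);
do cat ${i}_L00{1,2,3,4}_R1_001.fastq.gz > ./merged_fastq/${i}.fastq.gz ;   # output stays gzipped, no recompression needed
done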

You are correct. I am sorry, I should have made it clear that I ran gzip *.fastq inside the cat_fastq folder. And yes, it is 15 GB. And if I simply cat the .gz files together, the total size is 18 GB. I just don't know where the discrepancy comes from.


The former method produced *plain text* (78 GB), and the latter one produced .gz files again (15 GB).

You forgot to compress in the first one.

for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq); 
do zcat ${i}_L001_R1_001.fastq.gz ${i}_L002_R1_001.fastq.gz ${i}_L003_R1_001.fastq.gz ${i}_L004_R1_001.fastq.gz \
    | gzip -c > ./zcat_fastq/${i}.fastq.gz ;
done

15 GB < 18 GB makes sense, because compressing one big file can give a smaller result than compressing multiple small parts separately and concatenating them.
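If you want to see the effect on your own data, you can compare the two approaches on a couple of lanes (the part1/part2 file names are placeholders):

# one gzip stream over the concatenated data
cat part1.fastq part2.fastq | gzip -c > whole.fastq.gz

# separate gzip streams, concatenated afterwards
gzip -c part1.fastq > part1.fastq.gz
gzip -c part2.fastq > part2.fastq.gz
cat part1.fastq.gz part2.fastq.gz > parts.fastq.gz

# compare the sizes
ls -l whole.fastq.gz parts.fastq.gz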


Thank you. Once I piped gzip -c in, it is 15 GB.
