3.8 years ago
dazhudou1122
Hi Everyone,
I have some FASTQ data downloaded from Illumina Sequence Hub. In these data, the sequences from each sample were split into four .gz files (I don't know why Illumina does that). All the files together are about 18G. I first tried to zcat every four files into one, but the file size inflated significantly, from 18G to 78G:
for i in $(ls *.fastq.gz | rev | cut -c 22- | rev | uniq);
do zcat ${i}_L001_R1_001.fastq.gz ${i}_L002_R1_001.fastq.gz ${i}_L003_R1_001.fastq.gz ${i}_L004_R1_001.fastq.gz > ./zcat_fastq/${i}.fastq.gz ;
done
I then did it the dumb way: gunzip all files, cat them together, and then gzip them all, but now the file size is about 15 GB.
gunzip *.gz
for i in $(ls *.fastq | rev | cut -c 19- | rev | uniq);
do cat ${i}_L001_R1_001.fastq ${i}_L002_R1_001.fastq ${i}_L003_R1_001.fastq ${i}_L004_R1_001.fastq > ./cat_fastq/${i}.fastq ;
done
gzip *.fastq
Can anyone please advise what is going on and which method is correct? Thank you!
Best,
Wenhan
Using zcat + gzip is the slowest solution. Just use cat. See A: How To Merge Two Fastq.Gz Files?
Your command would not gzip the concatenated FASTQ inside the cat_fastq directory. Are you sure that is gzipped and its size is 15G (lower than any one component lane's FASTQ)? That cannot be right.
Remember that zcat is gzip -dc, so in your first case the output is not gzipped, and the decompressed contents of 18G of .gz files can easily come to 78G. You're not compressing the concatenated output.
You don't need to decompress, concatenate, recompress - you can simply concatenate the .gz files directly.
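As a minimal sketch of the direct concatenation described above: a file made of several gzip streams joined end to end is itself a valid gzip file, so plain cat is enough. The helper name merge_lanes_cat, the output directory, and the _R1.fastq.gz output suffix are assumptions for illustration; the lane file naming follows the pattern used in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge the four lane files of one sample by plain
# concatenation. No decompression or recompression takes place.
# Usage: merge_lanes_cat SAMPLE OUTDIR
merge_lanes_cat() {
    local sample="$1" outdir="$2"
    mkdir -p "$outdir"
    # The glob expands in sorted order: L001, L002, L003, L004.
    cat "${sample}"_L00[1-4]_R1_001.fastq.gz > "${outdir}/${sample}_R1.fastq.gz"
}

# Example loop over all samples (sample name = file name minus the lane suffix):
# for f in *_L001_R1_001.fastq.gz; do
#     merge_lanes_cat "${f%_L001_R1_001.fastq.gz}" merged
# done
```

The merged file should be about the sum of the input sizes (18G here), since the compressed bytes are copied unchanged.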
You are correct. I am sorry, I should have made it clear that I ran gzip *.fastq in the cat_fastq folder. And yes, it is 15 GB. And if I simply cat them all, the file size is 18 GB. I just don't know where the discrepancy comes from.
The former method produced plain text (78G), and the latter produced .gz files again (15 GB). You forgot to compress in the first one.
15G < 18G makes sense, because compressing one big file can give a smaller result than compressing multiple small parts separately.
Thank you. Once I pipe the gzip -c in, it is 15 GB.
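For reference, the fix mentioned here might look like the sketch below: the decompressed stream is piped through gzip -c so that the merged file is actually compressed. The helper name merge_lanes_recompress and the output naming are assumptions for illustration, and gzip -dc is used in place of zcat since the two are equivalent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decompress the four lane files of one sample,
# then recompress them as a single gzip stream.
# Usage: merge_lanes_recompress SAMPLE OUTDIR
merge_lanes_recompress() {
    local sample="$1" outdir="$2"
    mkdir -p "$outdir"
    # gzip -dc writes the plain-text FASTQ to stdout; without the second
    # gzip -c the output would be uncompressed despite its .gz name.
    gzip -dc "${sample}"_L00[1-4]_R1_001.fastq.gz \
        | gzip -c > "${outdir}/${sample}_R1.fastq.gz"
}
```

This is slower than plain cat because every read is decompressed and compressed again, which is the trade-off for the smaller output reported in this thread.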