Hi,
I want to merge multiple .fastq.gz files (forward/reverse) and am using the following command:
zcat dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz | gzip > dir4/ETH002281_ACAGTG_Lall_R1.gz
Although it runs fine, it takes a huge amount of time because I can only run it on a single node. I have access to 15 nodes with 8 cores each and would like to spread the work across them. It would be great to get an idea of how to merge multiple fastq.gz files using several compute nodes, so that the job finishes as quickly as possible using the nodes' full computational power.
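For what it's worth, the only split I have thought of myself is running the forward and reverse merges as separate jobs so that at least two cores are busy at once. A rough sketch of what I mean (plain shell background jobs; actual submission would go through our scheduler):

# Merge R1 and R2 in parallel as two background jobs, one per read direction.
for READ in R1 R2; do
    zcat dir1/ETH002281_ACAGTG_L00*_${READ}_00*.fastq.gz \
         dir2/ETH002281_ACAGTG_L00*_${READ}_00*.fastq.gz \
         dir3/ETH002281_ACAGTG_L003_${READ}_001.fastq.gz \
      | gzip > dir4/ETH002281_ACAGTG_Lall_${READ}.fastq.gz &
done
wait    # block until both background jobs have finished

Thanks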
Thank you for the response. I'll try pigz in place of gzip in the zcat | gzip pipeline. True, the cat command is much faster (approx. 40x) than zcat | gzip, but I want to avoid it because it doesn't recompress the merged file, and I expect that to make a difference of GBs in the final merged file sizes.
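Something along these lines is what I have in mind, assuming pigz is installed on the node (-p sets the number of compression threads; 8 matches our cores per node):

zcat dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz \
     dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz \
     dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz \
  | pigz -p 8 > dir4/ETH002281_ACAGTG_Lall_R1.fastq.gz

The zcat side of the pipe stays single-threaded either way, so I understand the speed-up would only apply to the compression half.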
You can concatenate gzipped files and the result is still a valid compressed gzipped file; I don't really see any reason to avoid that. The difference in compression would be negligible compared to recompressing it unless you have millions of tiny files.
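In other words, something like this should be all you need; the gzip -t at the end is just an optional integrity check:

# Plain byte-wise concatenation; the result is a valid multi-member gzip file.
cat dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz \
    dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz \
    dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz \
  > dir4/ETH002281_ACAGTG_Lall_R1.fastq.gz
gzip -t dir4/ETH002281_ACAGTG_Lall_R1.fastq.gz    # exits non-zero if the stream is corrupt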
I agree that cat-ing gzip files is the best solution here. However, I vaguely remember that, strictly speaking, a gzip file produced by concatenating individual gzips is not "valid", since the footer of the concatenated file describes only the last gzip member rather than the whole file.
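You can see the effect with gzip -l, which (at least in older GNU gzip versions, if I remember right) takes the uncompressed size from the trailing ISIZE field, i.e. from the last member only; decompressing and counting bytes gives the real size. A quick illustration with two throwaway files:

printf 'first\n'  | gzip > a.gz    # 6 bytes uncompressed
printf 'second\n' | gzip > b.gz    # 7 bytes uncompressed
cat a.gz b.gz > ab.gz
gzip -l ab.gz            # "uncompressed" may show only the last member's 7 bytes
zcat ab.gz | wc -c       # prints 13, the true decompressed size

Tools that actually decompress the stream (zcat, gunzip -c, etc.) handle multi-member files fine; it is only metadata shortcuts like this that get confused.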