Parallel with zcat for merging fastq.gz files
6.5 years ago
suny.bio • 0

I am creating a single .fastq.gz file from many .fastq.gz files with the following command:

zcat 15_S15*.fastq.gz | gzip -c > combined_file.fastq.gz

  • I want to keep my original fastq.gz files and create a combined fastq.gz file, which is why I am using gzip -c.

Now I want to do this with GNU parallel.

Can anyone help me?

rna-seq RNA-Seq assembly • 8.4k views

In addition to the answer of ATpoint you could have a look at pigz for parallel compression.


cat 15_S15*.fastq.gz | pigz -p 4 > combined_file.fastq.gz

Works beautifully. Thanks a lot, WouterDeCoster.


This makes no sense. You're compressing already-compressed files, which is adding to your runtime. All you have to do is write the output of cat to a file as ATpoint shows; you don't need to recompress it.

http://mattmahoney.net/dc/dce.html#Section_11


I don't think you can write to a single file handle from multiple independent processes (you could do a small test to convince yourself). Parallel does not make sense in this case.
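A small test along these lines could look like the sketch below. Two background shell jobs append to the same temporary file; the file name and the one-second delay are arbitrary choices for illustration. Both writes arrive, but in completion order rather than launch order. This toy case does not cover large concurrent writes, where chunks from different processes may also interleave.

```shell
#!/bin/sh
# Two independent processes append to the same file.
out=$(mktemp)

# The job launched first finishes last because of the sleep.
( sleep 1; echo "launched first" >> "$out" ) &
( echo "launched second" >> "$out" ) &
wait

# Lines appear in completion order, not launch order.
result=$(cat "$out")
echo "$result"
rm -f "$out"
```

Ordering is therefore something the tool has to enforce; independent jobs will not provide it on their own.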


zcat 15_S15*.fastq.gz | parallel --pipe --block 2M > output.fastq.gz

or

zcat 15_S15*.fastq.gz | parallel --pipe -N 140000 > output.fastq.gz

The block size (--block) and the number of records per job (-N) can be configured.


Do you know first hand if this will produce sane results? See my comment above.


Setting aside the efficiency of the programs (cat and parallel) for this task, the output from cat and from parallel is shown below:

input:

$ ls
hcc1395_normal_rep1_r1.fastq.gz  hcc1395_normal_rep1_r2.fastq.gz

Output from cat, and from zcat piped through parallel (gzip is used to compress the resulting FASTQ):

$ cat hcc1395_normal_rep1_r*.fastq.gz > combined.fq.gz
$ zcat hcc1395_normal_rep1_r*.fastq.gz | parallel -k --pipe --block 2M gzip > test.fastq.gz

$ seqkit stats combined.fq.gz test.fastq.gz 
    file            format  type  num_seqs      sum_len  min_len  avg_len  max_len
    combined.fq.gz  FASTQ   DNA    663,916  100,251,316      151      151      151
    test.fastq.gz   FASTQ   DNA    663,916  100,251,316      151      151      151

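The md5sums of the two .gz files will differ, because the parallel route recompresses the data block by block; with -k the output order is kept, and the sequence statistics above are identical.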
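One way to check that two such files hold the same reads is to compare checksums of the decompressed streams rather than of the .gz files themselves. The sketch below uses a tiny made-up FASTQ file and plain gzip in place of parallel; the file names are hypothetical, and md5sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Work in a scratch directory with a tiny made-up FASTQ file.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/reads.fastq"

# Route 1: compress the whole file as one gzip member.
gzip -n -c "$tmp/reads.fastq" > "$tmp/whole.fastq.gz"

# Route 2: compress each record separately and concatenate the members,
# similar in spirit to compressing blocks in separate jobs.
{ head -n 4 "$tmp/reads.fastq" | gzip -n -c
  tail -n 4 "$tmp/reads.fastq" | gzip -n -c; } > "$tmp/blocks.fastq.gz"

# The compressed bytes differ ...
if cmp -s "$tmp/whole.fastq.gz" "$tmp/blocks.fastq.gz"; then
    bytes=same
else
    bytes=different
fi

# ... but the decompressed content is identical.
sum_whole=$(gzip -dc "$tmp/whole.fastq.gz" | md5sum | cut -d ' ' -f 1)
sum_blocks=$(gzip -dc "$tmp/blocks.fastq.gz" | md5sum | cut -d ' ' -f 1)

echo "compressed bytes: $bytes"
echo "$sum_whole"
echo "$sum_blocks"
rm -rf "$tmp"
```

The same comparison applies to real data by replacing the two scratch files with the files being compared.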

MultiQC results from FastQC on both files: test


I'm getting this error:

parallel: Error: --pipe/--pipepart must have a command to pipe into (e.g. 'cat').

@OP: Try

 parallel -k --block 2M  zcat ::: 15_S15*.fastq.gz > test.fastq.gz

Results on example files:

$ seqkit stats *.gz
file                             format  type   num_seqs      sum_len  min_len  avg_len  max_len
hcc1395_normal_rep1_r1.fastq.gz  FASTQ   DNA     331,958   50,125,658      151      151      151
hcc1395_normal_rep1_r2.fastq.gz  FASTQ   DNA     331,958   50,125,658      151      151      151
hcc1395_normal_rep2_r1.fastq.gz  FASTQ   DNA     331,958   50,125,658      151      151      151
hcc1395_normal_rep2_r2.fastq.gz  FASTQ   DNA     331,958   50,125,658      151      151      151
hcc1395_normal_rep3_r1.fastq.gz  FASTQ   DNA     331,956   50,125,356      151      151      151
hcc1395_normal_rep3_r2.fastq.gz  FASTQ   DNA     331,956   50,125,356      151      151      151
test.fastq.gz                    FASTQ   DNA   1,991,744  300,753,344      151      151      151

This works like a charm.

Thanks a lot cpad0112


But there is no need to decompress and compress again; ATpoint's answer is what you need, not this. It works, but it can't be efficient.


@OP: The output will be in plain .fastq format, not gzipped; I overlooked that part. You need to add a compression command to get a .gz file.
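A quick way to tell whether a file named .gz is really compressed is gzip -t, which exits non-zero on data that is not valid gzip. The sketch below builds two hypothetical scratch files to show both outcomes; the helper name check is made up for this example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/reads.fastq"

# A plain-text file that merely carries a .gz name.
cp "$tmp/reads.fastq" "$tmp/plain.fastq.gz"
# A file that was actually compressed.
gzip -c "$tmp/reads.fastq" > "$tmp/real.fastq.gz"

# gzip -t exits non-zero when the file is not valid gzip data.
check() {
    if gzip -t "$1" 2>/dev/null; then echo "gzip"; else echo "not gzip"; fi
}

plain_status=$(check "$tmp/plain.fastq.gz")
real_status=$(check "$tmp/real.fastq.gz")
echo "plain.fastq.gz: $plain_status"
echo "real.fastq.gz: $real_status"
rm -rf "$tmp"
```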

6.5 years ago
ATpoint 85k

The command is cat 15_S15*.fastq.gz > combined.fq.gz. There is no need to gunzip or gzip anything: the gzip format allows several compressed members in one file, so the compressed files can simply be concatenated with plain cat. You cannot parallelize this step.
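To see this on a small scale, the sketch below compresses two tiny made-up FASTQ files separately, concatenates the .gz files with cat, and checks that decompressing the result gives both inputs back to back. All file names are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/a.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$tmp/b.fastq"

# Compress each file on its own, keeping the originals (-c writes to stdout).
gzip -c "$tmp/a.fastq" > "$tmp/a.fastq.gz"
gzip -c "$tmp/b.fastq" > "$tmp/b.fastq.gz"

# Concatenate the compressed files directly; no decompression involved.
cat "$tmp/a.fastq.gz" "$tmp/b.fastq.gz" > "$tmp/combined.fq.gz"

# Decompressing the result yields both inputs in order.
combined=$(gzip -dc "$tmp/combined.fq.gz")
expected=$(cat "$tmp/a.fastq" "$tmp/b.fastq")
lines=$(gzip -dc "$tmp/combined.fq.gz" | wc -l | tr -d ' ')
echo "$lines lines"
rm -rf "$tmp"
```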
