I am trying to run clumpify (from the BBTools package) in order to deduplicate reads from multiple compressed Fastq PE files. Is that possible with clumpify, without first concatenating all files? So far I've tried:
clumpify.sh in=L1_R1.fq.gz,L2_R1.fq.gz in2=L1_R2.fq.gz,L2_R2.fq.gz out=dd_R1.fq.gz out2=dd_R2.fq.gz ziplevel=2 dedupe=t
but this resulted in an error - looks like the "," syntax is not supported here. I also tried:
clumpify.sh in=<(zcat L1_R1.fq.gz L2_R1.fq.gz) in2=<(zcat L1_R2.fq.gz L2_R2.fq.gz) out=dd_R1.fq.gz out2=dd_R2.fq.gz ziplevel=2 dedupe=t
This one just gets stuck forever - I don't think it's doing anything, it's just waiting.
I was able to do what I want using dedupe.sh from the same package, but based on a test with a single file, it is much, much slower than clumpify.
Any ideas?
Hi Lior,
The correct syntax for BBTools to read from stdin would be something like "cat foo.fq | script.sh in=stdin.fq", but that runs into trouble when there are multiple input streams, and I'm unfamiliar with the process-substitution notation you're using. It does work for interleaved fastqs, though.
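Building on the interleaved-stdin route, one possible workaround is to interleave each twin pair on the fly with reformat.sh (also part of BBTools) and pipe the combined stream in. This is only a sketch under the assumption that clumpify accepts an interleaved stdin stream with interleaved=t and can still split pairs back out via out/out2; the filenames are the ones from the question:

```shell
# Sketch: interleave each twin pair with reformat.sh (BBTools), stream the
# combined result into clumpify via stdin, and split pairs on output.
# Assumes reformat.sh and clumpify.sh are on PATH.
{
  reformat.sh in1=L1_R1.fq.gz in2=L1_R2.fq.gz out=stdout.fq
  reformat.sh in1=L2_R1.fq.gz in2=L2_R2.fq.gz out=stdout.fq
} | clumpify.sh in=stdin.fq interleaved=t dedupe=t ziplevel=2 \
    out=dd_R1.fq.gz out2=dd_R2.fq.gz
```

This avoids writing a concatenated copy to disk, at the cost of an extra reformat pass per lane.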
Dedupe is intended for assemblies rather than raw reads, so clumpify is the best tool here. Unfortunately, with paired reads in twin files, you would need to concatenate first. Generally you don't need to decompress them before concatenation though (except in certain rare scenarios with incorrect gzip implementations) so it should be fast, at least.
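The concatenation-without-decompression trick works because gzip members placed back-to-back form one valid gzip stream (RFC 1952). A tiny self-contained demonstration (a.fq.gz and b.fq.gz are throwaway demo files, not real reads):

```shell
# Two gzip members concatenated with plain cat decompress as one stream,
# so twin fastq.gz files can be combined without a zcat round-trip.
printf '@r1\nACGT\n+\nIIII\n' | gzip > a.fq.gz   # one 4-line fastq record
printf '@r2\nTGCA\n+\nIIII\n' | gzip > b.fq.gz   # another record
cat a.fq.gz b.fq.gz > all.fq.gz                  # no decompression needed
zcat all.fq.gz | wc -l                           # 8 lines: both records intact
```

The same `cat *_R1.fq.gz > all_R1.fq.gz` pattern applies to the real files, done separately for R1 and R2 so the pairing order is preserved.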
Thanks Brian. Unfortunately (for me), concatenating the files would require a lot of extra disk space, so I'll have to find another solution. This would be a handy feature to add, if the tool is still being developed.