I am trying to run clumpify (from the BBTools package) in order to deduplicate reads from multiple compressed Fastq PE files. Is that possible with clumpify, without first concatenating all files? So far I've tried:
clumpify.sh in=L1_R1.fq.gz,L2_R1.fq.gz in2=L1_R2.fq.gz,L2_R2.fq.gz out=dd_R1.fq.gz out2=dd_R2.fq.gz ziplevel=2 dedupe=t
but this resulted in an error - looks like the "," syntax is not supported here. I also tried:
clumpify.sh in=<(zcat L1_R1.fq.gz L2_R1.fq.gz) in2=<(zcat L1_R2.fq.gz L2_R2.fq.gz) out=dd_R1.fq.gz out2=dd_R2.fq.gz ziplevel=2 dedupe=t
This one just gets stuck forever - I don't think it's doing anything, it's just waiting.
I was able to do what I want using dedupe.sh from the same package, but based on a test with a single file, it is much, much slower than clumpify.
Any ideas?
Hi Lior,
The correct syntax for BBTools to read from stdin would be something like "cat foo.fq | script.sh in=stdin.fq", but that runs into trouble when there are multiple input streams, and I'm unfamiliar with the process-substitution notation you're using. It does work for interleaved fastqs, though.
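Building on the interleaved-stdin route, one possible workaround is to interleave each twin pair on the fly with reformat.sh (also part of BBTools) and pipe the combined stream in. This is only a sketch under the assumption that clumpify accepts an interleaved stdin stream with interleaved=t and can still split pairs back out via out/out2; the filenames are the ones from the question:

```shell
# Sketch: interleave each twin pair with reformat.sh (BBTools), stream the
# combined result into clumpify via stdin, and split pairs on output.
# Assumes reformat.sh and clumpify.sh are on PATH.
{
  reformat.sh in1=L1_R1.fq.gz in2=L1_R2.fq.gz out=stdout.fq
  reformat.sh in1=L2_R1.fq.gz in2=L2_R2.fq.gz out=stdout.fq
} | clumpify.sh in=stdin.fq interleaved=t dedupe=t ziplevel=2 \
    out=dd_R1.fq.gz out2=dd_R2.fq.gz
```

This avoids writing a concatenated copy to disk, at the cost of an extra reformat pass per lane.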
Dedupe is intended for assemblies rather than raw reads, so clumpify is the best tool here. Unfortunately, with paired reads in twin files, you would need to concatenate first. Generally you don't need to decompress them before concatenation though (except in certain rare scenarios with incorrect gzip implementations) so it should be fast, at least.
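The concatenation-without-decompression trick works because gzip members placed back-to-back form one valid gzip stream (RFC 1952). A tiny self-contained demonstration (a.fq.gz and b.fq.gz are throwaway demo files, not real reads):

```shell
# Two gzip members concatenated with plain cat decompress as one stream,
# so twin fastq.gz files can be combined without a zcat round-trip.
printf '@r1\nACGT\n+\nIIII\n' | gzip > a.fq.gz   # one 4-line fastq record
printf '@r2\nTGCA\n+\nIIII\n' | gzip > b.fq.gz   # another record
cat a.fq.gz b.fq.gz > all.fq.gz                  # no decompression needed
zcat all.fq.gz | wc -l                           # 8 lines: both records intact
```

The same `cat *_R1.fq.gz > all_R1.fq.gz` pattern applies to the real files, done separately for R1 and R2 so the pairing order is preserved.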
Thanks Brian. Unfortunately (for me), concatenating the files would require a lot of extra disk space, so I'll have to find another solution. This would be a handy feature to add, if the tool is still being developed.