I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all; it simply reorders reads to maximize gzip compression. Therefore, the output files are still fully compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It's quite simple to use:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder
This command assumes paired, interleaved reads or single-end reads. Clumpify also now works with paired reads in twin files, like this:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz reorder
How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.
Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.
How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long. That's on our cluster, though, with a high-performance parallel filesystem. On some filesystems, or with a single spinning disk, it may take several times longer when data does not fit into memory, because multiple files are constantly being read and written at once.
Why does this increase speed? A lot of processes are I/O-limited. For example, on a multicore processor, running BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output bytes, meaning that a raw file of a given size will decompress roughly X% faster if it is compressed Y% smaller; X and Y are proportional, though not quite 1-to-1. In my tests, assemblies with Spades and Megahit see time reductions from Clumpified input that more than pay for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times (and, according to Megahit's author, due to improved cache locality). Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit, due to improved cache locality).
When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.
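For example, a pipeline that keeps clump order intact through adapter-trimming might look like this (a sketch only; the file names and trimming parameters are placeholders to be adjusted for your library):
clumpify.sh in=raw.fq.gz out=clumped.fq.gz reorder
bbduk.sh in=clumped.fq.gz out=trimmed.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 ordered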
When should Clumpify not be used? There are a few cases where it probably won’t help:
- For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or a very high error rate (like raw PacBio data). It won't hurt anything, but it won't accomplish anything either.
- For large volumes of amplicon data. This may or may not help; if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. It won't hurt anything, though, and if pivots are randomly selected from variable regions, it might still increase compression.
- When your process depends on the order of reads. If you always grab the first million reads from a file, assuming they are a good representation of the rest of the file, Clumpify will invalidate that assumption, just like grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this was never a good practice, so if you are currently doing that, now would be a good opportunity to change your pipeline anyway. Random subsampling is a much better approach (see the example after this list).
- If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).
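As mentioned in the order-dependent case above, random subsampling is the better way to grab a representative subset; with BBTools that might look something like this (a sketch; the sampling fraction is arbitrary):
reformat.sh in=clumped.fq.gz out=subsampled.fq.gz samplerate=0.1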
Please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.
Edit: Although I was unaware when I wrote Clumpify, I have learned that there are prior examples of programs designed to put reads into buckets containing identical substrings - for example, SCALCE and MINCE. The approaches are quite different but the objective and results are similar. So, I encourage people seeking maximal compression (or planning to build something even better) to investigate those tools as well!
Edit: I've now released version 37.24 which has some nice optical deduplication improvements. It's now faster (many times faster in certain degenerate cases), and there are improvements in precision for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this:
clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent
This will remove all normal optical duplicates and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist), rather than the previous behavior, in which they could be in any tiles and could share their X-coordinate. This means there are fewer false positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq the tile-edge duplicates are only present on the tile X-edges, and the duplicates only occur between adjacent tiles.
Normal optical duplicate removal (HiSeq, MiSeq, etc) should use this command:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
Great work. Talking about fastq compression, it is worth mentioning this paper, and fqzcomp and dsrc in particular.
Also worth noting:
- Alapy
- uQ
BBTools supports both fqzcomp and alapy (and pigz and pbzip2) if they are installed and available on the command line, meaning that you can do something like bbduk.sh in=reads.fq.ac ref=phix.fa.bz2 out=clean.fqz and everything will be appropriately compressed/decompressed. However, in my tests, while Clumpify greatly improves .gz and .bz2 compression, it makes .fqz (fqzcomp) and .ac (alapy) compression slightly worse (presumably because consecutive headers become less similar after reordering). I think I had some kind of problem testing dsrc, which is why it's not currently supported, but I should re-evaluate that.
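For instance, since Clumpify improves .bz2 compression as well, writing bzip2 output directly should work along these lines (a sketch; it assumes pbzip2 or bzip2 is available):
clumpify.sh in=reads.fq.gz out=clumped.fq.bz2 reorder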
Nice work; this is especially useful for deep sequencing data.
I am using clumpify on HiSeq4000 data. As recommended, I increased dupedist to 2500 because of patterned flow cells. However, I get exclusively optical duplicates using this option:
Gives the same output as:
Any ideas? Having every duplicate classified as optical does not seem right.
See if this explanation helps.
If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically. So please use:
or
Perhaps I should clarify that...
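For reference, a patterned-flow-cell run (e.g. HiSeq 4000) targeting optical/clustered duplicates might look like the sketch below, using the dupedist value mentioned above; dropping optical and dupedist would instead remove all duplicates regardless of location:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dupedist=2500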
I just tested clumpify with NA12878 fastq files.
The size of the original read1 file (SRR952827_1.fastq) is 211G, but the size of the clumpified data (clump.SRR952827_1.fastq) is 182G. Why is it smaller? I didn't enable any deduplication. The command is:
Edit - for some reason I overlooked the fact that your files are not gzipped. The output size should be identical for uncompressed files unless the program was terminated before it was done writing. Can you post the exact command and the screen output, as well as the BBTools version? Also, can you run "wc" on both files?
@Brian: Based on the names the files appear to be un-compressed though.
Oh, hmm, that's odd. I guess I did not read the question carefully enough. Let me edit that post...
chen: How much RAM did you assign for this job?
My system has 1T RAM, and I noticed 738G memory was available and clumpify automatically set the JVM memory limit to be 738G.
So it's based on the similarity of sequences? If two reads (50 bp) overlap by 25 bp, will they be clustered together? Usually, we use HiSeq 2500 to do RNAseq, CLIPseq, 3'reads-seq, etc. Does the type of library affect the deduplication result? Thanks very much.
I like the tool. I'm using clumpify.sh followed by reformat.sh. Is there a way to sort by the number of copies in the output?
No. You will need to devise a custom script yourself.
Hi, I used the command -
clumpify.sh in=file1.fastq out=file1_dedup.fastq dedupe=t
However, I still see duplicates in my output file. Am I missing something?
What kind of duplicates and what makes you certain you still see them?
These should be PCR duplicates. After aligning them to my reference, I see the same read mapping to the same chromosome, having the same start site multiple times.
Based on your command line, this seems to be single-end sequencing data? Even if the start is identical, the lengths of the reads that are mapping may be different. If you have single-end data, then each of those reads would be considered unique.
Can you post the stats from your clumpify.sh run? You should have the number of clumps formed and the number of duplicates found clearly noted there.

I have paired-end sequencing data from MiSeq. For an analysis, I am required to merge the reads. So I merged the reads using bbmerge.sh and then used clumpify. I have the same start, and the length of the reads is also the same.
I did remove duplicates, but I can still find duplicates in my fastq file.
Is it because I am merging the two fastq files that I see something like this?
Can you please run clumpify.sh on the original data files that have not been trimmed or otherwise manipulated? Merge the reads after you remove duplicates.

Note: Have you checked the section on "When should Clumpify not be used" in the original post to make sure nothing applies in your case? I assume these are amplicon data?
I tried it that way too. I used clumpify on original data but still found a lot of duplicates.
This is RNA-Seq data from a virus. The length of the genome is ~3000 bp. I need the duplicates removed because I am looking at heterogeneity at a particular location and I need deduplicated reads for it.
This may be another case where overly deep coverage is causing an issue. Why not just map the reads as-is (you should not be deduplicating RNAseq data in normal cases) and look at the region you are interested in?
Edit: You are using the default subs=2 option since you did not change it. If you want to allow only perfect matches, make it more stringent (e.g. subs=0), or relax it if you want to allow more differences in reads that are considered duplicates.
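For example, exact-match-only deduplication would be something along these lines (a sketch; file names are placeholders):
clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe subs=0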
Thank you for developing such a great product. Do you have recommendations for parameters to run dedupe with for sequences from an Illumina iSeq? Such as what number to use for dist, or whether any other parameters should be used with the optical parameter, similar to NextSeq? Thank you for your time and help.
NextSeq flow cells are particularly large in size so it would likely not be appropriate to use those settings with iSeq. You may want to experiment with HiSeq settings to see if you have a problem with duplicates first.
Okay thank you very much. The HiSeq settings seem to work just fine.
Very useful tool!
After I removed duplicate reads from the raw PE reads using the following command:
I rechecked the duplication level of the deduplicated PE reads using FastQC. The result showed that there are still many duplicate reads remaining.
Any suggestions? Thanks!
A suggestion on your command usage:
is a LOT better than
GNU Screen/tmux are better than nohup because of the amount of control they offer.
Take a look at this blog post from the authors of FastQC on the subject of duplication. You are only allowing perfect matches, so unless the sequence is identical for the R1 read, they are not going to be removed.
What would the optimal dist parameters be for removing duplicates from MiSeq or MiniSeq data?

Cluster duplicates (as in adjacent wells having the same fragment) are not possible with MiSeq (and likely MiniSeq), so you probably should not be looking for them (I assume you are referring to dupedist=).