I have DNA-seq data from a 30-gene capture panel, with the UMI for each read stored in the FASTQ header. The panel is used for variant detection in tissue and cell-free DNA samples at high coverage (>1000X). The UMIs will help remove duplicates that arise from sequencing the same DNA fragment multiple times, which gives a better estimate of the allele fractions. However, we also want to reduce false variant calls caused by sequencing errors as much as possible, so we need to generate a consensus sequence for each DNA fragment from its duplicates (reads with the same UMI and mapping position). This is similar to "Extract consensus sequence reads (collapse PCR duplicates) from bam", but not exactly the same, as the UMIs are in the FASTQ read ID rather than in the read sequence itself.
In a similar situation (RNA-seq with UMIs), I have successfully used UMI-tools to deduplicate the mapped reads. UMI-tools dedup retains the read with the highest mapping quality or the lowest position, or chooses one at random. That is fine for RNA-seq, but not for variant calling, where the sequence of the mapped read matters.
There is also clumpify ("Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.") but this appears to work on the (FASTQ) reads only. For high-depth capture-seq, that would mean all reads matching the same position are collapsed into a single read, even if their UMIs differ (if I've understood the documentation correctly).
Are there any other tools that can use the UMIs to deduplicate the reads and generate a consensus sequence from each group of duplicates?
I can't give an answer, but can you split the FASTQ on the UMIs? If you can, you could maybe do something with a clustering or de novo assembly tool.
Some cluster tools have a consensus option: https://drive5.com/usearch/manual/output_files.html
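To make the "split the FASTQ on the UMIs" idea concrete, here is a minimal Python sketch that groups FASTQ records by the UMI found in the header. It assumes (this is not stated in the question) that the UMI is the part of the read ID after the last underscore, e.g. `@READID_ACGT`; the function names `read_fastq`, `umi_from_header` and `group_by_umi` are made up for illustration, so adjust the parsing to your actual header layout.

```python
import io
from collections import defaultdict
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        header, seq, _plus, qual = record
        yield header, seq, qual


def umi_from_header(header):
    """Extract the UMI from a FASTQ header line.

    ASSUMPTION: the read ID is the first whitespace-delimited token and the
    UMI is everything after its last underscore. Change this to match your data.
    """
    read_id = header[1:].split()[0]
    return read_id.rsplit("_", 1)[-1]


def group_by_umi(handle):
    """Return a dict mapping each UMI to its list of (header, seq, qual) records."""
    groups = defaultdict(list)
    for header, seq, qual in read_fastq(handle):
        groups[umi_from_header(header)].append((header, seq, qual))
    return dict(groups)


# Tiny in-memory example with made-up reads
example = io.StringIO(
    "@r1_ACGT 1:N:0\nAAAA\n+\nIIII\n"
    "@r2_ACGT 1:N:0\nAAAT\n+\nIIII\n"
    "@r3_TTGA 1:N:0\nCCCC\n+\nIIII\n"
)
groups = group_by_umi(example)
for umi, records in sorted(groups.items()):
    print(umi, len(records))
```

Note that grouping on the UMI alone ignores the mapping position, so for a capture panel each group would still need to be split by position after alignment before building a consensus.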
graeme.thorn: You can just clump the reads together with clumpify based on how strict you want the sequence identity to be. You don't have to compress/de-dupe them. Depending on how many reads you have, you can then use a pileup/usearch to generate a consensus.
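As a rough illustration of the consensus step, a per-position majority vote over one group of duplicate reads might look like the Python sketch below. It assumes the reads in a group are already aligned to the same start and have equal length, and the `min_fraction` cut-off is a made-up parameter for illustration; it is not what a pileup tool or usearch does internally.

```python
from collections import Counter


def consensus(reads, min_fraction=0.6):
    """Majority-vote consensus for equal-length, co-aligned reads.

    ASSUMPTION: all reads start at the same position and have the same length.
    A position is called 'N' when the most common base is supported by less
    than min_fraction of the reads (hypothetical threshold).
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")

    called = []
    for column in zip(*reads):  # one tuple of bases per position
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(reads) >= min_fraction else "N")
    return "".join(called)


# Three duplicates of one fragment, one with a sequencing error at position 3
print(consensus(["ACGT", "ACGT", "ACTT"]))
```

A real implementation would probably also weight the vote by base quality and handle indels, which this sketch ignores.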