Wondering about pros/cons of removing duplicates from the raw fastq files vs the raw BAM alignment? Thanks.
Basically duplicates are of two kinds:

- natural duplicates: distinct library fragments that just happen to share the same start and end coordinates (increasingly likely at high coverage or for short, highly expressed transcripts);
- artificial duplicates: multiple copies of the same original fragment created by PCR amplification or by optical/clustering artifacts on the instrument.

Of course, we'd want to keep the first kind of duplicate and remove the second kind. But rarely if ever is a clear distinction possible between the two situations. Hence the conundrum.
While we are at it, an empirical observation I have made is that data with high rates of artificial duplication is often useless even after this problem is fixed. Many other problems turn up, so it does not really matter what you do with it - it remains useless.
In general, from what I understand, people tend to deduplicate their data when uniform coverage is expected across the genome and when the coverage at a given position has major implications for the results. For example, in SNP calling the number of reads supporting a variant is an essential factor in deciding whether to trust that variant, and we'd want to avoid counting artificial duplicates there.
In most other cases, and especially when the expected coverage varies wildly and there are legitimate reasons for a fragment to occur very frequently (e.g. a highly expressed short transcript in a transcriptome study), duplicate removal is not recommended.
Less work if you dedupe up front.
clumpify.sh (Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.) from BBMap does this without a need for alignments. Though personally I greatly prefer Clumpify for duplicate removal, mapping-based approaches can be more robust to reads with lots of errors (if you consider those duplicates). But in addition to the increased time, mapping-based removal also has the disadvantage of a lossy conversion to sam/bam format, which typically chops off part of the original read header (everything after the first whitespace).
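For reference, a typical Clumpify dedupe run looks something like this (a minimal sketch; the file names are placeholders, and dedupe=t is what actually switches on duplicate removal):

    # remove duplicates straight from the fastq, no alignment needed
    clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe=t
    # add optical=t to restrict removal to optical/clustering duplicates only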
I think some mapping-based deduplication tools may not be robust to read pairs whose mates map to different chromosomes, or to pairs where only one read is mapped, and certainly not when neither read is mapped. I wrote a mapping-based deduplication program that handles duplicates in the first two scenarios, but as a result it uses a lot of memory. My recollection was that one of samtools or GATK handled duplicates of pairs mapped to different chromosomes and the other didn't. And as for unmapped reads - some aligners will not map reads that have a lot of adapter sequence, even if they came from the correct genome, so those short-insert reads would not be deduplicated based on mapping the raw reads.
Multi-mapping reads can also pose a problem for mapping-based deduplication methods, depending on how the aligner handles ambiguity (e.g. non-deterministic assignment is common), as can split alignments, which are produced by some aligners.
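For comparison, mapping-based marking is usually run on a coordinate-sorted BAM, roughly along these lines (a minimal sketch with placeholder file names, using Picard's classic I=/O=/M= argument style):

    # sort the alignments, then mark duplicates
    samtools sort -o sorted.bam aligned.bam
    java -jar picard.jar MarkDuplicates I=sorted.bam O=marked.bam M=dup_metrics.txt
    # add REMOVE_DUPLICATES=true to drop the duplicates instead of just flagging them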
To support your contention, Picard misses PE duplicates with mates mapping to different chromosomes.
Ah, thanks, Picard was indeed what I was thinking of.
Have you got a reference for that? I've read that Picard's MarkDuplicates can handle inter-chromosomal pairs and samtools rmdup cannot:
Plain observation. I've recently been improving the duplicate marking in deepTools, and this is one of the few sources of difference between it and Picard in the output. So even if they document catching them, they don't always.
Interesting and quite surprising; I'll double-check my data.