I'm struggling to understand that. I have some BAM files that have been processed with Picard tool MarkDuplicates, both using the REMOVE_DUPLICATES flag and not. In both cases, I can see in IGV some reads with the exactly same length, position, orientation and composition in bases. Why haven't been removed by picard?
It is important to note that these reads have different names, but I cannot understand how Picard does not take these as duplicates, aren't they?
EDIT Important also to mention that these "duplicates" are evident with the BAM files generated by GATK HaplotypeCaller (which runs local realignments). With BAM files at previous stages, this does not happen.
EDIT 2 For clarity, I will describe with a little more detail the process followed.
Variants have been generated as:
BAM -> HaplotypeCaller -> CombineGVCFs -> GenotypeGVCFs -> VCF
Variants are visualized in IGV software together with the original BAMs. Because there is some discrepancy between the variants reported in the VCF and the visualization in IGV (specially for INDELs) y generated a second set of BAM files:
BAM -> HaplotypeCaller -bamput --force-active --disable-optimizations -> BAM*
The problem now is that when I use IGV to visualize VCF with this second set of BAM* files, there are clearly more reads, some of them looking like "duplicates".
After using Picard MarkDuplicates, these won't disapear, even with the option REMOVEE_DUPLICATES=true. More weirdly, if I then use FixMateInformation on the BAM* files after markind duplicates, the resultant BAM files will be empty. No idea on what is going on.
Are you looking at these reads in IGV or just in the alignment file? They may have the same length/position but that may be because they were soft-clipped by the aligner.
On a different note you can try
clumpify.sh
to identify PCR/Optical duplicates in an alignment independent manner. More here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq FilesThank you, I will try that
Same orientation? Same base for base?
yes, thank you for noticing, I will update the post
Is this RNA-Seq or DNA-Seq? In case of RNA, is it whole transcritome, targeted, or 3' seq?
Please find a good explanation why you see duplications here.
Without proper molecular indexes, it's hard distinguishing between gene expression caused overrepresentation and PCR duplicates.
For PCR duplicates, check also the numbers of PCR cycles. Is the stacking of reads explainable by this number?
Cheers, Michael