Question

Are reads with the same length and position always "duplicates"

0

Entering edit mode

5.2 years ago

elcortegano ▴ 200

I'm struggling to understand that. I have some BAM files that have been processed with Picard tool MarkDuplicates, both using the REMOVE_DUPLICATES flag and not. In both cases, I can see in IGV some reads with the exactly same length, position, orientation and composition in bases. Why haven't been removed by picard?

It is important to note that these reads have different names, but I cannot understand how Picard does not take these as duplicates, aren't they?

EDIT Important also to mention that these "duplicates" are evident with the BAM files generated by GATK HaplotypeCaller (which runs local realignments). With BAM files at previous stages, this does not happen.

EDIT 2 For clarity, I will describe with a little more detail the process followed.

Variants have been generated as:

BAM -> HaplotypeCaller -> CombineGVCFs -> GenotypeGVCFs -> VCF

Variants are visualized in IGV software together with the original BAMs. Because there is some discrepancy between the variants reported in the VCF and the visualization in IGV (specially for INDELs) y generated a second set of BAM files:

BAM -> HaplotypeCaller -bamput --force-active --disable-optimizations -> BAM*

The problem now is that when I use IGV to visualize VCF with this second set of BAM* files, there are clearly more reads, some of them looking like "duplicates".

After using Picard MarkDuplicates, these won't disapear, even with the option REMOVEE_DUPLICATES=true. More weirdly, if I then use FixMateInformation on the BAM* files after markind duplicates, the resultant BAM files will be empty. No idea on what is going on.

next-gen picard • 2.0k views

ADD COMMENT • link 5.2 years ago by elcortegano ▴ 200

3

Entering edit mode

Are you looking at these reads in IGV or just in the alignment file? They may have the same length/position but that may be because they were soft-clipped by the aligner.

On a different note you can try clumpify.sh to identify PCR/Optical duplicates in an alignment independent manner. More here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

ADD REPLY • link 5.2 years ago by GenoMax 148k

0

Entering edit mode

Thank you, I will try that

ADD REPLY • link 5.2 years ago by elcortegano ▴ 200

0

Entering edit mode

Same orientation? Same base for base?

ADD REPLY • link 5.2 years ago by swbarnes2 14k

0

Entering edit mode

yes, thank you for noticing, I will update the post

ADD REPLY • link 5.2 years ago by elcortegano ▴ 200

0

Entering edit mode

Is this RNA-Seq or DNA-Seq? In case of RNA, is it whole transcritome, targeted, or 3' seq?

Please find a good explanation why you see duplications here.

Without proper molecular indexes, it's hard distinguishing between gene expression caused overrepresentation and PCR duplicates.

For PCR duplicates, check also the numbers of PCR cycles. Is the stacking of reads explainable by this number?

Cheers, Michael

ADD REPLY • link 5.2 years ago by michael.ante ★ 3.9k

score 2 · Answer 1 · 2019-10-16

2

Entering edit mode

5.2 years ago

Brice Sarver ★ 3.8k

While a bunch of single reads aligning to the exact same position is evidence for a set of duplicates, there are situations where this may not be the case for paired-end reads. Imagine you have a pair where you have the same forward read but different reverse reads. This could happen if you're capturing a particular target but the fragment size differs. In this case, you're sequencing independent molecules, and the reads aren't true duplicates.

Every read has a different name, so this isn't evidence one way or the other.

Could this be the case with your data? Do you have PE data, and what's your expected insert size?

ADD COMMENT • link 5.2 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

Thank you for the reply, it helps me understand this is something might actually be happening. These "duplicates" typically have a length of 40bp in my dataset, but do not know exactly what is that expected insert size (I'm pretty new with this). Looking other reads, I'd say that these duplicates are generally smaller than the average.

ADD REPLY • link 5.2 years ago by elcortegano ▴ 200

0

Entering edit mode

How long are your reads? 40bp post-cleaning is pretty small. Genomax' comment above about soft clipping may be a factor here.

ADD REPLY • link 5.2 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

I have made an important edit in the post. I forgot to say that these "duplicates" are seen in the BAM files generated by HaplotypeCaller (which performs local realignments). In other files, MarkDuplicates seem to be more successful and I do not see these duplicates

ADD REPLY • link 5.2 years ago by elcortegano ▴ 200

1

Entering edit mode

Are you using the -bamOutput option from HaplotypeCaller to write haplotypes? HaplotypeCaller will do local reassembly of variable regions but typically writes to gVCFs or VCFs by default.

ADD REPLY • link 5.2 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

The haplotypes were generated without that option, and the resultant gVCFs files were combined for calling variants with CombineGVCFs and GenotypeGVCFs. In that moment, I realized that some variants in the VCF (mostly indels) do not match the positions shown in IGV, when using the BAMs that were used by HaplotypeCaller. That is why now I'm using -bamout output together with --force-active and --disable-optimizations to generate BAM files that include the local realignments made by GATK. However, this last BAM files seem to have more reads than before, and many of them look like "duplicates", that seem not being removed by MarkDuplicates

ADD REPLY • link 5.2 years ago by elcortegano ▴ 200