Question

Is there software that takes advantage of MarkedDuplicates for SNP calling?

10.3 years ago

shuelga ▴ 20

My usual pipeline for SNP calling is to run MarkDuplicates and then GATK HaplotypeCaller. I have read somewhere on the GATK forum that HaplotypeCaller ignores reads that are marked as duplicates; however, I think there is something to be learned from the duplicate reads. Ultimately they are a re-sequencing of the same read, so in theory you can use all the duplicate reads to build a consensus read sequence and then call SNPs more accurately.

I think this is something that some SNP-calling software must take advantage of, but I can't seem to find any with the keywords I have been using to search. Are there SNP tools that take the marked duplicate reads into consideration? Or are there any tools available that will build consensus reads from duplicate reads, which you can then feed into any SNP caller?

Many thanks in advance!

SNP MarkDuplicates Consensus Sequence GATK • 3.0k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by shuelga ▴ 20

I'd have to check, but if I were to write the HaplotypeCaller I'd have it include duplicates in the graph creation step and then ignore them for the realignment step. Were that not done (i.e., the marked duplicates not excluded during realignment), the HMM would produce aberrant homozygous calls. You might have to check the GATK code to see if this is how it works internally or not.

BTW, the HaplotypeCaller documentation says that it processes reads through the DuplicateRead filter first, so perhaps it doesn't do this. That should make things marginally faster. Realistically, I'd be surprised if including the marked duplicates appreciably changes the results. In order for the haplotype calling to be accurate, there'd need to be high enough coverage that excluding the marked duplicates should result in an (essentially) equivalent graph.

I'll also note that including the duplicates creates problems if you want to filter out nodes based on some minimum observance threshold (e.g., storing the graph in a count-min sketch and then only including nodes with a minimum count). I presume someone at the Broad ran some experiments to test all of this (at least I hope so!).
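To make the thresholding concern concrete, here is a minimal sketch, not GATK's implementation: a toy count-min sketch that counts k-mers and applies a minimum-count filter. The class, the helper names, the reads, and the `MIN_COUNT` value are all assumptions made up for illustration. It assumes an error that is shared by every copy of a duplicated read (for example, one introduced during amplification), so the copies inflate the count of the erroneous k-mer.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: estimates never undercount, but may overcount."""

    def __init__(self, width=4096, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, derived from SHA-256 (illustrative choice)
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += 1

    def estimate(self, item):
        return min(self.rows[row][self._index(item, row)]
                   for row in range(self.depth))


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k=5):
    """Count every k-mer of every read in a fresh sketch."""
    sketch = CountMinSketch()
    for seq in reads:
        for kmer in kmers(seq, k):
            sketch.add(kmer)
    return sketch


# Hypothetical reads: two independent fragments with the true sequence,
# plus one erroneous read that is present as three duplicate copies.
true_read = "ACGTACGGTCA"
error_read = "ACGTATGGTCA"   # single C -> T error
error_kmer = "GTATG"         # k-mer spanning the error

with_dups = count_kmers([true_read, true_read] + [error_read] * 3)
without_dups = count_kmers([true_read, true_read, error_read])

MIN_COUNT = 2  # assumed minimum-observance threshold
print("kept with duplicates:   ", with_dups.estimate(error_kmer) >= MIN_COUNT)
print("kept without duplicates:", without_dups.estimate(error_kmer) >= MIN_COUNT)
```

With the duplicate copies counted, the erroneous k-mer is seen at least three times and survives the filter; counted once, it would normally fall below the threshold.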

Edit: I should mention that I wrote a bisulfite-sequencing indel realigner that works similarly to the HaplotypeCaller, sans the HMM realignment (I use a different method). There, I currently exclude marked duplicates from the whole process since not doing so is more likely to just increase the background k-mer noise than to ensure that the real haplotype paths are represented in the de Bruijn graph. I suppose I should run a check of this to compare the differences.

ADD REPLY • link 10.3 years ago by Devon Ryan 105k

Since all duplicate reads have the same sequence, there would be no consensus sequence to build.

ADD REPLY • link 10.3 years ago by rbagnall ★ 1.8k

@rbagnall I think duplicates as marked by Picard MarkDuplicates are reads sharing the same position, regardless of whether they have the same sequence.
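As a rough illustration of that idea (a simplified sketch, not Picard's actual logic), the snippet below groups reads by chromosome, 5' position, and strand, with the sequence deliberately left out of the key. The read tuples and the `group_by_position` helper are hypothetical; the real tool also accounts for things like soft-clipping, mate positions for pairs, and library.

```python
from collections import defaultdict

# Hypothetical minimal read records: (name, chrom, five_prime_pos, strand, sequence)
reads = [
    ("r1", "chr1", 1000, "+", "ACGTACGTAC"),
    ("r2", "chr1", 1000, "+", "ACGTACGTAC"),
    ("r3", "chr1", 1000, "+", "ACGTTCGTAC"),  # same position, one mismatch
    ("r4", "chr1", 1000, "-", "ACGTACGTAC"),  # opposite strand: separate group
    ("r5", "chr1", 1042, "+", "GGCTAGCTAA"),
]


def group_by_position(reads):
    """Group reads by (chrom, 5' position, strand); sequence is not part of the key."""
    groups = defaultdict(list)
    for name, chrom, pos, strand, seq in reads:
        groups[(chrom, pos, strand)].append((name, seq))
    return dict(groups)


groups = group_by_position(reads)
for key, members in groups.items():
    print(key, [name for name, _ in members])
```

Here `r3` lands in the same group as `r1` and `r2` even though its sequence differs by one base.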

ADD REPLY • link 10.3 years ago by dariober 15k

Duplicate reads will have the same start coordinate, but may contain different mismatches in the alignment mostly due to sequencing errors, so they may not have the same sequence.
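A minimal sketch of what building a consensus from such a duplicate group could look like, assuming equal-length, gap-free reads and a plain per-column majority vote. The `consensus` function and the example reads are made up for illustration; a real tool would also need to handle indels and clipping, and would probably weight bases by quality.

```python
from collections import Counter


def consensus(seqs):
    """Per-column majority vote over equal-length reads; ties become 'N'."""
    if not seqs:
        raise ValueError("need at least one read")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("this sketch assumes equal-length, gap-free reads")
    out = []
    for column in zip(*seqs):
        ranked = Counter(column).most_common()
        # A tie between the two most frequent bases is left ambiguous
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("N")
        else:
            out.append(ranked[0][0])
    return "".join(out)


# Hypothetical duplicate group: the third read carries one sequencing error
duplicates = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
print(consensus(duplicates))
```

Note that a vote like this only corrects errors if the group really is copies of one molecule. If reads from two different alleles happen to share a start coordinate, the minority allele would be voted away.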

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by shuelga ▴ 20

If that is true, how are you able to find heterozygous variants? ~~My guess would be that the sequence has to be the same~~.

Edit: Because you do shotgun sequencing. My new guess is that you are probably right.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Zaag ▴ 870