My usual pipeline for SNP calling is to run MarkDuplicates and then GATK HaplotypeCaller. I read somewhere on the GATK forum that HaplotypeCaller ignores reads that are marked as duplicates, but I think there is something to be learned from the duplicate reads. Ultimately they are re-sequencings of the same original fragment, so in theory you could use all the duplicate reads to build a consensus read sequence and then call SNPs more accurately.
I think this is something that some SNP callers must take advantage of, but I can't find any with the keywords I have been searching for. Are there SNP tools that take the marked duplicate reads into consideration? Or are there tools that will build consensus reads from duplicate reads, which you can then feed into any SNP caller?
Many thanks in advance!
I'd have to check, but if I were to write the HaplotypeCaller I'd have it include duplicates in the graph creation step and then ignore them for the realignment step. If the marked duplicates were not excluded during realignment, the HMM would produce aberrant homozygous calls. You might have to check the GATK code to see whether this is how it works internally.
BTW, the HaplotypeCaller documentation says that it processes reads through the DuplicateRead filter first, so perhaps it doesn't do this. That should make things marginally faster. Realistically, I'd be surprised if including the marked duplicates appreciably changes the results. In order for the haplotype calling to be accurate, there'd need to be high enough coverage that excluding the marked duplicates should result in an (essentially) equivalent graph.
I'll also note that including the duplicates creates problems if you want to filter out nodes based on some minimum observation count (e.g., storing the graph in a count-min sketch and then only including nodes with a minimum count). I presume someone at the Broad ran some experiments to test all of this (at least I hope so!).
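To make the thresholding concern concrete, here is a minimal Python sketch. It is a toy and not how GATK works internally: an exact Counter stands in for a count-min sketch, and the reads, k, min_count, and function names are all made up for illustration. The point is only that a single sequencing error, once copied by duplicates, can push its k-mers over a minimum-count filter.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across reads (an exact Counter stands in for a count-min sketch)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(reads, k, min_count):
    """Return the set of k-mers observed at least min_count times."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n >= min_count}

# Toy data (hypothetical): two reads of the true sequence, treated here as
# coming from independent fragments, plus one read carrying a sequencing
# error (G instead of T at index 4) that also has two duplicates.
true_read = "ACGTTGCA"
error_read = "ACGTGGCA"
dedup = [true_read, true_read, error_read]
with_dups = dedup + [error_read, error_read]

k, min_count = 4, 2

# k-mers that exist only because of the sequencing error
error_only = set(kmer_counts([error_read], k)) - set(kmer_counts([true_read], k))

# With duplicates excluded, the error k-mers are seen once and get filtered out.
print(sorted(error_only & solid_kmers(dedup, k, min_count)))
# With duplicates included, the same error k-mers are seen three times and survive.
print(sorted(error_only & solid_kmers(with_dups, k, min_count)))
```

With the duplicates left in, the erroneous k-mers look as well supported as the real ones, which is the background-noise problem a minimum-count filter is supposed to remove.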
Edit: I should mention that I wrote a bisulfite-sequencing indel realigner that works similarly to the HaplotypeCaller, sans the HMM realignment (I use a different method). There, I currently exclude marked duplicates from the whole process since not doing so is more likely to just increase the background k-mer noise than to ensure that the real haplotype paths are represented in the de Bruijn graph. I suppose I should run a check of this to compare the differences.
Since all duplicate reads have the same sequence, there would be no consensus sequence to build.
@rbagnall I think duplicates as marked by Picard MarkDuplicates are reads sharing the same position, regardless of whether they have the same sequence.
Duplicate reads will have the same start coordinate, but may contain different mismatches in the alignment mostly due to sequencing errors, so they may not have the same sequence.
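If duplicates can differ by sequencing errors, a consensus is at least possible in principle. Below is a rough Python sketch of the idea, not an existing tool: it groups reads by (chromosome, start, strand), which is a simplification of how duplicates are really identified, and takes a per-position majority vote. It assumes all reads in a group have the same length with no indels or clipping, ignores base qualities, and the function names and read tuples are hypothetical.

```python
from collections import Counter, defaultdict

def consensus(seqs):
    """Majority vote per position; ties become 'N'. Assumes equal-length, gap-free reads."""
    out = []
    for column in zip(*seqs):
        ranked = Counter(column).most_common(2)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("N")  # no clear majority at this position
        else:
            out.append(ranked[0][0])
    return "".join(out)

def collapse_duplicates(reads):
    """Group reads by (chrom, start, strand) and return one consensus per group."""
    groups = defaultdict(list)
    for chrom, start, strand, seq in reads:
        groups[(chrom, start, strand)].append(seq)
    return {key: consensus(seqs) for key, seqs in groups.items()}

# Toy reads (hypothetical): three duplicates at chr1:100, one of which has a
# mismatch at index 3, plus an unrelated read at chr1:250.
reads = [
    ("chr1", 100, "+", "ACGTACGT"),
    ("chr1", 100, "+", "ACGTACGT"),
    ("chr1", 100, "+", "ACGAACGT"),
    ("chr1", 250, "+", "TTGCAAGC"),
]

collapsed = collapse_duplicates(reads)
for key, seq in sorted(collapsed.items()):
    print(key, seq)
```

A real implementation would need to work from the alignments (CIGAR strings, mates, base qualities) rather than raw strings, so treat this only as an illustration of the voting step.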
If that is true, how are you able to find heterozygous variants?
My guess would be that the sequence has to be the same. Edit: because you do shotgun sequencing. My new guess is you are probably right.