I have several individual sequence data sets with very low average read depth (i.e. < 5) across most sites that cannot be further improved upon with repeated rounds of sequencing.
Consequently, I'm struggling with determining the best way to call SNPs and genotype. I've been using GATK's UnifiedGenotyper for legacy reasons, but the concerns raised would apply to newer programs such as HaplotypeCaller
Specifically, even though these are diploid Drosophila genomes, with read depths of 2-3, calling diploid genotypes is too error-prone to pursue. My first thought was to set the -ploidy argument to 1 rather than 2, i.e. to treat this as a haploid genotyping problem.
However, that would be problematic as well, since for heterozygous loci, a SNP site would be identified as indeterminate and set to "R" rather than to a nucleotide.
What I need is a program that would do the following: suppose that at a site I have read depth 4, with 3 A's and 1 G identified (with, reference C). I want A to be returned as the "haploid" genotype. If there are 2 A's and 2 G's, either A or G should be returned at random. In other words, if there are more than two nucleotides, return the more abundant of the two, or if counts are equal, select one or the other at random (whether it matches the reference allele or ot).
This would do what UnifiedGenotyper (or HaplotypeCaller) would do for ploidy = 1 at homozygous sites while returning an allele at heterozygous sites rather than treating the site as having an unknown genotype.
Is there a way to do this (or something similar) using GATK's tools, and if not, what tools may be available (using only bam files and a reference genome as input)? I suppose one approach may be to run the variant caller as diploid and then write my own script to convert that into the desired output, but I'm hoping for a more efficient method.
My original idea was to use the GATK diploid genotype caller with a low genotyping quality threshold (I have read depths around 5, sometimes lower). However, it was brought to my attention that even with a low quality threshold, this would lead to a systematic bias against identifying SNPs at homozygous vs. heterozygous variant sites, because e.g. a case of 3 A, 1 G may be identified as "N" rather than genotyped as A/G, whereas 4 A's (if A is a variant) would be identified as a homozygous SNP.
To eliminate this systematic bias, I need something that will just return nucleotide counts at each site rather than make genotype calls.
That's not right, the GATK caller doesn't substitute Ns instead of alternate bases. What can happen is that the caller gives the genotype call a very low probability of being heterozygous, and sometimes doesn't even emit a variant call if the probability is very low (which will depend a lot on the base quality scores). This is tunable, so you can get variant calls emitted even for very poorly supported potential hets (like the 4:1 allele balance scenario you mention).
But to be frank, if a call is so poorly supported that the caller doesn't emit it with default settings, you can't really trust it either way. You have no way to know whether that one G is a real alternate allele from a het site, or just a preparation or sequencing artifact. There are a LOT of artifacts in an average sequencing run, and if you were to accept that every site where there is one alternate base is a heterozygous SNP, your analysis would produce some very questionable results.
You're not wrong that there can be real systematic bias in exome sequencing, but that's more a function of the data generation technology that leads to having such lopsided allele balance in the reads. It's very difficult to fix that at the variant calling stage -- in fact that's why we use sophisticated probability modeling methods rather than just counting nucleotides in a pileup.
Generally, you may need to make some difficult decisions about what information it is possible to extract from your dataset. Depending on the purpose of your analysis, you may want to limit it to sites where you can make confident calls. Good luck.