Hello,
I have been analyzing some resequencing data (8 genomes from a non-model organism) and I've been trying to filter SNPs by quality. The organism is haploid (I'm pretty sure), but I noticed a small proportion of heterozygous SNP calls after bwa alignment to the reference genome and fairly standard variant calling with samtools mpileup. When I looked at the putative heterozygous positions in IGV, I saw an (apparently) very obvious explanation for them: they were all called at positions with incredibly high coverage (e.g. 2000X, compared to 50-100X across other contigs). I guess these are reads from repetitive loci, of which there are quite a few in the genome. (duplicate reads were removed as part of the initial samtools pipeline).
I wanted to filter out these cases to identify a set of high-quality SNPs for variant score recalibration. However, the DP scores (and DP4 scores) for these spurious "heterozygous" sites were barely any higher than for other homozygous sites which had much lower coverage as visualized by IGV. (mean DP = 32 for "heterozygous", 24 for "homozygous).
Using bedtools coverage on the initial VCF files, I got coverage values corresponding to the massive peaks seen in IGV (2000+). There was a massive difference in bedtools coverage scores between hetero and homo calls - median 1664 (hetero) vs 260 (homo). So I was able to use these values to easily filter out the spurious high-coverage cases.
My question: why were the DP values so low for these sites, and is there a way to access this info. (which turned out to be very useful) for filtering SNPs using mpileup?
A follow-up question: I noticed that the coverage for homozygous SNPs (median 260) was a good deal higher than for positions matching the reference assembly (which were in the 50-100x range) [and the heterozygous SNPs were even higher]. Is this worrisome? The organism is a crazy parasite with low GC content - about 25-30% at most. Could this result from less obvious cases of reads from other loci mapping to the wrong place and causing false-positive SNP calls?
Yes, they were filtered. Perhaps this is why they were so low.