Hi all,
I need to filter out SNPs with excess heterozygotes from a large vcf to get reliable estimates of heterozygosity and Fis.
My current estimates are obviously too high, and I think I might have potential paralogues mapped on top of each other or other mapping issues. However, I'm struggling to find a good way to filter out SNPs with excess heterozygotes from a big vcf (millions of SNPs) using already existing tools and scripts. Any advice on how I can solve this?
I have been thinking that I could either:
filter out SNPs showing fixed heterozygosity on a population basis
filter out SNPs based on Fis calculated per SNP (Fis can be used as an indication of poor mapping like explained here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531992-Inbreeding-Coefficient )
filter out SNPs based on a fixed proportion of heterozygotes over the whole dataset (like only keeping SNPs that have proportion of heterozygotes less than 0.75 or something).
The VCF has already been filtered twice - once according to GATK’s best practices quality recommendations: QualbyDepth (QD) < 2.0, RMSMappingQuality (MQ) < 40.0, FisherStrand (FS) > 60.0, MappingQualityRankSumTest (MQRankSum) < -12.5, and ReadPosRankSumTest (ReadPosRankSum) < -8.0.
Then we also used vcftools to filter the VCF additionally by removing indels (--remove-indels), only retain bi-allelic SNPs (--min-alleles 2 --max-alleles 2), SNPs with a minimum quality of 40 (--minQ 40), maximum missingness of 0.9 (--max-missing 0.9), minimum depth > 15 (--min-meanDP 15, --minDP 15), and maximum depth < 60 (--max-meanDP 60, --maxDP 60). I think the maximum depth filtration gets read of possible mapping issues caused by repeats, but perhaps not paralogues.
Super grateful for any advice as I have been stuck on this problem for a while 🙈
p.s. After posting this I discovered a Biostar post on a similar topic, which might be useful if I take the third approach? Remove sites with heterozygote excess with VCFtools
After posting this, I found a way to filter out SNPs that were showing patterns of fixed heterozygosity in populations. Basically I just divided my VCF into population specific VCFs and then found all SNPs with fixed heterozygosity using grep like this:
0/1 is the heterozygote pattern and 10 is the number of individuals in the pop. I concatenated all the info on fixed heterozygous SNPs from all populations, and used this to filter the main VCF using inverted grep (using the chromosome + position info).
The chr.pos.fixed.heterozous.SNPs.vcf was just a file with the two first columns of the concatenated fixed.heterozous.SNPs.vcf giving chromosome and position. Not very elegant, but it seemed to be working.
Only problem is that my populations were quite small, so I might need to come up with some better filtering criterion.