How does vcftools account for missing data when calculating nucleotide diversity?
2.0 years ago
Jimmy ▴ 30

I want to use vcftools to calculate nucleotide diversity across a set of individuals in a VCF file, using the --site-pi and --window-pi options. However, my VCF, like most, has missing data at some genomic sites. I can't find out how vcftools accounts for missing data when calculating nucleotide diversity. For example, if 5% of individuals have a missing genotype at position 1000 on chromosome 1, does vcftools drop the entire site from the calculation? Does it calculate nucleotide diversity using only the non-missing genotypes (which is what I want)? Or does it count the missing genotypes as variants?

2.0 years ago
Jimmy ▴ 30

OK, so I've looked into this further, and I think this paper provides an answer: "pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data" (2021). According to the paper, it looks like missing sites, and especially missing genotypes within a site, bias vcftools estimates of nucleotide diversity.
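
To make the "non-missing genotypes only" calculation from the question concrete, here is a minimal Python sketch. It is not vcftools or pixy code: the function names and the input format (one list of allele indices per site, with `None` for a missing allele) are assumptions made for illustration. The windowed version adds up differences and comparisons across sites before dividing, which is my understanding of the general approach the pixy paper describes.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Hypothetical input format: one allele index per chromosome copy,
# with None marking a missing allele (e.g. "./." in the VCF).
Alleles = Sequence[Optional[int]]


def site_diffs_and_comps(alleles: Alleles) -> Tuple[int, int]:
    """Return (pairwise differences, pairwise comparisons) among called alleles."""
    called = [a for a in alleles if a is not None]
    n = len(called)
    comps = n * (n - 1) // 2  # number of pairs among non-missing alleles
    # Pairs carrying the same allele contribute no difference.
    same = sum(c * (c - 1) // 2 for c in Counter(called).values())
    return comps - same, comps


def site_pi(alleles: Alleles) -> Optional[float]:
    """Per-site pi over non-missing alleles; None if fewer than two are called."""
    diffs, comps = site_diffs_and_comps(alleles)
    return diffs / comps if comps else None


def window_pi(sites: Sequence[Alleles]) -> Optional[float]:
    """Window pi as summed differences over summed comparisons across sites."""
    total_diffs = 0
    total_comps = 0
    for alleles in sites:
        diffs, comps = site_diffs_and_comps(alleles)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else None


if __name__ == "__main__":
    variant_site = [0, 0, 1, 1, None, None]  # one diploid individual missing
    invariant_site = [0, 0, 0, 0, 0, 0]      # fully genotyped, monomorphic
    print(site_pi(variant_site))
    print(window_pi([variant_site, invariant_site]))
```

In this sketch a site with missing genotypes simply contributes fewer comparisons to the denominator, so it is neither thrown out nor treated as variant. Note that invariant sites must be passed in explicitly; a site absent from the input contributes nothing to either sum.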
