Different estimates of nucleotide diversity (pi) from two pipelines: pixy vs vcftools
4.0 years ago
nitinra ▴ 50

Hello all,

I am trying to calculate nucleotide diversity for 192 samples, and I have used both vcftools and pixy to do so. However, the results from the two pipelines differ substantially. Is there a way to evaluate which one gives the accurate estimate of nucleotide diversity?

Here is the pipeline I used:

vcftools --vcf input.vcf --max-missing 0.1 --minQ 30 --maf 0.1 --remove lowdepthindividuals --recode --recode-INFO-all --out output_filtered.vcf
bcftools +prune -l 0.2 -w 50kb output_filtered.vcf -Ov -o output_filtered_ldpruned.vcf

Pi calculations: VCFtools:

vcftools --vcf output_filtered_ldpruned.vcf --window-pi 10000 --out pi

Pixy:

pixy --stats pi --vcf output_filtered_ldpruned.vcf --zarr_path ./zarr \
--window_size 10000 --populations allpop.list --bypass_filtration yes \
    --bypass-invariant-sites yes --outfile_prefix results/combined

The pi estimates from VCFtools range from 0 to 0.020, whereas the estimates from pixy range from 0.1 to 0.4. What could be causing the discrepancy between the two methods?

vcftools nucleotide diversity pixy • 2.4k views
2.4 years ago
Sumaya • 0

From what I have read, vcftools treats missing data as invariant genotyped bases (i.e. homozygous for the reference allele), which biases the estimate. pixy does not have this problem because it excludes missing data. Have a look at the pixy paper: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13326
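
To illustrate why the choice of denominator matters so much, here is a minimal Python sketch. It is not the code of either tool. It assumes that a vcftools-style window estimate treats every position absent from the VCF as invariant and divides by the full window length, and that a pixy-style estimate divides summed pairwise differences by summed pairwise comparisons over only the sites actually present. The function names and the allele counts are hypothetical.

```python
from math import comb


def site_differences(n_ref, n_alt):
    """Return (pairwise differences, pairwise comparisons) for one biallelic site."""
    n = n_ref + n_alt
    return n_ref * n_alt, comb(n, 2)


def pi_over_full_window(sites, window_size):
    """Sum per-site pi and divide by the window length.

    Every position not listed in `sites` is implicitly counted as invariant
    (assumed behaviour of a variant-only, window-based estimate).
    """
    total = 0.0
    for n_ref, n_alt in sites:
        diffs, comps = site_differences(n_ref, n_alt)
        if comps:
            total += diffs / comps
    return total / window_size


def pi_over_genotyped_sites(sites):
    """Summed differences over summed comparisons, using only the listed sites.

    If the input holds variant sites only, invariant positions never enter
    the denominator (assumed behaviour when invariant sites are bypassed).
    """
    diffs = comps = 0
    for n_ref, n_alt in sites:
        d, c = site_differences(n_ref, n_alt)
        diffs += d
        comps += c
    return diffs / comps if comps else float("nan")


# Hypothetical data: three variant sites in a 10 kb window, 10 chromosomes each,
# given as (reference allele count, alternate allele count).
sites = [(6, 4), (5, 5), (8, 2)]
pi_window = pi_over_full_window(sites, window_size=10_000)
pi_sites = pi_over_genotyped_sites(sites)
print(f"divide by window length:   {pi_window:.6f}")
print(f"divide by genotyped sites: {pi_sites:.6f}")
```

With the same three sites, the two denominators give values that differ by several orders of magnitude. Under these assumptions, a VCF that contains only variant sites (for example after a MAF filter) would be expected to give very high per-site values when invariant sites are left out of the denominator.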
