Hello all,
I am trying to calculate nucleotide diversity (pi) for 192 samples, and I have used both VCFtools and pixy to do so. However, the two tools give very different results. Is there a way to determine which one gives the accurate estimate of nucleotide diversity?
Here is the pipeline I used:
vcftools --vcf input.vcf --max-missing 0.1 --minQ 30 --maf 0.1 --remove lowdepthindividuals --recode --recode-INFO-all --out output_filtered.vcf
bcftools +prune -l 0.2 -w 50kb output_filtered.vcf -Ov -o output_filtered_ldpruned.vcf
Pi calculations: VCFtools:
vcftools --vcf output_filtered_ldpruned.vcf --window-pi 10000 --out pi
Pixy:
pixy --stats pi --vcf output_filtered_ldpruned.vcf --zarr_path ./zarr \
--window_size 10000 --populations allpop.list --bypass_filtration yes \
--bypass-invariant-sites yes --outfile_prefix results/combined
The pi estimates from VCFtools range from 0 to 0.020, whereas the estimates from pixy range from 0.1 to 0.4. What could be causing the discrepancy between the two methods?
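To illustrate the kind of difference I suspect, here is a minimal Python sketch of how the choice of denominator alone could change windowed pi. It rests on assumptions rather than on a confirmed description of either tool's internals: that one approach divides summed per-site diversity by the full window length (treating every site not in the VCF as invariant), and that the other divides total pairwise differences by total comparisons over only the sites present in a variants-only VCF. The allele counts and function names are made up for illustration; only the 192 samples (384 chromosomes) and the 10000 bp window come from my run.

```python
def site_differences(alt_count, n_chrom):
    """Return (pairwise differences, pairwise comparisons) for one biallelic site."""
    differences = alt_count * (n_chrom - alt_count)
    comparisons = n_chrom * (n_chrom - 1) // 2
    return differences, comparisons


def window_pi_fixed_length(alt_counts, n_chrom, window_size):
    """Sum per-site pi, then divide by the full window length.

    Assumption: every position without a record is an invariant site with pi = 0.
    """
    total = 0.0
    for alt_count in alt_counts:
        differences, comparisons = site_differences(alt_count, n_chrom)
        total += differences / comparisons
    return total / window_size


def window_pi_sites_present(alt_counts, n_chrom):
    """Divide total differences by total comparisons over the supplied sites only.

    Assumption: with a variants-only VCF, the supplied sites are all SNPs.
    """
    total_differences = 0
    total_comparisons = 0
    for alt_count in alt_counts:
        differences, comparisons = site_differences(alt_count, n_chrom)
        total_differences += differences
        total_comparisons += comparisons
    return total_differences / total_comparisons


# Hypothetical window: 5 SNPs, all with minor allele frequency >= 0.1,
# in 192 diploid samples (384 chromosomes), no missing genotypes.
n_chrom = 384
window_size = 10000
alt_counts = [40, 96, 192, 60, 120]

pi_fixed = window_pi_fixed_length(alt_counts, n_chrom, window_size)
pi_variant = window_pi_sites_present(alt_counts, n_chrom)

print(f"pi over full window length: {pi_fixed:.6f}")
print(f"pi over variant sites only: {pi_variant:.6f}")
print(f"ratio: {pi_variant / pi_fixed:.1f}")
```

If this reasoning holds, the second number is essentially the mean heterozygosity of the retained SNPs, which after a --maf 0.1 filter cannot fall much below about 0.18, and the two values differ by a factor of window length divided by number of SNPs in the window. That would be consistent with the ranges I see, but I have not verified that this is what each tool does.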