I am trying to calculate per gene nucleotide diversity (pi) for whole-genome-sequence data. I basically have whole genome resequenced data for many hundred individuals with ~1.2 million SNPs and a well annotated species with 36k genes. I was wondering if there is a method that would calculate per gene nucleotide diversity for whole genome sequencing data, ideally from a VCF file and in command line?
So far, I tried calculating pi with vcftools - - window-pi as well as - - site-pi but the window approach is not useful as my genes do not regularly distribute along windows and gaps. For the - -site-pi there is no explanation on how it is calculated and more problematically it does calculate more positions than exist. Another option was DnaSP6, but here I would need to produce vcf files for each gene and as piping or merging is not possible, all files would need to be uploaded manually.
I would do the following - make a single file per gene that contains the start and stop positions of the exons --> supply per gene exon coordinates as bed file and ask vcftools to calculate the pi statistic (per site) --> take the average of per site pi statistic to get an overall "per-gene" --> loop this over all genes
If site-specific pi does not work for you and is not what you want then you could try calculating it yourself by applying the classic 2pq formula (p is the frequency of ancestral and q is the frequency of derived) and normalize it by the total gene length.
loops , loops, loops : https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php
Thanks, but I still need a method that actually calculates per gene nucleotide diversity. And as mentioned DnaSP6 does not give the option of batch-running vcf files, which might have been a solution otherwise. Secondly, DnaSP6 is a Windows-based program that makes piping into it a hazle.
You could also try the R package PopGenome, I've only ever used it in windowed analyses, but I'm sure you could extract values from specific coordinates.
Alternatively, you could write a loop that subsets the VCF for the locus of interest, run VCFtools, emit the values, and repeat across all loci.
This is not really answering the question - one can loop say VCFtools, but it would still be windows/single site based. DnaSP also does not support command line commands.
Did you ever find an easy way to do this? Would be very interested to know if its feasible - looking to implement sth like this using bacterial genomes