Hi everyone,
I'm new to the community and new to the world of population genetics/genomics analyses, so thank you for your patience. I have written an analysis that I'd like to do some quality checks on, one of which would be to calculate an estimate of Watterson's theta per site. I know VCFtools has the option to calculate pi, but I haven't found an equivalent for theta. I know there are programs like ANGSD and R/PopGenome, but I was hoping there would be a way to do it via Unix from scratch (or close to it).
I've looked up the equation and gather that per site, my denominator would be the harmonic number of how many chromosomes were sampled at each position, after which I could sum theta per site for each gene and divide by how many sites were callable in each gene to get an approximation. However, I'm stumbling on how to determine the number of chromosomes sampled at each position.
Any advice on my theoretical understanding of this problem or ways to code this would be greatly appreciated! Thank you in advance for your help.
Definitely not what you asked but I have used Variscan http://www.ub.edu/softevol/variscan/ for ThetaW calculation plus a lot of other popgen stats. It is extremely fast but needs input in hapmap rather than VCF but a vcf can be easily converted to hapmap format.
Thank you very much for this suggestion! Do you have a preferred way of converting VCF to hapmap? It seems like Tassel would be easiest, but I've not worked with this format before.
I used a perl script to convert did not know that Tassel can do that ( Here seems to give some other ways as well Conversion of imputed VCF file to HapMap format )