Hello,
I am looking for guidance on genetic diversity analyses (pi, Tajima's D, etc.) using both variant and nonvariant sites. I think there should be some way to do this using a combination of my vcf file and my bed file of callable sites. I don't have the original .bam files (and they don't exist anymore), so I can't make a new vcf file that includes invariant sites to put into a program like pixy.
I am thinking I can use my callable sites file to calculate genetic diversity by summing up the total length of callable sites and using that as my denominator. However, I'm a little stuck on how exactly to do this.
Do I just need to run a script like this
vcftools --gzvcf your_data.vcf.gz --TajimaD 10000 --out tajimasD_output
and then divide the output by the length of callable sites? through a command like this?
awk '{sum += $3 - $2} END {print sum}' callable_sites.bed
Any advice is greatly appreciated : ) Thank you!
I am unsure if there are tools that take explicitly take non-called sites into account. Otherwise I would suggest the R package
PopGenome
, which is pretty handy at calculating population genomic diversity metrics (link here just in case).If there are no better existing tools, you could write a relatively simple R script to do exactly what you want. Use
vcfR
anddata.table
to read in your vcf and bed file, respectively, write a simple pi and Tajima's D calculator (formula are on wikipedia), and then adjust the calculation for each genomic range for number of non-called sites.In my experience though, I would say that the number of uninformative sites is usually low enough that metrics like pi and Tajima's D are not meaningfully affected. It's easier in every respect to have a post-processing filter to remove genomic ranges with n+ uninformative sites. If you have a lot of unfinformative sites it's likely something else is going on that could skew your calculations in that range anyway.