Question

Advice for genetic diversity stats analysis with both variant and variant sites

0

Entering edit mode

8 months ago

ekirsch • 0

Hello,

I am looking for guidance on genetic diversity analyses (pi, Tajima's D, etc.) using both variant and nonvariant sites. I think there should be some way to do this using a combination of my vcf file and my bed file of callable sites. I don't have the original .bam files (and they don't exist anymore), so I can't make a new vcf file that includes invariant sites to put into a program like pixy.

I am thinking I can use my callable sites file to calculate genetic diversity by summing up the total length of callable sites and using that as my denominator. However, I'm a little stuck on how exactly to do this.

Do I just need to run a script like this

vcftools --gzvcf your_data.vcf.gz --TajimaD 10000 --out tajimasD_output

and then divide the output by the length of callable sites? through a command like this?

awk '{sum += $3 - $2} END {print sum}' callable_sites.bed

Any advice is greatly appreciated : ) Thank you!

pi diversity vcf genetics • 563 views

ADD COMMENT • link updated 8 months ago by dthorbur ★ 2.5k • written 8 months ago by ekirsch • 0

1

Entering edit mode

I am unsure if there are tools that take explicitly take non-called sites into account. Otherwise I would suggest the R package PopGenome, which is pretty handy at calculating population genomic diversity metrics (link here just in case).

If there are no better existing tools, you could write a relatively simple R script to do exactly what you want. Use vcfR and data.table to read in your vcf and bed file, respectively, write a simple pi and Tajima's D calculator (formula are on wikipedia), and then adjust the calculation for each genomic range for number of non-called sites.

In my experience though, I would say that the number of uninformative sites is usually low enough that metrics like pi and Tajima's D are not meaningfully affected. It's easier in every respect to have a post-processing filter to remove genomic ranges with n+ uninformative sites. If you have a lot of unfinformative sites it's likely something else is going on that could skew your calculations in that range anyway.

ADD REPLY • link 8 months ago by dthorbur ★ 2.5k