Question

Calculating Tajima's D from vcf data using bed files

1

Entering edit mode

7.3 years ago

spiral01 ▴ 110

I need to calculate Tajima's D for vcf data, using a bed file to indicate the windows in which I want to calculate the statistic. I tried to use vcftools like so:

vcftools --gzvcf file.vcf.gz --bed file.bed --TajimaD 500 --stdout > output.txt

However vcftools requires that you indicate the size of the windows for which you wish to calculate the statistic (the 500 above). As such it ends up calculating Tajima's D for each scrolling window, instead of for each of the windows stipulated in my bed file.

Is there anyway to circumvent this, or any tool that is better suited to this task?

SNP • 5.5k views

ADD COMMENT • link updated 8 months ago by xoaib • 0 • written 7.3 years ago by spiral01 ▴ 110

0

Entering edit mode

You can use this tool called vcf2tajima https://github.com/xoaib4/vcf2tajima

ADD REPLY • link 8 months ago by xoaib • 0

score 0 · Answer 1 · 2018-10-31

0

Entering edit mode

6.5 years ago

Gabriel ▴ 50

Hi @spiral01, I am with the same issue here. Did you find a way to do it? Thanks! Gabriel

ADD COMMENT • link 6.5 years ago by Gabriel ▴ 50

0

Entering edit mode

Hi both! I too have the same problem. Did any of you manage to solve it?? Thanks a lot!

ADD REPLY • link 4.2 years ago by biosol ▴ 170

0

Entering edit mode

Did anyone of you get any solution? I'm facing same.

ADD REPLY • link 3.8 years ago by Muhammad • 0

0

Entering edit mode

You can use this tool called vcf2tajima https://github.com/xoaib4/vcf2tajima

ADD REPLY • link 8 months ago by xoaib • 0

score 0 · Answer 2 · 2022-03-30

I have run into this same issue and have not found a program that allows for this (I have attempted VCFtools, vcf-kit and dadi). I have written a python script to accomplish this for my data though I'm not certain I can attach it here and some portions of it will be extraneous anyways. I will describe the essential steps as best as I can though.

One must loop through a bed file doing the following:

Extract coordinate for single-line and input into a temporary bed file
Use this temporary bed file to subset exclusively those coordinates from a VCF file, input as a temporary VCF (I used bcftools view -R function for this)
Run Tajima's D on this VCF file using the window size as the END coordinate of the temporary bed file. This will cause Tajima's D to be calculated across the entire VCF with the BIN_START being 0. So for example if your coordinates are just from 10000 - 12000 bases you input 12000 as the window size, and VCFtools will calculate from 0 - 12000 but since you only have 10000 - 12000 in the VCF only those windows will actually be used.
grep out the single Tajima's D value into another file, if needed one can replace the BIN_START with a sed command using the real START coordinate as defined in the bed file

Although I recognize this is a hacky approach, as far as I can tell --TajimaD does not account for invariant/missing sites in any way, so having it ignore the several missing sites does not affect the results. I have not seen any better approaches.

I suspect something like this could also be done using the R package pegas and calculating only over specific windows (perhaps read in as a GenomicRanges file) but I will not attempt any of this.

EDIT: scikit-allel has a windowed_tajima_d function that does this in a more natural way: https://scikit-allel.readthedocs.io/en/stable/stats/diversity.html

It allows one to specify start and stop coordinates on which to calculate Tajima's D. It requires using a Python API but you can loop through a bed file to get coordinates and window sizes. I have written a Python script to accomplish this as well and feel it's more "natural" than the above approach in that it doesn't require using any direct shell commands or calculating Tajima's D starting at 0 for every interval. The values I get are marginally different from the VCFtools values (shown below for an example dataset) so I don't know what accounts for that. It does seem scikit-allel will report nan for windows that VCFtools will still calculate a value for, so perhaps the former is more conservative. scikit-allel also seems to run at comparable speed to the above approach IF subsetting regions from the VCF using a tabix executable, which is allowed in the read_vcf function.

VCFtools and scikit-allel calculations (respectively) for the same eight windows:

0.791088 0.8278262750578946
-1.61975 -1.612118529939708
-0.234816 -0.23481621199445057
-0.175853 -0.08344398964332823
-1.3498 -1.2630180884340458
-1.72715 -1.7346326888748866
-1.15154 -1.127119822745152
-1.06788 -1.107992834462033

Nick Bailey