I have run into this same issue and have not found a program that allows for this (I have attempted VCFtools, vcf-kit and dadi). I have written a python script to accomplish this for my data though I'm not certain I can attach it here and some portions of it will be extraneous anyways. I will describe the essential steps as best as I can though.
One must loop through a bed file doing the following:
- Extract coordinate for single-line and input into a temporary bed file
- Use this temporary bed file to subset exclusively those coordinates from a VCF file, input as a temporary VCF (I used bcftools view -R function for this)
- Run Tajima's D on this VCF file using the window size as the END coordinate of the temporary bed file. This will cause Tajima's D to be calculated across the entire VCF with the BIN_START being 0. So for example if your coordinates are just from 10000 - 12000 bases you input 12000 as the window size, and VCFtools will calculate from 0 - 12000 but since you only have 10000 - 12000 in the VCF only those windows will actually be used.
- grep out the single Tajima's D value into another file, if needed one can replace the BIN_START with a sed command using the real START coordinate as defined in the bed file
Although I recognize this is a hacky approach, as far as I can tell --TajimaD does not account for invariant/missing sites in any way, so having it ignore the several missing sites does not affect the results. I have not seen any better approaches.
I suspect something like this could also be done using the R package pegas and calculating only over specific windows (perhaps read in as a GenomicRanges file) but I will not attempt any of this.
EDIT: scikit-allel has a windowed_tajima_d function that does this in a more natural way: https://scikit-allel.readthedocs.io/en/stable/stats/diversity.html
It allows one to specify start and stop coordinates on which to calculate Tajima's D. It requires using a Python API but you can loop through a bed file to get coordinates and window sizes. I have written a Python script to accomplish this as well and feel it's more "natural" than the above approach in that it doesn't require using any direct shell commands or calculating Tajima's D starting at 0 for every interval. The values I get are marginally different from the VCFtools values (shown below for an example dataset) so I don't know what accounts for that. It does seem scikit-allel will report nan for windows that VCFtools will still calculate a value for, so perhaps the former is more conservative. scikit-allel also seems to run at comparable speed to the above approach IF subsetting regions from the VCF using a tabix executable, which is allowed in the read_vcf function.
VCFtools and scikit-allel calculations (respectively) for the same eight windows:
- 0.791088 0.8278262750578946
- -1.61975 -1.612118529939708
- -0.234816 -0.23481621199445057
- -0.175853 -0.08344398964332823
- -1.3498 -1.2630180884340458
- -1.72715 -1.7346326888748866
- -1.15154 -1.127119822745152
- -1.06788 -1.107992834462033
You can use this tool called vcf2tajima https://github.com/xoaib4/vcf2tajima