Aim: Download public data for a genomic range and calculate the haplotype frequencies for the SNPs in that region for each ethnic population.
I want to compute haplotype frequencies for several markers in a region for each 1000 Genomes ethnic population. Is there a tool, such as vcftools, that can be used for this purpose? Specifically, for a set of regions, I want to find the genotypes for all markers in each region and compute the haplotype frequencies.
Right now, I am doing this manually by extracting each region from the VCF file using tabix:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | vcf-subset -c CEU.list >CEU_region1.vcf
I then compute the frequency of every haplotype for each pair of SNPs in the region. The regions I am considering are small, typically containing only 2-3 SNPs, and most of the 1000 Genomes data is phased, so this is not too computationally expensive, just a little cumbersome.
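For reference, the pairwise counting step described above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the extracted file (for example CEU_region1.vcf) contains diploid, phased genotypes written with the "|" separator, and any site with a missing or unphased genotype is skipped. The function names parse_phased_vcf and pairwise_haplotype_frequencies are hypothetical, not part of vcftools or tabix.

```python
from collections import Counter
from itertools import combinations


def parse_phased_vcf(lines):
    """Return (site_ids, haplotypes) from an iterable of VCF text lines.

    Each diploid sample contributes two haplotypes; each haplotype is a
    list of allele strings, one per kept site. Assumption of this sketch:
    a site is dropped if any sample has a missing or unphased genotype.
    """
    site_ids = []
    columns = []  # per kept site: alleles in haplotype order
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the header
        fields = line.split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")
        gt_index = fields[8].split(":").index("GT")
        site_haps = []
        usable = True
        for sample in fields[9:]:
            gt = sample.split(":")[gt_index]
            if "|" not in gt or "." in gt:
                usable = False  # unphased or missing genotype
                break
            first, second = gt.split("|")
            site_haps.extend([alleles[int(first)], alleles[int(second)]])
        if usable:
            site_ids.append(f"{chrom}:{pos}")
            columns.append(site_haps)
    # Transpose: one row per haplotype, one column per site.
    haplotypes = [list(hap) for hap in zip(*columns)]
    return site_ids, haplotypes


def pairwise_haplotype_frequencies(site_ids, haplotypes):
    """Return {(site_a, site_b): {(allele_a, allele_b): frequency}}."""
    total = len(haplotypes)
    result = {}
    for i, j in combinations(range(len(site_ids)), 2):
        counts = Counter((hap[i], hap[j]) for hap in haplotypes)
        result[(site_ids[i], site_ids[j])] = {
            pair: count / total for pair, count in counts.items()
        }
    return result


# Example usage on a file produced by the tabix | vcf-subset step:
# with open("CEU_region1.vcf") as handle:
#     sites, haps = parse_phased_vcf(handle)
# print(pairwise_haplotype_frequencies(sites, haps))
```

Running this once per population file (one per sample list) would give per-population frequencies, but it still requires extracting every region separately.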
Can anyone suggest a better solution to this problem?
Hi Diviya,
I have exactly the same question, so if by any chance you managed to find a good solution, it would be great if you could let me know!
Best,
Anne