I'm trying to extract haplotype frequencies from the 1000 Genomes dataset. Suppose I have the interval chr21:20,548,907-20,549,196, which contains about 10 SNPs. I want to identify all the distinct phased haplotypes in the dataset (1092 individuals, or a subset of them) for this roughly 300 bp region and then count them to determine their frequencies.
I've downloaded the chr21 data (ALL.chr21.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz) and can visualize the genotypes using IGV or GenomeBrowse, but I do not know how to manipulate them in order to count the different haplotypes.
Any help on this would be great, thanks.
You should use 1000 Genomes phase 3 instead of phase 1 data for this, since it includes phased haplotype information.
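For the counting itself, here is a minimal sketch in plain Python rather than a definitive implementation. It assumes the VCF has a #CHROM header line, that every genotype in the region is diploid and phased (written with "|"), and that the chromosome is named "21" rather than "chr21" in the file. The function name count_haplotypes and the "-" separator between alleles are illustrative choices of mine, not part of any existing tool.

```python
from collections import Counter


def count_haplotypes(vcf_lines, chrom, start, end):
    """Count phased haplotypes over variants with start <= POS <= end.

    vcf_lines is any iterable of VCF text lines (for example an open file).
    Returns a Counter mapping a haplotype string such as "A-C-T" to the
    number of chromosomes (two per sample) that carry it.
    """
    haps = []  # one allele list per chromosome, two per sample
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            n_samples = len(fields) - 9  # sample columns start at column 10
            haps = [[] for _ in range(2 * n_samples)]
            continue
        pos = int(fields[1])
        if fields[0] != chrom or not (start <= pos <= end):
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF, then ALT alleles
        gt_index = fields[8].split(":").index("GT")
        for i, sample_field in enumerate(fields[9:]):
            gt = sample_field.split(":")[gt_index]
            if "|" not in gt:
                # Assumption: unphased calls are treated as an error here.
                raise ValueError(f"unphased genotype {gt!r} at {chrom}:{pos}")
            first, second = gt.split("|")
            # Missing alleles (".") are recorded as "N".
            haps[2 * i].append("N" if first == "." else alleles[int(first)])
            haps[2 * i + 1].append("N" if second == "." else alleles[int(second)])
    return Counter("-".join(h) for h in haps)


# Possible usage on the downloaded file (not executed here):
#   import gzip
#   with gzip.open("ALL.chr21.integrated_phase1_v3.20101123"
#                  ".snps_indels_svs.genotypes.vcf.gz", "rt") as fh:
#       counts = count_haplotypes(fh, "21", 20548907, 20549196)
#   total = sum(counts.values())
#   for hap, n in counts.most_common():
#       print(hap, n, n / total)
```

This reads the whole chromosome file line by line, which is slow but simple. If the file is bgzip-compressed and tabix-indexed, extracting just the region first would be much faster. To restrict the count to a subset of individuals, the sample columns could be filtered by name from the #CHROM line before counting.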
Many thanks. I've now downloaded three equivalent VCF files (phase 3, phase 1, phase 1 without SHAPEIT) from 1000 Genomes and am looking at one SNP using GenomeBrowse (Golden Helix); see the attached xls. Aren't columns C, G and K the ones that hold the phase information? Samples HG00097 and HG00106 are heterozygous at this SNP, and the phase information differs between the three VCF files. But if the phase 1 data does not contain phased haplotypes, then columns G and K shouldn't even exist... Can you elaborate? Thanks.
Sorry, can't seem to attach xls file...
Thanks. The key portion of your comprehensive answer concerns the writing of the code...
This should be a comment on donfreed's answer, not an answer of its own. Be more careful please.
I used samtools phase to split 1000 Genomes BAM files into phase 0 and phase 1. For a SNP genotyped as GC, does phase 0 correspond to G and phase 1 to C?
This is not an answer, it should be a comment on a relevant post. I'm moving it to a comment now, please be more careful in the future.
If G is the reference allele and C is the alternate allele, then the answer to your question is 'yes'.
This belongs as a comment on @azmanr's post, not as an answer. Please be more mindful in the future.
Okay, so when I use samtools phase to get a consensus sequence for each phase, does phase 0 correspond to the reference sequence while phase 1 is the alternate consensus sequence? It is my understanding that if I have a phased BAM file, I can get genotypes specific to a haplotype and can use samtools phase to get the consensus sequence for each haplotype.
Thanks, Azman