How to convert a VCF with genotypes and phasing info to a list of haplotypes for a ROI/SOI?
VCF files created with GATK HaplotypeCaller/GenotypeGVCFs include genotypes and the phase between nearby heterozygous genotypes.
In theory, this should make it possible to output haplotypes for a Region Of Interest (ROI) and Sample Of Interest (SOI).
For example, if there are 3 nearby heterozygous genotypes in a 100 bp region of interest, there are in theory 8 (= 2 x 2 x 2) possible haplotypes. Looking at the phasing might show that there are only 2 haplotypes, i.e. the 3 heterozygous genotypes are in phase for all the samples of interest.
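To make the counting concrete, here is a minimal Python sketch. It assumes three biallelic heterozygous sites and phased genotype strings of the form `0|1` that all belong to the same phase set; the alleles, genotypes, and the helper name `phased_haplotypes` are made up for illustration and are not taken from any existing tool.

```python
from itertools import product

# Hypothetical (REF, ALT) alleles at three heterozygous sites in the ROI.
sites = [("A", "G"), ("C", "T"), ("G", "A")]

# Without phase information, every combination of alleles is possible.
unphased = {"".join(combo) for combo in product(*sites)}


def phased_haplotypes(sites, genotypes):
    """Return the two haplotypes implied by phased GT strings like '0|1'.

    Assumes all genotypes are in the same phase set; otherwise the
    left/right allele order is not comparable between sites.
    """
    hap_a, hap_b = [], []
    for alleles, gt in zip(sites, genotypes):
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} is not phased")
        a, b = (int(i) for i in gt.split("|"))
        hap_a.append(alleles[a])
        hap_b.append(alleles[b])
    return "".join(hap_a), "".join(hap_b)


phased = phased_haplotypes(sites, ["0|1", "0|1", "1|0"])
print(len(unphased))  # 8 possible haplotypes without phasing
print(phased)         # ('ACA', 'GTG') once the phase is known
```

In a real VCF the phased genotype would have to be read from the appropriate FORMAT fields (for GATK output, PGT and PID rather than GT), and only the variant positions are joined here, not the full reference sequence of the region.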
A few years back I tried to use VCFLib vcfgeno2haplo for this, but it did not work as I expected in my hands. https://github.com/vcflib/vcflib/blob/master/doc/vcfgeno2haplo.md
Does anyone know which tools are currently good for converting a VCF with genotype and phase info to haplotypes? And have you found out how trustworthy the GATK haplotype information is?
A command-line example would be the following:
genoAndPhaseToHaploTool -input my.vcf.gz -region Chr_01:100-200 -samples samples.txt
I am not sure how best to format the output, but I could imagine something like this:
H1 ATCGATCG
H2 ATCCATCG
H3 ATCAATCG
H4 ATCTATCG
H5 ACCTATCG
H6 CCCTATCG
Sample1 = H1, H2
Sample2 = H2, H2
Sample3 = H4, H1
H1 = Sample1, Sample3, frequency = 0.33
H2 = Sample2, Sample1, frequency = 0.5
H3 = None, frequency = 0
H4 = Sample3, frequency = 0.167
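The carrier and frequency lines in the example above could be derived from the per-sample assignments roughly as follows. This is only a sketch, assuming diploid samples and a hypothetical `sample_haps` mapping; haplotypes carried by no sample (such as H3) would need to be added separately with a frequency of 0.

```python
from collections import Counter

# Hypothetical per-sample haplotype pairs, matching the example above.
sample_haps = {
    "Sample1": ("H1", "H2"),
    "Sample2": ("H2", "H2"),
    "Sample3": ("H4", "H1"),
}


def haplotype_frequencies(sample_haps):
    """Count each haplotype copy and divide by the total number of copies."""
    counts = Counter(h for pair in sample_haps.values() for h in pair)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


freqs = haplotype_frequencies(sample_haps)
for hap in sorted(freqs):
    # A sample is listed once even if it is homozygous for the haplotype.
    carriers = [s for s, pair in sample_haps.items() if hap in pair]
    print(f"{hap} = {', '.join(carriers)}, frequency = {freqs[hap]:.3f}")
```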
I have Illumina 150 bp sequencing data for multiple samples. I would like to use the phase information in the sequencing data to determine the haplotypes for small regions of interest. Regions can be as small as 100 or 150 bp, so within the Illumina read length.