I'm trying to annotate the IDs in a VCF of whole-genome (WG) data using another VCF of genotype data. Both are aligned to the same reference, hg19. One thing to note is that the REF and ALT alleles disagree between the two files at some positions (they appear to be swapped), and I don't know why. See here:
POphased chr22 (genotype data VCF)
22 17075353 rs5747999 A C . PASS . GT 0|1 1|1 1|0 0|1 1|1
22 17203103 rs2845380 A G . PASS . GT 1|1 1|1 1|1 1|1 1|1
22 17282666 rs5994022 G A . PASS . GT 1|1 1|1 1|1 1|1 1|1
Peek at AGR chr22 (WG data VCF)
22 17075353 . C A . . . GT 0|1 0|0 1|1 1|0 1|1
22 17203103 . A G . . . GT 1|1 1|1 1|1 1|1 1|1
22 17282666 . G A . . . GT 1|1 0|1 1|1 1|1 1|1
I used the following bcftools command to annotate overlapping positions across the data:
bcftools annotate -c ID -a SA_POtest.recode.vcf.gz -o annot.vcf AGR_test.recode.vcf.gz
This seemed to work, but only for SNPs that have the same REF allele in both files (i.e. no mismatch), which is a very small subset of the total SNPs available in the genotype data (1126 / 21640). Looking at the same positions in the new file, you see the following pattern: rsIDs are present where the REF/ALT alleles match, and missing where they do not.
Peek at annot.vcf
22 17075353 . C A . . . GT 0|1 0|0 1|1 1|0 1|1
22 17203103 rs2845380 A G . . . GT 1|1 1|1 1|1 1|1 1|1
22 17282666 rs5994022 G A . . . GT 1|1 0|1 1|1 1|1 1|1
How can I fix the mismatched REF/ALT alleles?
I suspect the REF and ALT alleles got switched somewhere during a Snakemake phasing pipeline or during file format conversion. Both data sets were aligned to the same reference, so to my understanding they shouldn't have such large discrepancies.
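To check whether the mismatches really are simple REF/ALT swaps, something like the following could be used to count, for positions shared by both files, how many have identical alleles, how many have them exchanged, and how many differ in some other way. This is only a rough sketch: the function names (`read_sites`, `classify`, `compare_sites`) are made up for illustration, it assumes biallelic SNPs with one record per position, and it does not look at strand flips or at the genotypes themselves.

```python
import gzip
import sys
from collections import Counter


def read_sites(path):
    """Map (CHROM, POS) -> (REF, ALT) for each record of a plain or gzipped VCF.

    Assumption: one record per position; a later duplicate overwrites an earlier one.
    """
    opener = gzip.open if path.endswith(".gz") else open
    sites = {}
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.split()[:5]
            sites[(chrom, pos)] = (ref.upper(), alt.upper())
    return sites


def classify(alleles_a, alleles_b):
    """Return 'match', 'swapped', or 'other' for two (REF, ALT) pairs."""
    if alleles_a == alleles_b:
        return "match"
    if alleles_a == alleles_b[::-1]:
        return "swapped"
    return "other"


def compare_sites(path_a, path_b):
    """Count match / swapped / other over positions present in both files."""
    sites_a = read_sites(path_a)
    sites_b = read_sites(path_b)
    shared = sites_a.keys() & sites_b.keys()
    return Counter(classify(sites_a[key], sites_b[key]) for key in shared)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python compare_sites.py SA_POtest.recode.vcf.gz AGR_test.recode.vcf.gz
    print(compare_sites(sys.argv[1], sys.argv[2]))
```

On the three positions shown above, this would count 17203103 and 17282666 as matches and 17075353 as swapped. If nearly all of the unannotated positions fall into the swapped group, that would support the idea that REF and ALT were exchanged in one of the files rather than the data being aligned differently.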