Dear community members,
I have data from an Illumina array, and after conversion to VCF it looks like this (one line as an example):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NAME
1 752721 rs3131972 C T . . PR GT 0/1
Now I need to extract information about these variants from a large cohort of WGS samples.
The problem is that C is not actually the REF allele for this variant ( https://www.ncbi.nlm.nih.gov/snp/rs3131972?horizontal_tab=true ). For some variants REF really is the reference allele, but for half of them REF and ALT are switched.
When I look up this variant in the array specs, I see this line:
rs3131972-138_T_R_2263598533,rs3131972,TOP,[A/G],0060710106,AACGTTCACTTTCTGTCTGTGTTCACGTCACCAAGAGAATAGAAAGGAAA,,,37,1,752721,diploid,Homo sapiens,dbSNP,138,BOT,GCCTGGACTGGAGGGCTGTCTCAAGGAGGGTGACGTGTCTTTGACTTTTGCATTCTTCCC[T/C]TTTCCTTTCTATTCTCTTGGTGACGTGAACACAGACAGAAAGTGAACGTTTTTTGCATAA,TTATGCAAAAAACGTTCACTTTCTGTCTGTGTTCACGTCACCAAGAGAATAGAAAGGAAA[A/G]GGGAAGAATGCAAAAGTCAAAGACACGTCACCCTCCTTGAGACAGCCCTCCAGTCCAGGC,1897,3,0,+
So in the SNP column the variant is even given as A/G rather than C/T.
Is there a way to normalize a VCF against the reference genome, to fix REF/ALT? I am absolutely lost: I expected this to be a very simple procedure, but it seems very complex. I can't even rely on rs-IDs, since they are missing for many array variants.
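
To make concrete what I mean by "normalize", here is a minimal sketch of the per-variant check I have in mind. It is only an illustration under assumptions: the function names (`fix_record`, `swap_gt`) are hypothetical, it handles biallelic SNPs only, and the reference base at each position (A for 1:752721, going by the dbSNP page above) would have to be looked up in the reference FASTA, which is not shown here.

```python
# Hypothetical sketch: align one biallelic SNP record to the reference base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def swap_gt(gt):
    """Exchange allele indices 0 and 1 in a genotype string such as '0/1'."""
    return gt.translate(str.maketrans("01", "10"))


def fix_record(ref_base, ref, alt, gt):
    """Return (ref, alt, gt, action) with REF matching ref_base if possible.

    ref_base is assumed to come from the reference genome at this position.
    """
    # A/T and C/G SNPs look the same on both strands, so the strand
    # cannot be decided from the alleles alone.
    if {ref, alt} == {COMPLEMENT[ref], COMPLEMENT[alt]}:
        return ref, alt, gt, "ambiguous"
    if ref_base == ref:
        return ref, alt, gt, "ok"
    if ref_base == alt:
        # REF and ALT are switched: swap them and recode the genotype.
        return alt, ref, swap_gt(gt), "swap"
    cref, calt = COMPLEMENT[ref], COMPLEMENT[alt]
    if ref_base == cref:
        # Alleles were reported on the opposite strand.
        return cref, calt, gt, "flip"
    if ref_base == calt:
        # Opposite strand and switched REF/ALT.
        return calt, cref, swap_gt(gt), "flip_swap"
    # Neither allele matches the reference on either strand.
    return ref, alt, gt, "mismatch"


# The example record above: C/T in the VCF, reference base assumed to be A.
print(fix_record("A", "C", "T", "0/1"))
```

For the example record this would turn C/T into A/G, which agrees with the A/G in the array specs, but the A/T and C/G variants remain undecidable this way, and doing this by hand for every variant is exactly what I was hoping to avoid.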