Dear Colleagues,
I am new to variant calling and have started analysing a VCF generated from WES BAM files to isolate clinically relevant germline variants. The VCF was generated using GRCh38 as the reference sequence. I have now stumbled over the fact that a huge number of variants apparently have a very low global allele frequency (AF).
For example, at https://www.ncbi.nlm.nih.gov/snp/rs2728532 , the G>T variant seems to have a GAF of G=0.00242 and T=0.99758. Does this mean that 'T' is the "correct" (most common) allele globally and 'G' is a rare variant?
Thank you in advance for your help!
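The frequencies quoted in the question can be compared mechanically. The following is only a minimal sketch: the helper names `major_allele` and `reference_is_minor` are hypothetical, and it assumes the per-allele frequencies have already been collected into a dictionary (here typed in by hand from the figures above, with G taken as the reference allele because the variant is written G>T).

```python
def major_allele(freqs):
    """Return the (allele, frequency) pair with the highest frequency."""
    allele = max(freqs, key=freqs.get)
    return allele, freqs[allele]


def reference_is_minor(ref, freqs):
    """True when the reference allele is not the most frequent allele."""
    return major_allele(freqs)[0] != ref


# Values copied from the question (rs2728532, G>T); G is assumed to be the reference allele.
freqs = {"G": 0.00242, "T": 0.99758}
ref = "G"

allele, freq = major_allele(freqs)
print(f"major allele: {allele} ({freq:.5f})")
print(f"reference allele {ref} is the minor allele: {reference_is_minor(ref, freqs)}")
```

For this site the sketch reports T as the major allele and flags the reference allele G as the minor one, which is the situation the question describes.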
Thank you very much for your explanation and the link to the majref sequence. From an evolutionary point of view, I would expect that a particular variant carried by 99.99% of humans should be the one that represents the near-optimum in terms of gene function (assuming that, on average, we are the crown of creation and do not need improvement ;)). Therefore, I thought that a "reference" should reflect the highest global genotype frequency ... but good to know that I was wrong.
How do you define optimum across ethnicities and geographies with their varied histories? Let's say the amount of oxygen needed by the body is optimized evolutionarily based on the altitude that a group of people have lived in; what would you consider a global near-optimum here? Your idea of a reference genome seems to come from a simplistic view of the world.
You are right, and I am aware of this fact, but I have to start from a reference that gives me an "oversimplified" gene structure. In the advanced stage, it could then be interesting to see what a variation does in a Sherpa population.
Why would you expect that? The vast majority of variants have no known effect on gene function and are not under selective pressure.