Question

How can we explain a lower reference allele frequency than alternative allele in large populations for some SNPs?

18 months ago

m.koohi.m ▴ 120

I've been checking the dbSNP database and found that in a large population sample such as the ALFA project, the reference allele frequency appears to be much lower than the alternative allele frequency for some SNPs, e.g. rs3130253.

https://www.ncbi.nlm.nih.gov/snp/rs3130253

Typically, one might expect the reference allele to be more prevalent, so this seems weird to me. Could anyone shed light on potential reasons or mechanisms that could explain why a large portion of a population might exhibit a higher frequency of an alternative allele over the reference allele? Thanks

mutation human-genome SNP • 1.4k views

updated 18 months ago by Ram 45k • written 18 months ago by m.koohi.m ▴ 120

So your question is why the reference allele sometimes occurs at a lower frequency than the alternate allele?

Typically, one might expect the reference allele to be more prevalent

Not exactly. Besides the technical explanations given in the comments below, the frequencies of alleles segregating in a population change over time due to various factors (drift, linkage disequilibrium, natural selection, sweeps, etc.). So the frequency of the reference allele (the single nucleotide observed at a given position in the "reference" genome) could have changed in different populations for many reasons.
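To illustrate just the drift part, here is a minimal sketch of a neutral Wright-Fisher simulation. The function name `wright_fisher`, the starting frequency, and the population size are all made up for illustration; this is not a model of any real SNP, and it ignores selection, linkage, and demography.

```python
import random

def wright_fisher(p0, n_copies, generations, seed=0):
    """Simulate neutral drift of one allele's frequency.

    p0          -- starting frequency of the allele (hypothetical value)
    n_copies    -- number of gene copies in the population (2N for diploids)
    generations -- number of generations to simulate
    """
    rng = random.Random(seed)
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Each copy in the next generation is sampled independently
        # from the current allele pool (binomial sampling).
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        freqs.append(p)
    return freqs

# An allele that starts at 50% wanders away from 50% by chance alone.
trajectory = wright_fisher(p0=0.5, n_copies=200, generations=100)
print(f"start={trajectory[0]:.2f} end={trajectory[-1]:.2f}")
```

Running it with different seeds gives different end points, which is the point: two populations that start with the same frequency can end up with either allele as the common one.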

18 months ago by manaswwm ▴ 570

score 1 · Answer 1 · 2024-01-22

At this position the reference genome contains a rare allele. So the allele that is common in other samples/individuals is reported as ALT and will have a higher AF for the same variant.

Typically, one might expect the reference allele to be more prevalent,

No, you don't swap REF and ALT alleles; the REFerence genome (hg19, hg38, ...) is fixed. REF and ALT won't change just because AF(REF) < AF(ALT).
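A small sketch of the distinction, assuming you have allele counts for a biallelic site. The helper `summarize_site`, the alleles, and the counts are hypothetical and are not the real values for rs3130253. REF/ALT come from the reference assembly, while major/minor come from the counts, so the two labelings can disagree.

```python
def summarize_site(ref, alt, ref_count, alt_count):
    """Report REF/ALT frequencies and the major/minor allele for a biallelic site."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no observed alleles at this site")
    af_ref = ref_count / total
    af_alt = alt_count / total
    # Major/minor depends only on the counts; REF/ALT never changes.
    # On an exact tie the REF allele is arbitrarily called major.
    if alt_count > ref_count:
        major, minor = alt, ref
    else:
        major, minor = ref, alt
    return {"REF": ref, "ALT": alt, "AF_REF": af_ref, "AF_ALT": af_alt,
            "major": major, "minor": minor}

# Made-up counts: the reference assembly carries the rarer allele.
site = summarize_site(ref="A", alt="G", ref_count=200, alt_count=800)
print(site)
```

Here REF is still "A" even though "G" is the major allele in the sample.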

score 1 · Answer 2 · 2024-01-22

A reference genome is a simplification of an individual's genome, typically into a single haploid sequence. It generally represents the population from which it was sampled, but even then it is typically an over-simplification of a genome. You lose information about polymorphic loci when creating the single linear reference sequence (hence the creation of reference genome graphs).

If you are looking at SNPs from a differentiated population in your dbSNP database, then it is likely that at many positions either the reference allele is rare or the alternative allele is fixed.

An alternative reason is that, for some loci, the reference genome was not representative of the population from which it was sampled, so the alternative alleles are more common even in that population.