- How did it come to be that the alternate nucleotide was more frequent than the reference nucleotide?
- How does one account for this phenomenon when designing a strategy to filter for variants of interest? Should I go through the complicated process of selecting those individuals who DO NOT have the variant and calculate that the REFERENCE frequency in the population is probably around (1 - esp6500siv2_all)?
I am researching a rare disease and have whole exome sequence data with the corresponding variant calls. Each variant call has been passed to ANNOVAR and, among other data, we have looked up the frequency of the variant in the esp6500siv2_all data. Clearly a variant that was observed to have a high frequency in our sample but that had low frequency in esp6500siv2_all would be of disproportionate interest.
Lo and behold, I was surprised to find that 13% of all of our variants (4055 out of 32131) had an allele frequency that was greater than 0.5. How can that be? I expected that all the allele frequencies would be <0.5.
I had thought that the variants would be akin to a minor allele frequency (MAF). Clearly I was wrong. I pulled 3 random variants from among the variants that had more than 0.5 frequency, to check them against the Exome Variant Server.
    avsnp147  Chr     Start       End  Ref  Alt  Gene.refGene  esp6500siv2_all
1:  rs3803530  15  89632842  89632842    C    A          KIF7           0.5373
2:  rs621383    3 125118840 125118840    T    C       SLC12A8           0.9988
3:  rs633561   11  64229857  64229857    A    G        NUDT22           0.9418
Looking these up on the NHLBI Exome Sequencing Project (ESP) Exome Variant Server and using the All Allele counts:
- rs3803530: C>A; A=6984/C=6014 which means A is 6984/(6984+6014) or 0.537
- rs621383: T>C; C=12479/T=15 which means C is 12479/(12479+15) or 0.999
- rs633561: A>G; G=12240/A=756 which means G is 12240/(12240+756) or 0.942
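The same arithmetic can be checked in a few lines of Python. This is only a sketch: the function name `alt_frequency` is made up for illustration, and the counts are the ones quoted in the list above.

```python
def alt_frequency(alt_count: int, ref_count: int) -> float:
    """Fraction of observed alleles that are the alternate allele."""
    total = alt_count + ref_count
    if total == 0:
        raise ValueError("no alleles observed")
    return alt_count / total


# (alt count, ref count) as quoted from the EVS "All Allele" field above
evs_counts = {
    "rs3803530": (6984, 6014),
    "rs621383": (12479, 15),
    "rs633561": (12240, 756),
}

for rsid, (alt, ref) in evs_counts.items():
    print(rsid, round(alt_frequency(alt, ref), 4))
```

Rounded to four decimals, these come out as 0.5373, 0.9988 and 0.9418, which agrees with the esp6500siv2_all column in the table.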
At these positions the reference genome carries the rare allele.
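One possible way to handle this when filtering is to work out which allele is the minor one instead of assuming it is always the alternate. The sketch below assumes a biallelic site, so that the reference frequency is simply 1 minus the alternate frequency; that assumption fails at multiallelic sites. The helper name `minor_allele` is hypothetical and not part of ANNOVAR.

```python
from typing import Tuple


def minor_allele(ref: str, alt: str, alt_freq: float) -> Tuple[str, float]:
    """Return (minor allele, its frequency) for a biallelic site.

    Assumes only two alleles exist at the site, so ref_freq = 1 - alt_freq.
    """
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be between 0 and 1")
    ref_freq = 1.0 - alt_freq
    if alt_freq <= ref_freq:
        return alt, alt_freq
    return ref, ref_freq


# rs621383 from the table above: Ref=T, Alt=C, esp6500siv2_all=0.9988
allele, maf = minor_allele("T", "C", 0.9988)
print(allele, round(maf, 4))
```

For rs621383 this reports T as the minor allele, at a frequency of about 0.0012.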
OK, and the reference genome would be just one person's at any one spot? The entire genome could be made up of many people's genomes, but at any one locus it would be just one person's sequence? So would I be correct that all the SNPs mentioned in the NHLBI Exome Sequencing Project (ESP) would be from an individual who is homozygous at that point? If they were heterozygous there could be no ref vs alt.
Hey Farrel, hg38 was released in the wake of the 1000 Genomes Project, where whole genome sequence data from ~2500 individuals became available. This information was in part used to construct hg38 where, for example, many of the rare disease risk alleles of hg19 were modified to represent more common alleles. hg38 also improves on sequence in centromeric and other repeat regions, which were previously difficult to sequence. So, yes, it's still a linear representation of the genome and has its flaws.
I'm not sure that I understand your point about the ESP. The ESP is a disease association study and includes heterozygous and homozygous variants.
ESP allele frequencies, like all others, are just counted: 1 for a het and 2 for a hom. If my variant is observed as het in 1 individual in my cohort of 500 patients, then the allele frequency is
1 / (500 * 2) = 0.001 (0.1%)
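The counting rule above can be written as a small function. This is a sketch that assumes a diploid, autosomal site with a genotype call for every individual; the name `allele_frequency` and its parameters are made up for illustration.

```python
def allele_frequency(n_het: int, n_hom_alt: int, n_individuals: int) -> float:
    """Alternate allele frequency in a diploid cohort.

    Each heterozygote contributes 1 alternate allele and each alternate
    homozygote contributes 2, out of 2 * n_individuals chromosomes.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return (n_het + 2 * n_hom_alt) / (2 * n_individuals)


# One heterozygous carrier in a cohort of 500 patients
print(allele_frequency(n_het=1, n_hom_alt=0, n_individuals=500))
```

For the cohort example this gives 0.001.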