Hi,
This is not a pure bioinformatics question, but Biostar is the best place I could think of to ask it. I observe some positions in the human genome where every single sample sequenced (whole exome sequencing) is homozygous for the non-reference allele relative to hg19. What may be the biological reason for this?
Thanks
You don't say how many homozygous non-ref individuals you have, but one other possibility is that you are detecting paralogy. Imagine there are two copies of a DNA segment (A and A') present in the genome that only differ at the position you mention. Depending on how you did your alignment, you may have forced the A' DNA onto the A copy. This would be evident by low mapping quality. Or possibly A' is absent from the reference and is only present in your experimental genomes. This would force all A' DNA onto A.
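To make the mapping-quality check concrete, here is a minimal sketch that reads SAM-formatted alignment lines and reports what fraction of the reads covering a site have low MAPQ. The function names, the example reads and the cutoff of 20 are all assumptions for illustration, not anything taken from the question; in practice you would feed it the output of your own aligner.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) that a read covers on the reference."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def low_mapq_fraction(sam_lines, chrom, site, threshold=20):
    """Fraction of reads overlapping chrom:site whose MAPQ is below `threshold`.

    `threshold=20` is an arbitrary illustrative cutoff, not a recommendation.
    """
    total = 0
    low = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, mapq, cigar = fields[2], int(fields[3]), int(fields[4]), fields[5]
        if rname != chrom or cigar == "*":  # other contig or unaligned read
            continue
        start, end = reference_span(pos, cigar)
        if start <= site <= end:
            total += 1
            if mapq < threshold:
                low += 1
    return low / total if total else float("nan")


# Hypothetical reads: name, flag, contig, pos, MAPQ, CIGAR, then unused fields.
EXAMPLE_READS = [
    "\t".join(["r1", "0", "chr1", "100", "60", "50M", "*", "0", "0", "*", "*"]),
    "\t".join(["r2", "0", "chr1", "120", "0", "50M", "*", "0", "0", "*", "*"]),
    "\t".join(["r3", "0", "chr1", "200", "3", "50M", "*", "0", "0", "*", "*"]),
    "\t".join(["r4", "0", "chr1", "90", "5", "10M5D30M", "*", "0", "0", "*", "*"]),
]

fraction = low_mapq_fraction(EXAMPLE_READS, "chr1", 130)
```

A high fraction at the site, compared with flanking positions, would be consistent with reads from a second copy being placed on the copy that is in the reference. It would not prove it.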
as always, you have to question what the term "reference" means on the "reference allele" tag. I explain on an entry below that there's no point questioning the reference genome correctness, as it wasn't designed to represent us all, but to create a common template where we all could map our annotations to. what you found may be as correct as the reference allele, it's just that both were found on different experiments using different samples, nothing else.
That's what I think too, but let's say the non-ref allele is a; why do we not see any AA or Aa genotypes in a population of more than 50 unrelated samples?
Hi Pierre,
The question now becomes, if we think that it is indeed a sequencing error or a private mutation in the human genome reference then what are the underlying biological reasons that would result in such a region in the genome where everyone is a homozygote for one allele only?
the REFERENCE ALLELE does simply NOT REPRESENT the whole HUMAN CONSENSUS, so there's no point questioning its correctness. check out my answer on the arbitrariness of the "reference allele" tag.
An important aspect to consider is how many homozygous non-ref individuals you observe. If you have only a few, say 5, then the site could be polymorphic. How does the ancestry of your samples compare to the European ancestry of the ref genome? If you have sequenced dozens of individuals, then the most likely explanation is a sequence error in the reference genome. This would be true for SNPs. CNVs or microsatellites present other scenarios.
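To put rough numbers on "a few" versus "dozens", here is a small sketch. It assumes unrelated diploid samples drawn from a single randomly mating population, so the 2n sampled chromosomes are treated as independent draws; the function name and the 95% level are my own choices for illustration. If all n individuals are homozygous non-ref, the largest reference-allele frequency p still compatible with that observation satisfies (1 - p)^(2n) = alpha.

```python
def ref_allele_upper_bound(n_individuals, alpha=0.05):
    """Upper (1 - alpha) bound on the reference-allele frequency when all
    `n_individuals` diploid samples are homozygous for the non-ref allele.

    Assumes the 2n chromosomes are independent draws from one population.
    """
    chromosomes = 2 * n_individuals
    return 1.0 - alpha ** (1.0 / chromosomes)


bound_5 = ref_allele_upper_bound(5)    # roughly 0.26
bound_50 = ref_allele_upper_bound(50)  # roughly 0.03
```

Under these assumptions, 5 samples cannot rule out a reference allele at a frequency as high as about 26%, whereas 50 samples push the bound down to about 3%.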
Remember, the ref genome was built from sequencing of libraries made from genomic DNA of six different individuals of European ancestry. DNA from 1 or up to 6 persons comprises the sequence at a given position and so for much of the ref genome it is not really a single sequence, but a composite of a few-to-several individuals. That said, a simple error or limitation in the technology used in those days could give an error in the sequence. At the same time, all of your data could also be in error at that position. The ref genome was sequenced on both strands or with 2 different chemistries. This could still produce an error but those were reduced a lot by this approach.
Then, I would look at data from another large-scale project. One could be HapMap, to see what the sequence is at this position in people of very different ethnic ancestries. Sequence data near a known SNP is likely to cover the region where you see a discrepancy. Another type of large-scale project entails the whole-genome data from 1000 Genomes or the 69 genomes available from Complete Genomics. I'd go with the latter as a 3rd party source for data to resolve this issue.
Larry and Pierre have already pointed out the main reasons, but I would like to summarize their answers by stressing the key points that I constantly see misleading most of my colleagues when they see the "reference allele" tag. the term "reference allele" is used to tag the particular allele present on the reference genome assembly, so for that reason:
it is an arbitrary convention, useful for nomenclature only, with no biological meaning. its correctness doesn't have to be questioned, it's just what it is.
the "reference allele" isn't at all what we all should have at that position, since the small set of human samples used to build the genome assembly represents a very limited source of variation.
population genetics has to be taken into account when a large number of samples are sequenced. overrepresented alleles may be found in particular populations, so checking those frequencies in advance would be helpful. HapMap, 1000 Genomes or even dbSNP directly are great informative resources to resolve these kinds of issues.
current sequencing efforts will surely soon allow a large summarizing project that will ultimately provide us with a consensus reference that would encompass much larger variation sources, hence we will be able to use it as a proper reference. but until then, we have to stick to the convention we have in order to understand each other. or do you think that the reference assembly is absolutely free of genetic disorders, and that it does not include rare variations or deleterious mutations?
these are fair points, but one must distinguish private mutations from rare polymorphisms. without resequencing a sufficient number of individuals, a sequencing mistake in the reference genome can mislead one into thinking a site is polymorphic, when in fact, it is monomorphic.
so we are stressing the same point: the fact that the reference sequence does not represent the consensus of the whole world population (which is currently ~7 billion, we shouldn't forget that number) does not mean that it's wrong, it's just what it is. also, I wouldn't be so certain regarding absolute monomorphic sites: maybe they are monomorphic just because they haven't been found... yet. if the threshold you set to consider a polymorphic site is 1%, we should still sequence many population samples in order to be completely sure that a particular site is absolutely monomorphic.
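as a rough illustration of "many population samples", here is a back-of-the-envelope sketch. it assumes unrelated diploid individuals from one randomly mating population, so that an allele at frequency f is missed on all 2n chromosomes with probability (1 - f)^(2n). the function name and the 5% miss probability are assumptions chosen only for the example.

```python
import math


def individuals_needed(freq, miss_prob=0.05):
    """Smallest number of diploid individuals such that an allele at population
    frequency `freq` goes completely unobserved with probability <= `miss_prob`.

    Assumes the sampled chromosomes are independent draws from one population.
    """
    chromosomes = math.log(miss_prob) / math.log(1.0 - freq)
    return math.ceil(chromosomes / 2)


n_for_one_percent = individuals_needed(0.01)
```

under these assumptions, about 150 individuals are needed before a 1% allele has less than a 5% chance of being missed entirely, and that only covers the single population sampled.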
Obviously, by chance you expect no more than 50% heterozygosity at any given biallelic site, unless you have a case of extreme heterozygote advantage (unlikely). If you're seeing 100% heterozygosity, maybe those heterozygous calls actually represent single-nucleotide differences between duplicated regions, where only one of the two duplicated regions is in the reference sequence, so all the reads from the copy that is missing from the reference assembly are mapping to the copy that is included.
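The 50% ceiling comes from the Hardy-Weinberg genotype proportions for a biallelic site: with allele frequencies p and q = 1 - p, the expected heterozygote frequency is 2pq, which peaks at p = 0.5. A minimal sketch, with a function name invented for this example:

```python
def hwe_genotype_freqs(p):
    """Expected genotype frequencies at a biallelic site under Hardy-Weinberg
    equilibrium, where `p` is the frequency of allele A and 1 - p that of a."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


# Scan allele frequencies from 0 to 1 and find the largest expected heterozygosity.
max_het = max(hwe_genotype_freqs(i / 100.0)["Aa"] for i in range(101))
```

Scanning p from 0 to 1, the expected heterozygote frequency never exceeds 0.5, so a site where essentially every sample is called heterozygous is not something random mating alone would produce.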
It is a sequencing error in the reference genome.
This has nothing to do with HWE. If you saw that everyone was a heterozygote, that would be about HWE deviation.
Hi Heng, you must be observing this too. How do you guys deal with this issue? Do you have an explanation for this? Thanks.
Thanks. Greatly appreciated.