As the title implies, what are phased and unphased genotypes? I am playing with 1000 Genomes data and am not sure whether I should be handling phased and unphased genotypes differently.
Documentation on the internet seems to be quite sparse...
Phased data are ordered along one chromosome and so from these data you know the haplotype. Unphased data are simply the genotypes without regard to which one of the pair of chromosomes holds that allele.
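To make the distinction concrete, here is a minimal sketch in Python. It assumes VCF-style genotype strings, where `|` separates the alleles of a phased call and `/` those of an unphased call; the function names are made up for illustration, and missing calls (`.`) are not handled.

```python
def parse_gt(gt):
    """Split a VCF-style GT string into (alleles, phased).

    Assumes '|' marks a phased call and '/' an unphased one,
    and that no allele is missing ('.').
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = tuple(int(a) for a in gt.split(sep))
    return alleles, phased


def unordered_genotype(gt):
    """Return the genotype without regard to which chromosome holds which allele."""
    alleles, _ = parse_gt(gt)
    return tuple(sorted(alleles))


print(parse_gt("0|1"))            # ((0, 1), True)
print(parse_gt("1/0"))            # ((1, 0), False)
print(unordered_genotype("1|0"))  # (0, 1)
```

Taken one site at a time, `0|1` and `1|0` describe the same heterozygous genotype; the order only carries information when it is read together with the other phased sites on the same chromosome.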
Hi
Actually (I think) phased or unphased status is not related to any measure of quality. Each individual has two copies of each chromosome, labelled paternal and maternal (arbitrarily so when you do not have the genotypes of the parents). The names are self-explanatory.
For a heterozygous genotype at a SNP position (which is called conditional on some quality score), you may know which allele is on the maternal chromosome and which one is on the paternal chromosome. The genotype is then "ordered". If, for a heterozygous call (still conditional on quality) at another SNP position, you can also assign which allele is on the paternal chromosome and which one is on the maternal, then you are able to phase these two SNPs or, more precisely, to phase the alleles at these SNPs. You then get a haplotype, that is, a series of "ordered" SNPs.
In this context, having ordered 0/1 at SNP 1 and 1/0 at SNP 2 is not the same as having ordered 0/1 at SNP 1 and 0/1 at SNP 2.
The first gives the haplotypes 0 1 and 1 0 (one per chromosome), while the second gives 0 0 and 1 1.
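The two cases can be reproduced with a small Python sketch. It assumes the ordered calls are written with `|` and that the first allele of every call sits on the same chromosome; the second case is taken to be 0|1 at both SNPs, which is what the haplotypes 0 0 and 1 1 imply. The function name is hypothetical.

```python
def haplotypes(phased_calls):
    """Turn per-SNP ordered calls such as ["0|1", "1|0"] into two haplotype strings.

    Assumes the allele before '|' is always on the same chromosome.
    """
    first, second = [], []
    for call in phased_calls:
        a, b = call.split("|")
        first.append(a)
        second.append(b)
    return " ".join(first), " ".join(second)


print(haplotypes(["0|1", "1|0"]))  # ('0 1', '1 0')
print(haplotypes(["0|1", "0|1"]))  # ('0 0', '1 1')
```

Both individuals are heterozygous at both SNPs, so unphased data could not tell them apart.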
Now, one could use some pre-estimated phase information from a panel population (typically different from the population in which you call your alleles) to help call an allele when the quality is low. This is what BEAGLECALL does, usually in a chip genotyping context.
As for the 1000G data, having the phased data helps you get a better estimate of linkage disequilibrium. It also means that the format may differ, so you need to take care when you use it as input. But besides the input format and the extra information about LD, the ways you may use phased and unphased data here are not really different.
Christian
PS: sorry if I went too far back to the basics
Hi, I've read about the concept of phased haplotypes and ordered genotypes but never worked with any data. When the OP says they have a genotype called as 0|1, what are the numbers? Is it paternal allele | maternal allele, so that the paternal allele is 0 and the maternal allele is 1 for this SNP, or is the paternal allele always 0 and the maternal allele always 1?
My experience is with phased data from 1000 Genomes for imputation programs (so not VCF files). There, you have one line per chromosome (in a .haplo type file); I think the paternal one comes first. The 0 and 1 refer to a code from a descriptive marker file.
Let's say rs1 has alleles A and G, and rs2 has C and T. Then
ind1 0 1
ind1 0 0
means that ind 1 bears the haplotypes
A - T
A - C
If the convention is paternal/maternal, then 0/0 - 0/1.
Could you tell us which file you are using? What I was referring to are the processed 1000G files intended for software like IMPUTE or MACH.
Let me check... You're right, -1 for me. It was after a long night, apologies. Let me rephrase (although I am sure you understand): if ind 1 is 0 1 at rs1 and 0 0 at rs2, he will have the haplotypes (00) / (10), or (AC)/(GC).
I really screwed up the example, but haplotype things are not easy to do...
Apologies
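Since the example is easy to get wrong by hand, here is a hedged Python sketch of the decoding step. The marker table and the rule "0 is the first listed allele, 1 is the second" are assumptions made for illustration; the real coding comes from the marker (legend) file that ships with the haplotype file, and the function name is made up.

```python
# Hypothetical marker table: (id, allele coded 0, allele coded 1).
markers = [("rs1", "A", "G"), ("rs2", "C", "T")]


def decode(hap_line, markers):
    """Translate one haplotype line of 0/1 codes into its alleles."""
    codes = hap_line.split()
    # Index 1 of a marker tuple is the allele coded 0, index 2 the allele coded 1.
    return "-".join(m[1 + int(c)] for m, c in zip(markers, codes))


# One line per chromosome for the same individual.
print(decode("0 1", markers))  # A-T
print(decode("0 0", markers))  # A-C
```

Read column by column, those two lines give the genotypes 0/0 at rs1 and 1/0 at rs2.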
well explained here
If you are analysing the 1000G data taking each SNP as an independent data point, you most probably don't need phased data. If what you are studying are correlations between, say, pairs of SNPs, and can be influenced by recombination, like linkage disequilibrium or selective sweeps, then you need phased data.
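As an illustration of why pairwise statistics need phase, here is a minimal Python sketch of the r² measure of linkage disequilibrium computed directly from haplotypes. The haplotype lists are toy data invented for the example, alleles are assumed to be coded 0/1, and the function name is hypothetical.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic SNPs from a list of (snp1, snp2) haplotypes coded 0/1."""
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n                 # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n                 # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n    # frequency of haplotype 1-1
    d = p_ab - p_a * p_b                                    # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                                 # undefined if a SNP is monomorphic
    return d * d / denom


# Toy data: alleles always travel together vs. all four combinations equally common.
print(r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))  # 1.0
print(r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 0.0
```

The haplotype frequency `p_ab` is the quantity that unphased genotypes do not give you directly: a double heterozygote could be 0-0 / 1-1 or 0-1 / 1-0, so it has to be estimated statistically instead of counted.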
A biallelic genotype comes from two chromosomes. Phased means I know not only the genotypes but which chromosome each genotype call came from. This lets you interpret which sets of genotypes are being inherited together; google haplotype if this isn't clear.
No. A lot of depth is not needed to call major and minor allele. First, there is no such thing as major/minor for an individual; those are population values. Allele calls for an individual's sample are based on sequence quality - so two reads can do it, one with an A and one with a G. If high quality, the subject is a heterozygote. SNPs from the 1000G data are in dbSNP 132, I believe.
I don't quite understand your second question. Either rephrase that or give me some time to think about this.
The genotype at hand would need a lot of depth and allele counts to be able to determine the major and minor alleles then, right? Plus, even though the phased data is "ordered," the order of the bases doesn't really matter, right (Aa is the same as aA)?
Sorry I wasn't clear with the 2nd question. Say I have a genotype called as 1|0. This is the same as 0|1 right? Also, from the example you provided above, supposing one of the reads was horrible (we aren't sure if the called G is really a G), then instead of having a "phased" AG genotype we would have an "unphased" AG genotype?