I want to cluster HapMap project data using EIGENSTRAT. Currently, I am having difficulty creating the genotype file. The EIGENSTRAT manual says: "The genotype file contains 1 line per SNP. Each line contains 1 character per individual: 0 means zero copies of reference allele. 1 means one copy of reference allele. 2 means two copies of reference allele. 9 means missing data." Below is one row of my huge data set.
rs4475691 C/T chr1 836671 CT CC CC CC CC CT CC TT CC CC CC NN (and so on...)
1st column: snp id
2nd column: alleles
3rd column: chromosome
4th column: position
The remaining columns are the patients' genotypes. I know NN stands for missing data and should be encoded as 9 in the EIGENSTRAT format, but I am not sure about CT, CC and TT.
I am not sure what exactly you need. Do you need code to convert this into EIGENSTRAT format?
If the question is how to recode CC, TT and CT, then you choose one allele as the reference. You could choose the most frequent one, for instance, or, here, take the first allele listed in your line. That is my guess, at least; I do not know what format this is.
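The two ways of picking a reference allele mentioned above can be compared on the example row from the question. This is only a minimal sketch: the helper name `most_frequent_allele` is made up for illustration, and it assumes the column layout described in the question (genotypes start at the fifth field, `N` marks a missing allele).

```python
from collections import Counter


def most_frequent_allele(genotypes, missing="N"):
    """Return the allele seen most often across all genotype calls.

    Hypothetical helper: counts every non-missing allele character.
    Returns None if every call is missing.
    """
    counts = Counter(a for g in genotypes for a in g if a != missing)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Example row taken from the question (truncated there with "and so on").
row = "rs4475691 C/T chr1 836671 CT CC CC CC CC CT CC TT CC CC CC NN".split()

first_listed = row[1].split("/")[0]          # option 1: first allele in "C/T"
most_common = most_frequent_allele(row[4:])  # option 2: most frequent allele

print(first_listed, most_common)
```

On this row both choices give the same allele, but that will not hold for every SNP, so it is worth deciding on one rule and applying it consistently.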
Basically, 0, 1 and 2 are the number of non-reference alleles in the genotype (the reference is chosen arbitrarily; it could be the other allele). The idea is to create a "quantitative" trait for each SNP and apply a PCA-based analysis.
Hanif: you are right and I am sadly wrong. It was a "typo", in the sense that C is the reference allele.
Sorry about that. I am going to vote -1 on my own message.
On the other hand, for the PCA and clustering here, I'd tend to say that the order is not so important, but better be straight and do things
The answer from genotepes is fine. Hanif's comment is also OK, but we don't really know from the limited info which is the true reference allele and which is the derived.
In the case of EIGENSTRAT, any heterozygous genotype will be coded as 1, because it has one copy of the reference allele and one copy of the derived allele.
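To illustrate why the choice of reference allele matters little for a PCA: swapping the reference turns every code g into 2 - g, so heterozygotes stay at 1 and only 0 and 2 trade places. The sketch below assumes a PCA that centers each SNP on its mean before computing components (the scaling step of any particular program is not modelled here), and uses the non-missing calls from the example row in the question.

```python
def center(codes):
    """Subtract the per-SNP mean from each genotype code."""
    mean = sum(codes) / len(codes)
    return [c - mean for c in codes]


# Non-missing calls from the example row, counted as copies of C.
ref_c = [1, 2, 2, 2, 2, 1, 2, 0, 2, 2, 2]

# The same calls counted as copies of T instead: g -> 2 - g.
ref_t = [2 - g for g in ref_c]

centered_c = center(ref_c)
centered_t = center(ref_t)

# After centering, the two encodings differ only by a sign flip.
sign_flipped = all(abs(a + b) < 1e-12 for a, b in zip(centered_c, centered_t))
print(sign_flipped)
```

A sign flip on one SNP does not change the variance it contributes, which is why the clustering should come out the same either way. Following the manual's convention is still the safer habit.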
Hello,
Maybe my question is simple or silly, but I need to ask how I should identify the genotypes of individuals. I am doing an association study, and for it I have obtained the sequence of my gene of interest for each individual. I have done the SNP analysis, and now I don't know the next step for genotyping. Could you please assist me?
No, I am not looking for code. I didn't get the idea behind encoding genotypes as 0, 1 or 2. For instance, why did you set genotype CT to 1?
Yes? CT is set to 1.
Christian
Actually, since it's a C/T SNP: CC = 2, CT = 1, TT = 0, NN = 9.
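For anyone who does want code, the mapping above can be sketched as follows. It assumes the first allele in the `C/T` column is treated as the reference and that any call containing `N` is missing; the function name `encode_row` is made up for illustration.

```python
def encode_row(line, missing_code="9"):
    """Turn one HapMap-style row into one EIGENSTRAT genotype line.

    Assumption: the first allele listed in the alleles column is the
    reference, and each code is the number of copies of that allele.
    """
    fields = line.split()
    ref = fields[1].split("/")[0]      # e.g. "C" from "C/T"
    codes = []
    for genotype in fields[4:]:        # genotypes start at the 5th column
        if "N" in genotype:
            codes.append(missing_code)             # missing call -> 9
        else:
            codes.append(str(genotype.count(ref)))  # 0, 1 or 2 copies
    return "".join(codes)


line = "rs4475691 C/T chr1 836671 CT CC CC CC CC CT CC TT CC CC CC NN"
encoded = encode_row(line)
print(encoded)  # 122221202229
```

The output has one character per individual and no separators, which matches the manual's description of the genotype file quoted in the question.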