I want to know why it is considered correct to transform SNP data to 0, 1, 2 format using a reference allele. For example, for SNP1 with C/T alleles, the transformation rules would be CC = 2, CT = 1, TT = 0, with the goal of later applying machine learning algorithms to predict a specific trait.
I ask this because giving these ordinal values to SNP data may greatly affect the result of a classification model, since in a way we are giving "more importance" to the diploid genotype CC, with the bigger value of 2, than to the genotype TT, with a value of 0.
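Just so we are talking about the same thing, this is the transformation I mean, as a minimal pandas sketch (the column names are made up):

```python
import pandas as pd

# Hypothetical genotype column, stored as two-character strings
df = pd.DataFrame({"SNP1": ["CC", "CT", "TC", "TT"]})

# Additive encoding: count the copies of the reference allele (here C)
df["SNP1_additive"] = df["SNP1"].str.count("C")
print(df)
#   SNP1  SNP1_additive
# 0   CC              2
# 1   CT              1
# 2   TC              1
# 3   TT              0
```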
Wouldn't it be better and more correct to transform the data into a binary format, where each SNP feature is expanded into four binary features: SNP1_CC, SNP1_CT, SNP1_TC, SNP1_TT? Following this, the sample:
ID  SNP1  SNP2
1   CC    AG
Will be transformed to:
ID  SNP1_CC  SNP1_CT  SNP1_TC  SNP1_TT  SNP2_GG  SNP2_GA  SNP2_AG  SNP2_AA
1   1        0        0        0        0        0        1        0
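In code, this one-hot expansion would be something like the following sketch (using pandas; note that get_dummies only creates columns for genotypes actually present, so getting the full four-column layout above would require fixing the categories first):

```python
import pandas as pd

# Hypothetical single-sample table, as in the example above
df = pd.DataFrame({"ID": [1], "SNP1": ["CC"], "SNP2": ["AG"]})

# One-hot encode the genotype columns into 0/1 indicator features.
# To always get all four columns per SNP, convert each column to a
# Categorical dtype with fixed categories ["CC", "CT", "TC", "TT"] first.
onehot = pd.get_dummies(df[["SNP1", "SNP2"]], dtype=int)
print(pd.concat([df[["ID"]], onehot], axis=1))
#    ID  SNP1_CC  SNP2_AG
# 0   1        1        1
```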
I don't think that transforming it to categorical 0, 1, 2 necessarily means it is ranked 0 < 1 < 2. You could just as well transform it to the categories "donkey" (homozygous reference), "pig" (homozygous variant) and "chicken" (heterozygous). It's a label.
I understand what you are saying, that it is just a relabeling, but what I mean is transforming categorical data to numerical data, so that I can apply ML methods that take numeric data and do not support categorical data. Is it still correct then?
Absolutely not. A 2 is not double the effect of a 1 for a simple dominant trait, and in a simple recessive trait only the genotype encoded as 0 shows the phenotype. A numeric-only ML algorithm will absolutely screw this up.
I'm confused, sorry... So you are saying that (0, 1, 2) as numeric data is an incorrect input for an ML algorithm? And a (0, 1) encoding would be the more appropriate one?
No. Numeric input (0, 1, 2) is incorrect. Categorical input (0, 1, 2) is fine.
I already said that I understand that categorical input (0, 1, 2) is OK, because that would just be relabeling the data. But this is not what I'm asking. I'm asking what kind of numerical transformation of categorical SNP data is best when the ML algorithms applied later accept only numeric input.
There is no single appropriate transformation. You could argue that being homozygous for the most prevalent allele is the least likely to be harmful and could be encoded as 0/neutral. But as John wrote, there are examples of heterozygote advantage over both homozygous types, so there are no good general rules. I like the idea of applying ML to variant data, but you should know that most variants are most likely harmless or of minimal effect... and just add noise.
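To make that concrete, here is a sketch of how the same genotypes look under the classic additive, dominant and recessive models (C taken as the reference allele; names illustrative). Under a dominant or recessive model, the evenly spaced 0/1/2 values are simply the wrong shape:

```python
import pandas as pd

genotypes = pd.Series(["CC", "CT", "TT"], name="genotype")

# Three classic genetic models for the same SNP, C as reference allele
encodings = pd.DataFrame({
    "genotype":  genotypes,
    "additive":  genotypes.map({"CC": 2, "CT": 1, "TT": 0}),  # copies of C
    "dominant":  genotypes.map({"CC": 1, "CT": 1, "TT": 0}),  # at least one C
    "recessive": genotypes.map({"CC": 1, "CT": 0, "TT": 0}),  # two copies of C
})
print(encodings)
```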
Every genetic haplotype has the potential to result in a totally different phenotype. It might be that 0/0 is bad, 0/1 makes you healthier, and 1/1 gives you sickle-cell anaemia.
Moreover, haplotypes in isolation might not make much sense either. 1/1 of allele A might cause cancer, but 1/1 of A together with 1/1 of allele B might cancel each other out, and in the process protect you from other cancers.

Bottom line: some assumptions and simplifications of the real problem will have to be made in your model, and it is more important that you respect the assumptions you made than that you pick the "best" assumption and then pretend your model is the best possible model, without any limitations. What I'm saying is, choosing to turn categorical data into continuous data will sensitise your model to diseases that work that way, and that might be a good thing. Choosing a model where every genotype is its own independent observation might sensitise your model to more complex-trait diseases, and miss more obvious ones.
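As a toy illustration of that interaction point, assuming one-hot indicator columns like the ones discussed above (the SNP names are made up), you can hand a model an explicit interaction feature:

```python
import pandas as pd

# Made-up one-hot indicators: homozygous-variant status at two loci
df = pd.DataFrame({"SNPA_11": [1, 1, 0, 0],
                   "SNPB_11": [1, 0, 1, 0]})

# Explicit interaction feature: 1 only when BOTH loci are 1/1.
# A model fed only the two columns separately can miss this pattern.
df["SNPA_11_x_SNPB_11"] = df["SNPA_11"] * df["SNPB_11"]
print(df)
```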