How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?
Are genotypes encoded as alternate-allele counts, like this?
HOM_REF = 0
HET = 1
HOM_ALT = 2
This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT, which are at distance 2 from each other). But what is done with missing genotypes? Are they set to -1 or -99 or something similar?
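To make the question concrete, here is a minimal sketch of the 0/1/2 scheme as I understand it. It assumes biallelic sites and VCF-style GT strings; the function names `encode_gt` and `mean_impute` are hypothetical, and replacing missing calls with the per-variant mean is only one possible convention, not necessarily what any given tool does.

```python
from typing import List, Optional


def encode_gt(gt: str) -> Optional[int]:
    """Map a VCF GT string to an alternate-allele count (0, 1 or 2).

    Assumes a biallelic site. Returns None for missing calls such as './.'.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    # Count every allele that is not the reference allele "0".
    return sum(1 for a in alleles if a != "0")


def mean_impute(dosages: List[Optional[int]]) -> List[float]:
    """Replace missing values with the mean of the called genotypes.

    This is one assumed way to handle missing data; a sentinel such as
    -1 would instead distort any distance computed on the values.
    """
    called = [d for d in dosages if d is not None]
    mean = sum(called) / len(called) if called else 0.0
    return [float(d) if d is not None else mean for d in dosages]


# One variant across five samples.
gts = ["0/0", "0/1", "1|1", "./.", "0/0"]
dosages = [encode_gt(g) for g in gts]   # [0, 1, 2, None, 0]
imputed = mean_impute(dosages)          # [0.0, 1.0, 2.0, 0.75, 0.0]
```

Keeping the missing call as `None` until the last step means the choice between imputing, masking, or dropping can be made separately from the encoding itself.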
Or is one-hot encoding used per variant to encode the 4 possible genotype states?
MISSING = [1000]
HOM_REF = [0100]
HET = [0010]
HOM_ALT = [0001]
This loses the measure of distance between the genotypes but includes the missing genotype.
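A small sketch of the one-hot alternative, using the same category order as listed above. The names `one_hot`, `encode_sample` and `hamming` are hypothetical, and the input is assumed to be an alternate-allele count with `None` for a missing call.

```python
from typing import List, Optional

# Same order as in the question: MISSING, HOM_REF, HET, HOM_ALT.
CATEGORIES = ("MISSING", "HOM_REF", "HET", "HOM_ALT")


def one_hot(dosage: Optional[int]) -> List[int]:
    """Return a 4-element indicator vector for one genotype call."""
    index = 0 if dosage is None else dosage + 1
    vec = [0] * len(CATEGORIES)
    vec[index] = 1
    return vec


def encode_sample(dosages: List[Optional[int]]) -> List[int]:
    """Concatenate the one-hot vectors of all variants for one sample."""
    features: List[int] = []
    for d in dosages:
        features.extend(one_hot(d))
    return features


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


sample = [0, 1, None, 2]            # HOM_REF, HET, MISSING, HOM_ALT
features = encode_sample(sample)    # 4 variants x 4 columns = 16 features

# Every pair of distinct genotypes is equally far apart, so the
# ordering HOM_REF < HET < HOM_ALT is no longer visible.
d_ref_het = hamming(one_hot(0), one_hot(1))
d_ref_alt = hamming(one_hot(0), one_hot(2))
```

Here `d_ref_het` and `d_ref_alt` are both 2, which is the loss of distance information I mean, and the feature count is four times the number of variants.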
Is the 0/1/2 or the one-hot encoded matrix then converted to a sparse matrix to reduce disk/memory usage and computation cost?
I.e. only the HET and HOM_ALT genotypes are stored as (index, value) tuples, and everything else is assumed to be HOM_REF. When most genotypes are HOM_REF, this could save on the order of 90% of the disk and memory storage.
In the case of the 0/1/2 encoded matrix, wouldn't a sparse matrix be problematic, because you can't differentiate between MISSING and HOM_REF when both are left as implicit entries?
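One way I could imagine working around that ambiguity is to store the missing positions separately from the non-zero genotypes. The sketch below is only an illustration of that idea in plain Python; `to_sparse` and `to_dense` are hypothetical names, and I don't know whether existing tools do it this way.

```python
from typing import List, Optional, Tuple

Sparse = Tuple[List[Tuple[int, int]], List[int], int]


def to_sparse(dosages: List[Optional[int]]) -> Sparse:
    """Store only non-reference calls plus a separate list of missing indices.

    Anything not listed in either structure is implicitly HOM_REF (0).
    """
    nonzero = [(i, d) for i, d in enumerate(dosages) if d not in (None, 0)]
    missing = [i for i, d in enumerate(dosages) if d is None]
    return nonzero, missing, len(dosages)


def to_dense(nonzero: List[Tuple[int, int]],
             missing: List[int],
             n: int) -> List[Optional[int]]:
    """Rebuild the full row from the sparse representation."""
    dense: List[Optional[int]] = [0] * n
    for i, d in nonzero:
        dense[i] = d
    for i in missing:
        dense[i] = None
    return dense


# One variant across ten samples, mostly HOM_REF.
row = [0, 0, 1, 0, None, 0, 2, 0, 0, 0]
nonzero, missing, n = to_sparse(row)     # [(2, 1), (6, 2)], [4], 10
restored = to_dense(nonzero, missing, n)
```

With this layout the round trip is lossless, at the cost of one extra index per missing call, so the saving depends on missing calls being rare as well as non-reference calls.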