I want to write an autoencoder for SNP data. Is there an established way to one-hot encode binary PLINK or VCF input? I believe it can be done by manipulating PLINK's .bed file, but I am afraid of doing something wrong.
By one-hot encoding I mean:
MISSING = [1000]
HOM_REF = [0100]
HET = [0010]
HOM_ALT = [0001]
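To make the target encoding concrete, here is a minimal NumPy sketch. It assumes the genotypes are already available as integer alternate-allele counts (0, 1, 2), with -1 as a hypothetical sentinel for missing calls. The function name and the sentinel are illustrative and not part of PLINK or any VCF library.

```python
import numpy as np

# Column order follows the question: MISSING, HOM_REF, HET, HOM_ALT.
MISSING_CODE = -1  # hypothetical sentinel for a missing genotype


def one_hot_genotypes(calls):
    """Map calls in {-1, 0, 1, 2} to one-hot vectors of length 4.

    Input shape (samples, variants) gives output shape (samples, variants, 4).
    """
    calls = np.asarray(calls, dtype=np.int64)
    if not np.isin(calls, [MISSING_CODE, 0, 1, 2]).all():
        raise ValueError("calls must contain only -1, 0, 1 or 2")
    # Shift by one so that -1 -> column 0 (MISSING), 0 -> 1, 1 -> 2, 2 -> 3,
    # then pick the matching rows of a 4x4 identity matrix.
    return np.eye(4, dtype=np.uint8)[calls + 1]


calls = np.array([[0, 1],
                  [2, MISSING_CODE]])
encoded = one_hot_genotypes(calls)
```

Flattening the last two axes (`encoded.reshape(len(calls), -1)`) gives four input units per SNP for a dense autoencoder.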
I see. Then I don't know if there's anything like that. Out of interest, can't you also represent the same data with 000, 001, 010 and 100? Though I guess being a multiple of 2 might be more efficient.
Here is what we do for FaST-LMM, which fits linear mixed models, a generalization of linear regression. We treat the values as 0, 1, 2 (the count of A1 alleles), then we standardize (for each variant, set the mean to 0 and the standard deviation to 1). Finally, we set any missing values to 0.
Put another way, we treat the values as real-valued input, not as two binary inputs. I think these steps are common in the field. It was my impression that most neural nets are happy with real-valued input.
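The steps described here (count, standardize per variant, fill missing with 0) can be sketched in NumPy as follows. This is an illustration and not FaST-LMM's actual code: it assumes missing values are stored as NaN, uses the population standard deviation (ddof=0), and leaves variants with zero variance at 0. All three are assumptions on my part.

```python
import numpy as np


def standardize_genotypes(counts):
    """Standardize allele counts per variant (column) and zero-fill missing.

    counts: array of shape (samples, variants) with values 0, 1, 2 or NaN.
    """
    counts = np.asarray(counts, dtype=np.float64)
    mean = np.nanmean(counts, axis=0)   # per-variant mean, ignoring NaN
    std = np.nanstd(counts, axis=0)     # per-variant s.d., ignoring NaN
    std[std == 0] = 1.0                 # avoid dividing by zero for constant variants
    z = (counts - mean) / std
    z[np.isnan(z)] = 0.0                # missing values become the (new) mean, 0
    return z


counts = np.array([[0.0, 2.0],
                   [1.0, np.nan],
                   [2.0, 0.0]])
z = standardize_genotypes(counts)
```

Setting missing values to 0 after standardizing is the same as imputing the per-variant mean.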
If you're working in Python, we have created bed-reader for reading PLINK .bed files. Under the covers, it uses a fast multi-threaded Rust engine. It supports all Python indexing methods, and you can slice the data by individuals (samples) and/or by variants.
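A small sketch of reading a slice with bed-reader, based on my understanding of its documented `open_bed(...)` and `.read(index=..., dtype=...)` interface. The wrapper name `read_bed_block` and the file path in the comment are hypothetical. As far as I know, requesting `dtype="int8"` makes bed-reader report missing genotypes as -127, but please check that against the docs for your version.

```python
import numpy as np


def read_bed_block(bed_path, samples=slice(None), variants=slice(None)):
    """Read a (samples x variants) block of A1 allele counts as int8.

    Hypothetical convenience wrapper around bed-reader (pip install bed-reader).
    """
    from bed_reader import open_bed  # imported lazily so the helper is optional

    with open_bed(bed_path) as bed:
        # np.s_ builds the 2-D index: first axis samples, second axis variants.
        return bed.read(index=np.s_[samples, variants], dtype="int8")


# Example call (path is a placeholder):
# block = read_bed_block("my_data.bed", samples=slice(0, 100), variants=slice(0, 1000))
```

The returned integer array can then be mapped to four one-hot columns per SNP, with -127 going to the MISSING column.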
Why 4 bits? PLINK's internal format uses 2 bits per genotype: 00, 01, 10, and 11. If I recall the PLINK 1 .bed layout correctly, 01 is the missing code and 10 is heterozygous.
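For reference, here is a hedged sketch of decoding that 2-bit layout by hand, following the PLINK 1 .bed format as I understand it: a 3-byte header `0x6c 0x1b 0x01` (variant-major), then four genotypes per byte starting from the lowest two bits, with 00 = homozygous A1, 01 = missing, 10 = heterozygous, 11 = homozygous A2. The function name and the -1 missing sentinel are illustrative. For real data a tested reader is the safer choice.

```python
import numpy as np

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 header, variant-major mode

# Index = 2-bit code (0b00, 0b01, 0b10, 0b11); value = count of A1 alleles.
# -1 is a hypothetical sentinel for the missing code 0b01.
CODE_TO_A1_COUNT = np.array([2, -1, 1, 0], dtype=np.int8)


def decode_bed_bytes(raw, n_samples, n_variants):
    """Decode the raw contents of a .bed file into a (samples x variants) array."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # each variant is padded to whole bytes
    body = np.frombuffer(raw, dtype=np.uint8, offset=3)
    if body.size != bytes_per_variant * n_variants:
        raise ValueError("file size does not match the given dimensions")
    body = body.reshape(n_variants, bytes_per_variant)
    # Split every byte into four 2-bit codes, lowest bits first.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (body[:, :, None] >> shifts) & 0b11
    codes = codes.reshape(n_variants, -1)[:, :n_samples]  # drop padding codes
    return CODE_TO_A1_COUNT[codes].T


# One variant, five samples with codes 00, 01, 10, 11, 10 (sample 0 first).
raw = BED_MAGIC + bytes([0b11100100, 0b00000010])
a1_counts = decode_bed_bytes(raw, n_samples=5, n_variants=1)
```

Whether A1 is the reference or the alternate allele depends on how the fileset was created, so mapping these counts to HOM_REF and HOM_ALT needs a check against the .bim file.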
Yes, but I need to use it to train a neural network, so I need four separate variables for the four categories of a single SNP.