One-hot encoding for PLINK or VCF
3.0 years ago

I want to write an autoencoder for SNP data. Is there an established way to one-hot encode binary PLINK or VCF input? I believe it can be done by manipulating PLINK's .bed file, but I'm afraid of getting something wrong.

By one-hot encoding I mean: MISSING = [1000], HOM_REF = [0100], HET = [0010], HOM_ALT = [0001].
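To make the target encoding concrete, here is a minimal sketch in plain Python. It assumes biallelic variants and VCF-style GT strings such as `0/1` or `./.`; the helper names (`category_from_gt`, `one_hot`, `encode_sample`) are made up for illustration and are not part of PLINK or any VCF library.

```python
# Column order follows the definition above: MISSING, HOM_REF, HET, HOM_ALT.
CATEGORIES = ("MISSING", "HOM_REF", "HET", "HOM_ALT")


def category_from_gt(gt):
    """Classify a VCF GT string (e.g. '0/1', '1|1', './.'); assumes biallelic sites."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "MISSING"
    alt_count = sum(allele != "0" for allele in alleles)
    if alt_count == 0:
        return "HOM_REF"
    if alt_count == len(alleles):
        return "HOM_ALT"
    return "HET"


def one_hot(category):
    """Return a 4-element list with a single 1 at the category's position."""
    vector = [0] * len(CATEGORIES)
    vector[CATEGORIES.index(category)] = 1
    return vector


def encode_sample(gt_strings):
    """Concatenate the one-hot vectors of all SNPs for one sample."""
    flat = []
    for gt in gt_strings:
        flat.extend(one_hot(category_from_gt(gt)))
    return flat


print(encode_sample(["0/0", "0|1", "./.", "1/1"]))
```

Each SNP contributes four input variables, so a sample with N SNPs becomes a vector of length 4N.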

Thanks!

one-hot-encoding plink vcf • 1.7k views

Why 4 bits? PLINK's internal format uses 2 bits per genotype: 00, 01, 10, and 11. I can't remember whether 01 or 10 is missing.
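For reference, the PLINK 1 .bed specification defines the codes as 00 = homozygous for the first (A1) allele in the .bim file, 01 = missing, 10 = heterozygous, and 11 = homozygous for the second (A2) allele, with four genotypes packed per byte starting from the low-order bits. A rough sketch of decoding a single data byte, where `decode_byte` is a hypothetical helper rather than a PLINK function:

```python
# 2-bit genotype codes as defined by the PLINK 1 .bed format.
CODE_TO_LABEL = {
    0b00: "HOM_A1",
    0b01: "MISSING",
    0b10: "HET",
    0b11: "HOM_A2",
}


def decode_byte(byte, n_genotypes=4):
    """Decode up to four genotypes from one data byte, low-order bits first."""
    return [CODE_TO_LABEL[(byte >> (2 * i)) & 0b11] for i in range(n_genotypes)]


# 0b11100100 read in 2-bit pairs from the right is 00, 01, 10, 11.
print(decode_byte(0b11100100))
```

Whether HOM_A1 corresponds to HOM_REF or HOM_ALT depends on how the fileset was created, so that mapping is left open here. A real reader also has to skip the three header bytes and handle the padding at the end of each variant's block.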


Yes, but I need to use it to train a neural network, so I need four separate variables for the four categories of a single SNP.


I see. Then I don't know of anything like that. Out of interest, couldn't you also represent the same data with 000, 001, 010, and 100? Though I guess a width that is a multiple of 2 might be more efficient.
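A small sketch of that three-column variant, in which a missing genotype is the all-zero row instead of getting its own column. The column order and the name `encode_three` are arbitrary choices for illustration.

```python
# Three indicator columns; MISSING has no column and encodes as all zeros.
COLUMNS = ("HOM_REF", "HET", "HOM_ALT")


def encode_three(category):
    """Return [1,0,0], [0,1,0], or [0,0,1]; any other category gives [0,0,0]."""
    return [int(category == name) for name in COLUMNS]


for category in ("MISSING", "HOM_REF", "HET", "HOM_ALT"):
    print(category, encode_three(category))
```

This saves one input per SNP, at the cost of the network no longer having an explicit "missing" indicator.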

3.0 years ago
carlk ▴ 40

Here is what we do for FaST-LMM, which fits linear mixed models, a generalization of linear regression. We treat the values as 0, 1, 2 (the count of A1 alleles), then we standardize (for each variant, set the mean to 0 and the s.d. to 1). Finally, we set any missing values to 0.

Put another way, we treat the values as real-valued input, not as two binary inputs. I think these steps are common in the field. It was my impression that most neural nets are happy with real-valued input.
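The three steps above can be sketched for a single variant as follows. This is a plain-Python illustration, not FaST-LMM's actual code: it assumes missing values are represented as `None`, uses the population standard deviation, and returns all zeros for a variant with no variation. `standardize_variant` is a hypothetical name.

```python
import math


def standardize_variant(counts):
    """Standardize one variant's A1 allele counts (0, 1, 2); None marks missing."""
    observed = [c for c in counts if c is not None]
    if not observed:
        return [0.0] * len(counts)

    # Mean and s.d. are computed from the observed (non-missing) values only.
    mean = sum(observed) / len(observed)
    sd = math.sqrt(sum((c - mean) ** 2 for c in observed) / len(observed))
    if sd == 0:
        return [0.0] * len(counts)

    # Missing values become 0, which is the mean after standardization.
    return [0.0 if c is None else (c - mean) / sd for c in counts]


print(standardize_variant([0, 2, None, 1, 1]))
```

Setting missing values to 0 after standardizing is the same as imputing the variant's mean, so the network sees one real-valued input per SNP instead of four binary ones.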

If you're working in Python, we have created bed-reader for reading PLINK .bed files. Under the covers, it uses a fast multi-threaded Rust engine. It supports all Python indexing methods, and you can slice the data by individuals (samples) and/or by variants.
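For the one-hot encoding asked about in the question, a rough sketch on top of bed-reader could look like the following. It relies on `open_bed(...).read(dtype="int8")`, which returns A1 allele counts with missing values marked as -127. The helper names are invented here, NumPy and bed-reader need to be installed, and which of the count columns means HOM_REF depends on whether A1 is the reference allele in your fileset.

```python
MISSING_INT8 = -127  # bed-reader's marker for missing values when dtype="int8"


def one_hot_from_counts(counts):
    """Map A1 allele counts to one-hot rows: [MISSING, 0 copies, 1 copy, 2 copies]."""
    import numpy as np

    counts = np.asarray(counts)
    # Missing -> column 0; counts 0, 1, 2 -> columns 1, 2, 3.
    index = np.where(counts == MISSING_INT8, 0, counts + 1)
    return np.eye(4, dtype=np.uint8)[index]


def read_one_hot(bed_path):
    """Read a PLINK .bed file into an array of shape (samples, variants, 4)."""
    from bed_reader import open_bed

    with open_bed(bed_path) as bed:
        return one_hot_from_counts(bed.read(dtype="int8"))
```

Calling `read_one_hot("my_data.bed")` (a placeholder path) loads the whole file at once; for large datasets it is probably better to pass an index to `bed.read` and encode the data in batches.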

  • Carl