Question

One-hot encoding for PLINK or VCF

3.0 years ago

dmitry.s.kolobkov • 0

I want to write an autoencoder for SNP data. Is there an established way to one-hot encode binary PLINK or VCF input? I believe it can be done by manipulating PLINK's .bed file directly, but I'm afraid of getting something wrong.

By one-hot encoding I mean:

MISSING = [1000]
HOM_REF = [0100]
HET = [0010]
HOM_ALT = [0001]

Thanks!

one-hot-encoding plink vcf • 1.7k views

Updated 3.0 years ago by carlk • written 3.0 years ago by dmitry.s.kolobkov

Why 4 bits? PLINK's internal format uses 2 bits per genotype: 00, 01, 10, and 11. If I remember correctly, 01 is the code for missing.
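For reference, here is a minimal sketch of how those 2-bit codes can be unpacked, assuming the variant-major PLINK 1 .bed layout as I understand it from the PLINK file-format documentation (four samples per byte, lowest-order bit pair first; 00 = homozygous A1, 01 = missing, 10 = heterozygous, 11 = homozygous A2). The function name `unpack_bed_bytes` is made up for illustration, and the sketch skips the three-byte file header and does no validation.

```python
import numpy as np

# Assumed PLINK 1 .bed two-bit codes:
# 0b00 = homozygous A1, 0b01 = missing, 0b10 = heterozygous, 0b11 = homozygous A2

def unpack_bed_bytes(packed, n_samples):
    """Unpack the bytes of ONE variant into an array of 2-bit codes (0..3).

    `unpack_bed_bytes` is a hypothetical helper, not part of PLINK or bed-reader.
    """
    packed = np.asarray(packed, dtype=np.uint8)
    # Each byte holds four samples; the lowest-order bit pair is the first sample.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    # The last byte may be padded, so trim to the real number of samples.
    return codes.reshape(-1)[:n_samples]

# The byte 0b11100100 holds the codes 00, 01, 10, 11 for samples 0..3.
example = unpack_bed_bytes([0b11100100], n_samples=4)
print(example.tolist())  # [0, 1, 2, 3]
```

In practice a tested reader is safer than hand-rolled bit manipulation; the sketch is only meant to show where the four categories come from.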

Reply • 3.0 years ago by Sam

Yes, but I need it to train a neural network, so I need four separate variables for the four categories of a single SNP.

Reply • 3.0 years ago by dmitry.s.kolobkov

I see. In that case I don't know of anything that does that. Out of interest, couldn't you represent the same data with three bits: 000, 001, 010, and 100? Though I guess a width that is a multiple of 2 might be more efficient.

Reply • 3.0 years ago by Sam

score 1 · Answer 1 · 2021-12-08

Here is what we do for FaST-LMM, which does linear mixed models, a generalization of linear regression. We treat the values as 0, 1, 2 (the count of the A1 alleles), then we standardize (for each variant, set the mean to 0 and the s.d. to 1). Finally, we set any missing values to 0, which after standardization is the same as filling them with the variant's mean.

Put another way, we treat the values as real-valued input, not as two binary inputs. I think these steps are common in the field. It was my impression that most neural nets are happy with real-valued input.
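The steps above can be sketched in NumPy roughly as follows. This is an illustration, not the actual FaST-LMM code: the function name `standardize_genotypes` is made up, missing values are assumed to be stored as NaN, the population standard deviation (ddof=0) is my assumption, and so is the handling of variants with zero variance.

```python
import numpy as np

def standardize_genotypes(counts):
    """Standardize a (samples x variants) matrix of 0/1/2 allele counts.

    Hypothetical helper for illustration; missing values are assumed to be NaN.
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Per-variant mean and s.d., ignoring missing values.
    mean = np.nanmean(counts, axis=0)
    std = np.nanstd(counts, axis=0)
    # Assumption: leave constant (monomorphic) variants at 0 instead of dividing by 0.
    std[std == 0] = 1.0
    z = (counts - mean) / std
    # After standardization 0 is the variant mean, so this is mean imputation.
    z[np.isnan(z)] = 0.0
    return z

counts = [[0, 2],
          [1, np.nan],
          [2, 2]]
z = standardize_genotypes(counts)
```

Here the first column ends up with mean 0 and s.d. 1, while the second column (constant apart from one missing value) becomes all zeros.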

If you're working in Python, we have created bed-reader for reading PLINK bed files. Under the covers, it uses a fast multi-threaded Rust engine. It supports all Python indexing methods, and you can slice data by individuals (samples) and/or by variants.
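If you do want the four-way one-hot encoding from the question, a minimal sketch on top of such a 0/1/2 matrix could look like this. It assumes missing values are NaN and that an allele count of 0 corresponds to HOM_REF and 2 to HOM_ALT; whether the counted A1 allele really is the alternate allele depends on how the PLINK files were produced, so check that for your data. The function name `one_hot_genotypes` is made up for illustration.

```python
import numpy as np

def one_hot_genotypes(counts):
    """Map a (samples x variants) matrix of 0/1/2 (NaN = missing)
    to a (samples x variants x 4) one-hot array.

    Channel order follows the question:
    MISSING = [1000], HOM_REF = [0100], HET = [0010], HOM_ALT = [0001].
    Hypothetical helper for illustration.
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Class index: 0 = MISSING, otherwise allele count + 1.
    classes = np.where(np.isnan(counts), 0, counts + 1).astype(np.int64)
    # Row i of the 4x4 identity matrix is the one-hot vector for class i.
    return np.eye(4, dtype=np.uint8)[classes]

# With bed-reader, the input matrix could come from something like
#   from bed_reader import open_bed
#   counts = open_bed("some_file.bed").read()
# where "some_file.bed" is a placeholder path.
counts = [[0, 1],
          [2, np.nan]]
encoded = one_hot_genotypes(counts)
print(encoded.shape)  # (2, 2, 4)
```

For a dense network you can then flatten the last two axes with `encoded.reshape(encoded.shape[0], -1)`.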

Carl

Carl Kadie, Ph.D.

FaST-LMM & PySnpTools Team

(Microsoft Research, retired)

https://www.linkedin.com/in/carlk/