How are features extracted and encoded from a genotype matrix / VCF file?
8.3 years ago
William ★ 5.3k

How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?

Are genotypes encoded as:

HOM_REF   = 0
HET       = 1
HOM_ALT   = 2

This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT, which are at distance 2 from each other). But what is done with missing genotypes? Are they set to -1, -99, or something similar?
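To make the first option concrete, here is a minimal sketch of the 0/1/2 encoding applied to VCF-style GT strings. It assumes biallelic, diploid calls, and the -1 sentinel for missing genotypes is just one of the choices mentioned above, not a standard. The function name `encode_additive` is hypothetical.

```python
# Sketch only: additive (0/1/2) encoding of biallelic, diploid GT strings.
# MISSING = -1 is an assumed sentinel; NaN or imputation are other options.
MISSING = -1

def encode_additive(gt):
    """Return the ALT allele count (0, 1 or 2), or MISSING for an unusable call."""
    # Treat phased ("|") and unphased ("/") separators the same way.
    alleles = gt.replace("|", "/").split("/")
    # Anything that is not a complete diploid call is treated as missing.
    if len(alleles) != 2 or "." in alleles:
        return MISSING
    # Count alleles that are not the reference allele "0".
    return sum(1 for a in alleles if a != "0")

genotypes = ["0/0", "0/1", "1|1", "./.", "1/0"]
encoded = [encode_additive(gt) for gt in genotypes]
print(encoded)  # [0, 1, 2, -1, 1]
```

A sentinel like -1 has to be masked or imputed before any arithmetic, otherwise it would be treated as a genotype value lying below HOM_REF.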

Or is one-hot encoding used per variant to encode the 4 possible genotype states?

MISSING   = [1000]
HOM_REF   = [0100]
HET       = [0010]
HOM_ALT   = [0001]

This loses the measure of distance between the genotypes but includes the missing genotype.
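As an illustration of the one-hot option, the sketch below expands each variant into four indicator features, using the same state order as listed above. The names `STATES`, `one_hot` and `hamming` are hypothetical. It also shows why the distance information is lost: every pair of distinct states ends up equally far apart.

```python
# Sketch only: one-hot encoding of four genotype states per variant.
STATES = ["MISSING", "HOM_REF", "HET", "HOM_ALT"]

def one_hot(state):
    """Return a 4-element indicator vector for one genotype state."""
    vec = [0] * len(STATES)
    vec[STATES.index(state)] = 1
    return vec

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# One sample with four variants becomes a flat vector of 16 features.
sample = ["HOM_REF", "HET", "MISSING", "HOM_ALT"]
features = [bit for state in sample for bit in one_hot(state)]

# All distinct states are equidistant, so HOM_REF is no closer to HET
# than it is to HOM_ALT.
d_ref_het = hamming(one_hot("HOM_REF"), one_hot("HET"))
d_ref_alt = hamming(one_hot("HOM_REF"), one_hot("HOM_ALT"))
print(d_ref_het, d_ref_alt)  # 2 2
```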

Is the 0,1,2 or the one-hot encoded matrix then converted to a sparse matrix to save disk/memory storage and computation cost?

I.e., only storing the HET and HOM_ALT genotypes as (index, value) tuples, assuming the rest is HOM_REF. This can save 90% of the disk and memory storage.

In the case of the 0,1,2 encoded matrix, would a sparse matrix be problematic because you can't differentiate between MISSING and HOM_REF?
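One possible way around the MISSING versus HOM_REF ambiguity is sketched below: store every non-HOM_REF entry, including missing calls, as an explicit (index, value) tuple, so that only HOM_REF is implied by absence. The -1 code for missing and the helper names `to_sparse` and `to_dense` are assumptions made for illustration.

```python
# Sketch only: sparse (index, value) storage for one sample's 0/1/2 row.
# HOM_REF is the implicit default; MISSING is stored explicitly as -1
# (an assumed sentinel) so it cannot be confused with HOM_REF.
HOM_REF = 0
MISSING = -1

def to_sparse(row):
    """Keep only entries that differ from HOM_REF as (index, value) tuples."""
    return [(i, v) for i, v in enumerate(row) if v != HOM_REF]

def to_dense(pairs, n_variants):
    """Rebuild the full row, filling unstored positions with HOM_REF."""
    row = [HOM_REF] * n_variants
    for i, v in pairs:
        row[i] = v
    return row

row = [0, 0, 1, 0, -1, 0, 2, 0, 0, 0]
sparse = to_sparse(row)
print(sparse)  # [(2, 1), (4, -1), (6, 2)]

# The round trip recovers the original row, missing call included.
restored = to_dense(sparse, len(row))
```

The same idea should carry over to sparse matrix libraries that treat unstored entries as zero, as long as missing genotypes are given a nonzero code. The storage saving then depends on how rare non-reference and missing calls are in the data.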

vcf statistics feature extraction • 2.3k views
2.8 years ago
P ▴ 10

Hi, did you ever get an answer to this question? I am trying to do something similar. Thanks!
