GTEx genotype data
8.3 years ago

Hi, I am working with the GTEx data and downloaded this file:

OMNI_2.5M_5M_450Indiv_chr1to22_phased_genot_imput_info04_maf01_HWEp1E6_ConstrVarIDs.vcf

I want to get the genotypes for specific samples and encode them as 0, 1, 2. I found that the GT field of this VCF has a lot of missing values.

My question is: how should I deal with these missing values when converting to a genotype matrix? Is there an easier way to extract the GTEx genotype data?

1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.742;IMPINFO=0.435;CERTAINTY=0.847;TYPE=0;MISS=0.8067;HW=0.24 GT:GL:DS .|.:0.006,0.483,0.510:1.504 .|.:0.047,0.365,0.589:1.542 0|1:0.013,0.960,0.028:1.015 .|.:0.064,0.487,0.449:1.385 1|1:0.000,0.041,0.959:1.959 .|.:0.612,0.379,0.010:0.398 .|.:0.002,0.149,0.850:1.848 .|.:0.007,0.485,0.508:1.501 .|.:0.031,0.289,0.681:1.650 .|.:0.003,0.207,0.789:1.786 .|.:0.007,0.488,0.504:1.497 .|.:0.009,0.217,0.774:1.765 .|.:0.003,0.260,0.736:1.733 .|.:0.252,0.508,0.240:0.988 .|.:0.276,0.508,0.217:0.941 .|.:0.084,0.:|1:0.065,0.904,0.031:0.966 0|0:0.993,0.007,0.000:0.007 .|.:0.488,0.445,0.066:0.578 0|1:0.046,0.947,0.008:0.962 .|.:0.733,0.265,0.003:0.270 .|.:0.806,0.192,0.002:0.196 .|.:0.637,0.357,0.007:0.370 .|.:0.693,0.303,0.003:0.310 .|.:0.014,0.397,0.588:1.574 .|.:0.743,0.245,0.012:0.269 0|0:0.958,0.041,0.000:0.042 .|.:0.870,0.128,0.002:0.132 0|0:0.958,0.041,0.000:0.042 .|.:0.611,0.354,0.036:0.425 .|.:0.760,0.226,0.014:0.254 .|.:0.843,0.156,0.002:0.159

SNP • 2.9k views

You could first convert the VCF file to PLINK files and then extract or remove the missing values. One downside of this is that you are going to lose the multi-allelic variants...


For someone who is not familiar with GTEx data, this question is totally unclear. Could you please specify:

  • where is the information about the genotype in your file?
  • what do you define as a genotype matrix?

I think your problem can easily be solved with grep/sed, by the way. Also, beware of posting sensitive (i.e., human) data on the internet if you obtained it through controlled access via a user login to the dbGaP portal. In that case, you can produce an input file that looks like the one you want to analyze but contains no real data (for your own security).
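To make the first question concrete: in the record posted above, the ninth column (GT:GL:DS) names the per-sample fields, and each later column holds one sample's values in that order. Below is a minimal Python sketch of pulling one field for chosen samples. The function name `sample_field` and the shortened example record are illustrative only, and it assumes whitespace-separated columns with no spaces inside any field.

```python
def sample_field(vcf_line, sample_indices, key="GT"):
    """Return one FORMAT field (e.g. GT or DS) for selected samples of a VCF data line.

    sample_indices are 0-based positions among the sample columns
    (column 10 of the VCF line is sample 0).
    """
    cols = vcf_line.split()          # works for tab- or space-separated lines
    fmt_keys = cols[8].split(":")    # e.g. ["GT", "GL", "DS"]
    pos = fmt_keys.index(key)        # position of the requested field
    samples = cols[9:]
    return [samples[i].split(":")[pos] for i in sample_indices]


# Shortened version of the record in the question (first three samples only,
# INFO column abbreviated).
line = (
    "1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.742;MISS=0.8067 GT:GL:DS "
    ".|.:0.006,0.483,0.510:1.504 "
    "0|1:0.013,0.960,0.028:1.015 "
    "1|1:0.000,0.041,0.959:1.959"
)

print(sample_field(line, [1, 2]))          # GT strings of samples 1 and 2
print(sample_field(line, [0], key="DS"))   # DS value of sample 0
```

To select samples by ID rather than by position, the indices would have to be looked up in the `#CHROM` header line of the file, which is not shown in the post.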


You can filter out low-quality SNPs. That might remove most cases with missing genotypes. The remaining ones you can mark as missing data (NA or -1, say). Most statistical analyses you would run on the resulting data should take the missing data into account. (For example, if I were to do an eQTL analysis, I would use MatrixEQTL to carry out a linear-regression-based analysis, which handles missing data.)
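A rough Python sketch of this approach, under some loudly stated assumptions: that the MISS entry in the INFO column is the fraction of missing genotype calls for the variant (the VCF header is not shown in the post, so this is a guess), that variants are biallelic, and that a cutoff of 0.1 is reasonable (an arbitrary illustrative value). The helper names are hypothetical.

```python
import math


def gt_to_dosage(gt):
    """Count ALT alleles in a biallelic GT string; NaN if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return math.nan
    return sum(1 for a in alleles if a != "0")


def info_value(info, key):
    """Return the numeric value of KEY=value in an INFO string, or None if absent."""
    for item in info.split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return float(value)
    return None


def encode_line(vcf_line, max_missing=0.1):
    """Encode one VCF data line as 0/1/2/NaN, or return None if the variant is filtered.

    Assumption: INFO/MISS is the per-variant fraction of missing calls.
    """
    cols = vcf_line.split()
    miss = info_value(cols[7], "MISS")
    if miss is not None and miss > max_missing:
        return None                          # drop variants with too many missing calls
    gt_pos = cols[8].split(":").index("GT")
    return [gt_to_dosage(s.split(":")[gt_pos]) for s in cols[9:]]


# Shortened version of the record in the question (three samples, abbreviated INFO).
line = (
    "1 30923 1_30923_G_T_b37 G T . PASS IMPINFO=0.435;MISS=0.8067 GT:GL:DS "
    ".|.:0.006,0.483,0.510:1.504 "
    "0|1:0.013,0.960,0.028:1.015 "
    "1|1:0.000,0.041,0.959:1.959"
)

print(encode_line(line))                   # variant is filtered at the default cutoff
print(encode_line(line, max_missing=1.0))  # kept; missing GT becomes NaN
```

Stacking the lists returned for the retained variants gives a variants-by-samples matrix with NaN for missing calls. The record in the question also carries a DS value for every sample, including those with a missing GT; if that field is an imputed dosage, using it in place of the hard calls is a separate choice this sketch does not make.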
