Hi, I am working with the GTEx data and downloaded this file:
OMNI_2.5M_5M_450Indiv_chr1to22_phased_genot_imput_info04_maf01_HWEp1E6_ConstrVarIDs.vcf
I want to extract the genotypes for specific samples and code them as 0, 1, 2. I found that the GT field of this VCF has a lot of missing values.
My question is: how should I deal with these missing values when converting to a genotype matrix? Is there an easier way to extract the genotype data from GTEx? Here is an example line:
1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.742;IMPINFO=0.435;CERTAINTY=0.847;TYPE=0;MISS=0.8067;HW=0.24 GT:GL:DS .|.:0.006,0.483,0.510:1.504 .|.:0.047,0.365,0.589:1.542 0|1:0.013,0.960,0.028:1.015 .|.:0.064,0.487,0.449:1.385 1|1:0.000,0.041,0.959:1.959 .|.:0.612,0.379,0.010:0.398 .|.:0.002,0.149,0.850:1.848 .|.:0.007,0.485,0.508:1.501 .|.:0.031,0.289,0.681:1.650 .|.:0.003,0.207,0.789:1.786 .|.:0.007,0.488,0.504:1.497 .|.:0.009,0.217,0.774:1.765 .|.:0.003,0.260,0.736:1.733 .|.:0.252,0.508,0.240:0.988 .|.:0.276,0.508,0.217:0.941 .|.:0.084,0.:|1:0.065,0.904,0.031:0.966 0|0:0.993,0.007,0.000:0.007 .|.:0.488,0.445,0.066:0.578 0|1:0.046,0.947,0.008:0.962 .|.:0.733,0.265,0.003:0.270 .|.:0.806,0.192,0.002:0.196 .|.:0.637,0.357,0.007:0.370 .|.:0.693,0.303,0.003:0.310 .|.:0.014,0.397,0.588:1.574 .|.:0.743,0.245,0.012:0.269 0|0:0.958,0.041,0.000:0.042 .|.:0.870,0.128,0.002:0.132 0|0:0.958,0.041,0.000:0.042 .|.:0.611,0.354,0.036:0.425 .|.:0.760,0.226,0.014:0.254 .|.:0.843,0.156,0.002:0.159
You could first convert the VCF file to PLINK files and then extract or remove the missing values. One downside of this is that you are going to miss the bi-allelic variants...
For someone who is not familiar with GTEx data, this question is totally unclear. Could you please specify:
By the way, I think your problem can easily be solved with grep/sed. Also, beware of posting sensitive (i.e., human) data on the internet if you obtained it through controlled access, via a user login to the dbGaP portal. In that case, you can produce an input file that looks like the one you want to analyze but contains no real data (for your own security).
You can filter out low-quality SNPs. That should remove most cases with missing genotypes. The remaining ones you can mark as missing data (NA or -1, say). Most statistical analyses you would run on the resulting data should take the missing data into account. (For example, if I were to do an eQTL analysis, I would use MatrixEQTL to carry out a linear-regression-based analysis, which handles missing data.)
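A rough sketch of that approach in plain Python, in case it helps. It is not a tested GTEx pipeline, and it makes several assumptions: the VCF is tab-delimited with a #CHROM header line, the MISS value in the INFO column is the per-variant fraction of missing calls, and a cutoff of 0.1 is reasonable (that threshold is arbitrary, pick your own). The function names, the -1 sentinel, and the demo records are all made up for illustration.

```python
import io

MISSING = -1  # sentinel for a missing call (arbitrary; NA would also work)

def gt_to_dosage(sample_field):
    """Convert one sample field, e.g. '0|1:0.013,0.960,0.028:1.015', to 0/1/2."""
    gt = sample_field.split(":", 1)[0]        # GT is the first FORMAT key
    alleles = gt.replace("/", "|").split("|")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return MISSING                        # '.|.', './.', or not bi-allelic
    return int(alleles[0]) + int(alleles[1])  # number of ALT alleles

def parse_info(info):
    """Turn 'A=1;B=2' into a dict; flags without '=' map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def genotype_matrix(lines, keep_samples=None, max_miss=0.1):
    """Return (sample names, {variant ID: [0/1/2/-1 per kept sample]})."""
    samples, idx, rows = [], [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                          # skip blanks and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            all_samples = fields[9:]          # sample columns start at 10
            idx = [i for i, s in enumerate(all_samples)
                   if keep_samples is None or s in keep_samples]
            samples = [all_samples[i] for i in idx]
            continue
        info = parse_info(fields[7])
        # Assumption: MISS is the fraction of missing genotypes at this site.
        if float(info.get("MISS", 0.0)) > max_miss:
            continue                          # drop low-quality variants
        calls = fields[9:]
        rows[fields[2]] = [gt_to_dosage(calls[i]) for i in idx]
    return samples, rows

# Made-up records (no real data), just to show the call.
demo = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\tvarA\tG\tT\t.\tPASS\tMISS=0.05\tGT:GL:DS"
    "\t0|1:0.0,1.0,0.0:1.0\t.|.:0.3,0.4,0.3:1.0\t1|1:0.0,0.0,1.0:2.0",
    "1\t200\tvarB\tA\tC\t.\tPASS\tMISS=0.80\tGT:GL:DS"
    "\t.|.:0.3,0.4,0.3:1.0\t.|.:0.3,0.4,0.3:1.0\t0|0:1.0,0.0,0.0:0.0",
])

samples, rows = genotype_matrix(io.StringIO(demo), keep_samples={"S1", "S3"})
print(samples, rows)  # ['S1', 'S3'] {'varA': [1, 2]}
```

For the real file you would pass an open file handle instead of the StringIO object. Keeping -1 (or NA) for the remaining missing calls leaves the decision about how to handle them to whatever downstream tool you use.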