Question

Converting GL value to GT format of vcf data

0

Entering edit mode

6.5 years ago

monjura.afrin • 0

I have a vcf file where data has following structure.

##fileformat=VCFv4.1
##fileDate=20140609
##source=GAP VCF Generation V1.0; imputation by IMPUTE2 using 1000 genome Phase 3 as reference
##reference=Homo_sapiens_assembly19_v1
##INFO=(ID=EXP_FREQ_A1,Number=.,Type=Float,Description="Expected frequency of allele coded 1")
##INFO=(ID=IMPINFO,Number=.,Type=Float,Description="Measure of the observed statistical information associated with the allele frequency estimate")
##INFO=(ID=CERTAINTY,Number=.,Type=Float,Description="Average certainty of best-guess genotypes")
##INFO=(ID=TYPE,Number=1,Type=Integer,Description="0-imputed SNP;2-genotyped SNP")
##INFO=(ID=MISS,Number=.,Type=Float,Description="Missingness of SNP with posterior probability threshold of 0.9")
##INFO=(ID=HW,Number=.,Type=Float,Description="Hardy Weinberg Equilibrium P value")
##FORMAT=(ID=GT:,Number=1,Type=String,Description="Best Guessed Genotype with posterior probability threshold of 0.9")
##FORMAT=(ID=GL,Number=3,Type=Float,Description="Posterior probability of 0/0, 0/1, and 1/1")
##FORMAT=(ID=DS,Number=1,Type=Float,Description="Dosage")
##FILTER=(ID=HWEPVALLT1E6,Description="Hardy-Weinberg p-Value <= 1E-6")
##FILTER=(ID=maf01,Description="allele frequency less than 1% or larger than 99%")
##FILTER=(ID=maf05,Description="allele frequency between 1-5% or 95-99%")
##FILTER=(ID=info4,Description="info from imputation less than 0.4")
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  GTEX-P4PP-0004-SM-2H2WW GTEX-PWO3-0004-SM-2IJGP GTEX-SNOS-0004-SM-325AE GTEX-PW2O-0004-SM-2IJGJ
1   30923   1_30923_G_T_b37 G   T   .   PASS    EXP_FREQ_A1=0.732;IMPINFO=0.446;CERTAINTY=0.730;TYPE=0;MISS=0.7867;HW=0.31695   GT:GL:DS    ./.:0.006,0.483,0.511:1.505 ./.:0.210,0.498,0.292:1.082 0/1:0.012,0.962,0.026:1.014 ./.:0.050,0.38,0.57:1.520   
1   51479   1_51479_T_A_b37 T   A   .   PASS    EXP_FREQ_A1=0.182;IMPINFO=0.405;CERTAINTY=0.779;TYPE=0;MISS=0.6600;HW=0.87518   GT:GL:DS    ./.:0.514,0.481,0.005:0.491 ./.:0.658,0.311,0.031:0.373 0/0:0.991,0.009,0:0.009 ./.:0.243,0.55,0.207:0.964

Now for every SNP in each sample I want to get the genotype(GT). When an entry has 0/0 or 0/1 or 1/1, I already have that information. But for many entry, GT is uncertain or not assigned indicated by ./. In such case I want to assign a genotype to that entry. I have used genomic likelihood(GL) to calculate that based on which likelihood have highest value. But I am now concerned whether I am doing it correctly. If a snp/sample has likelihood 0.48 to be 0/1 and 0.51 likelihood of being 0/0, my method would assign 0/0. But these two values are very close.

What is the standard cutoff on GL to assign a genotype to a sample for a SNP?
Since I want to relax the threshold a bit what criteria I should apply?

SNP genotype • 2.3k views

ADD COMMENT • link updated 6.5 years ago by finswimmer 16k • written 6.5 years ago by monjura.afrin • 0

0

Entering edit mode

perhaps not a solution but a comment.

1) ideally, you should not put cutoffs on GL but rather carry the uncertainty in the analyses 1) part 2. Ok honestly, few tools do this. So what we could be comfortable with the ratio between most likely genotype / second most likely genotype. What should the cutoff be? Whatever will make you sleep well at night. Do you want to the most likely to be 500X, 1000X or 10000X more likely that the second most likely. Well, use that. The less stringent you are, the more you will carry noise. The more stringent you are, the more you will kill off perfectly good data and potentially drag bizarre biases. 2) see 1) part 2

ADD REPLY • link 6.5 years ago by Gabriel R. ★ 2.9k