I have a vcf file where data has following structure.
##fileformat=VCFv4.1
##fileDate=20140609
##source=GAP VCF Generation V1.0; imputation by IMPUTE2 using 1000 genome Phase 3 as reference
##reference=Homo_sapiens_assembly19_v1
##INFO=(ID=EXP_FREQ_A1,Number=.,Type=Float,Description="Expected frequency of allele coded 1")
##INFO=(ID=IMPINFO,Number=.,Type=Float,Description="Measure of the observed statistical information associated with the allele frequency estimate")
##INFO=(ID=CERTAINTY,Number=.,Type=Float,Description="Average certainty of best-guess genotypes")
##INFO=(ID=TYPE,Number=1,Type=Integer,Description="0-imputed SNP;2-genotyped SNP")
##INFO=(ID=MISS,Number=.,Type=Float,Description="Missingness of SNP with posterior probability threshold of 0.9")
##INFO=(ID=HW,Number=.,Type=Float,Description="Hardy Weinberg Equilibrium P value")
##FORMAT=(ID=GT:,Number=1,Type=String,Description="Best Guessed Genotype with posterior probability threshold of 0.9")
##FORMAT=(ID=GL,Number=3,Type=Float,Description="Posterior probability of 0/0, 0/1, and 1/1")
##FORMAT=(ID=DS,Number=1,Type=Float,Description="Dosage")
##FILTER=(ID=HWEPVALLT1E6,Description="Hardy-Weinberg p-Value <= 1E-6")
##FILTER=(ID=maf01,Description="allele frequency less than 1% or larger than 99%")
##FILTER=(ID=maf05,Description="allele frequency between 1-5% or 95-99%")
##FILTER=(ID=info4,Description="info from imputation less than 0.4")
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-P4PP-0004-SM-2H2WW GTEX-PWO3-0004-SM-2IJGP GTEX-SNOS-0004-SM-325AE GTEX-PW2O-0004-SM-2IJGJ
1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.732;IMPINFO=0.446;CERTAINTY=0.730;TYPE=0;MISS=0.7867;HW=0.31695 GT:GL:DS ./.:0.006,0.483,0.511:1.505 ./.:0.210,0.498,0.292:1.082 0/1:0.012,0.962,0.026:1.014 ./.:0.050,0.38,0.57:1.520
1 51479 1_51479_T_A_b37 T A . PASS EXP_FREQ_A1=0.182;IMPINFO=0.405;CERTAINTY=0.779;TYPE=0;MISS=0.6600;HW=0.87518 GT:GL:DS ./.:0.514,0.481,0.005:0.491 ./.:0.658,0.311,0.031:0.373 0/0:0.991,0.009,0:0.009 ./.:0.243,0.55,0.207:0.964
Now for every SNP in each sample I want to get the genotype(GT). When an entry has 0/0 or 0/1 or 1/1, I already have that information. But for many entry, GT is uncertain or not assigned indicated by ./. In such case I want to assign a genotype to that entry. I have used genomic likelihood(GL) to calculate that based on which likelihood have highest value. But I am now concerned whether I am doing it correctly. If a snp/sample has likelihood 0.48 to be 0/1 and 0.51 likelihood of being 0/0, my method would assign 0/0. But these two values are very close.
- What is the standard cutoff on GL to assign a genotype to a sample for a SNP?
- Since I want to relax the threshold a bit what criteria I should apply?
perhaps not a solution but a comment.
1) ideally, you should not put cutoffs on GL but rather carry the uncertainty in the analyses 1) part 2. Ok honestly, few tools do this. So what we could be comfortable with the ratio between most likely genotype / second most likely genotype. What should the cutoff be? Whatever will make you sleep well at night. Do you want to the most likely to be 500X, 1000X or 10000X more likely that the second most likely. Well, use that. The less stringent you are, the more you will carry noise. The more stringent you are, the more you will kill off perfectly good data and potentially drag bizarre biases. 2) see 1) part 2