I have a vcf file of a plant species obtained with GATK4 and I would like to impute few missing genotypes, including genotype probabilities or likelihoods in the output, so that I can then convert to BIMBAM format to run GEMMA for GWAS.
I tried BEAGLE (both versions 4.1 and 5.1) for imputation, adding the gp=true
option to include GP field in the output vcf file, but it does not do what I want. The output still only contains GT filed and no GP field.
Here is my command:
java -Xss5m -Xmx20g -jar beagle.jar gt=input.vcf gp=true out=imputed_example
And a small example of my input.vcf
file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TA_AM_01_01_F3_CC0_M1_1 TA_DE_01_02_F1_HC0_M1_1 TA_DE_01_04_F1_HC0_M1_1
Scaffold_1 2393 . C A 90.60 PASS AC=4;AF=0.010;AN=386;BaseQRankSum=0.00;DP=981;ExcessHet=0.0280;FS=2.077;InbreedingCoeff=0.2413;MQ=22.83;MQRankSum=0.319;QD=6.97;ReadPosRankSum=-6.740e-01;SOR=1.793 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:9,0:9:27:0,27,326 0/0:7,0:7:21:0,21,266
Scaffold_1 2413 . G T 210.92 PASS AC=7;AF=0.018;AN=398;BaseQRankSum=0.00;DP=1383;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.3551;MQ=22.52;MQRankSum=-1.383e+00;QD=16.22;ReadPosRankSum=-1.383e+00;SOR=0.666 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:10,0:10:30:0,30,391 0/0:7,0:7:21:0,21,292
Scaffold_1 2520 . A T 33956.70 PASS AC=170;AF=0.417;AN=408;BaseQRankSum=0.00;DP=2159;ExcessHet=0.0000;FS=2.541;InbreedingCoeff=0.8530;MQ=24.94;MQRankSum=-1.000e+00;QD=27.82;ReadPosRankSum=0.00;SOR=1.255 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0:0:.:.:.:0,0,0 1|1:0,10:10:30:1|1:8017321_A_T:408,30,0:8017321 0/0:8,0:8:24:.:.:0,24,308
Scaffold_1 2530 . C T 31680.20 PASS AC=167;AF=0.413;AN=404;BaseQRankSum=0.00;DP=1979;ExcessHet=-0.0000;FS=2.549;InbreedingCoeff=0.8469;MQ=25.10;MQRankSum=-1.280e+00;QD=31.17;ReadPosRankSum=0.319;SOR=1.291 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0:0:.:.:.:0,0,0 1|1:0,8:8:24:1|1:8017321_A_T:342,24,0:8017321 0/0:8,0:8:24:.:.:0,24,308
Does anyone know how to include GP field in BEAGLE output? Or any other software that can use phasing and LD to produce genotype probabilities or likelihoods for missing genotype calls?
Thanks very much
Can you provide some lines of the output of the vcf which you thought should have contained the GP values?
Sure, the output looks like this, with GT, but no GP values:
All of those snps were in the original VCF, so they won't have gps, since only imputed markers would have GPs. can you show some imputed SNPs in the output file?
Well, all those SNPs are missing in the first individual (TA_AM_01_01_F3_CC0_M1_1) in the input (GT --> "./.") and they are imputed and genotyped in the output (0|0). Or do you mean that GPs are given only for SNP absent in all individuals and present only in the reference panel (which I am not using because I don't have it for my species)?
I think the latter might be true, but I'll have a test run today on my PC to see if it is.