Hello
Funcotator germline with gnomAD is producing incoherent number of output fields. I am showing an example for a variant at chr1:598934
GATK version used: 4.5.0.0
I have called germline variants and performed Variant Quality Score Recalibration using GATK as por best practices, I then filter only allowing PASS variants and indexing it with:
bcftools view \
-f 'PASS,.' \
-O z \
--threads 8 \
-o output_filtered.vcf.gz INDEL_SNPS.applyModel.vcf.gz\
&& tabix -p vcf output_filtered.vcf.gz
Generating a file output_filtered.vcf.gz containing for example this line:
chr1 598934 . CGGG CG,C,* 374.02 PASS AC=2,3,5;AF=0.083,0.125,0.208;AN=24;BaseQRankSum=0.674;DP=45;ExcessHet=0.0031;FS=0;InbreedingCoeff=0.3671;MLEAC=3,6,9;MLEAF=0.125,0.25,0.375;MQ=48.38;MQRankSum=-1.96;NEGATIVE_TRAIN_SITE;QD=17;ReadPosRankSum=-0.126;SOR=0.724;VQSLOD=-4.486;culprit=DP GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. 0/0:5,0,0,0:5:9:.:.:0,9,135,9,135,135,9,135,135,135:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. 0/0:1,0,0,0:1:3:.:.:0,3,11,3,11,11,3,11,11,11:. 2|2:0,0,2,0:2:6:1|1:598934_CGGG_C:90,90,90,6,6,0,90,90,6,90:598934 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. 0/0:3,0,0,0:3:6:.:.:0,6,90,6,90,90,6,90,90,90:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. 0/0:4,0,0,0:4:0:.:.:0,0,29,0,29,29,0,29,29,29:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:70,70,70,70,70,70,6,6,6,0:598932 1|1:0,4,0,0:4:12:1|1:598934_CGG_C:180,12,0,180,12,180,180,12,180,180:598934 0/0:3,0,0,0:3:3:.:.:0,3,45,3,45,45,3,45,45,45:. 0|2:3,0,3,0:6:99:0|1:598915_C_T:117,126,252,0,126,117,126,252,126,252:598915 0/0:5,0,0,0:5:12:.:.:0,12,180,12,180,180,12,180,180,180:. 0|3:2,0,0,4:6:15:0|1:598932_GGC_G:115,120,147,120,147,147,0,27,27,15:598932 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:90,90,90,90,90,90,6,6,6,0:598932 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:. ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.
I use FUNCOTATOR germline (enabling gnomAD) data sources as specified here (https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial#1.1.3)
I annotate it using FUNCOTATOR and index the file using this command:
gatk --java-options "-Xmx30g -XX:ParallelGCThreads=8" Funcotator \
--variant output_filtered.vcf.gz \
-R /data/genome/hg38ucsc/hg38_no_alt.fa \
--ref-version hg38 \
--data-sources-path /data/genome/hg38ucsc/annotations/GATK_Funcotator_db/germline/funcotator_dataSources.v1.8.hg38.20230908g/ \
-O final_annotated.vcf.gz \
--output-file-format VCF \
&& tabix -p vcf final_annotated.vcf.gz
Which gives me the following line for the variant:
chr1 598934 . CGGG CG,C,* 374.02 PASS AC=2,3,5;AF=0.083,0.125,0.208;AN=24;BaseQRankSum=0.674;DP=45;ExcessHet=0.0031;FS=0;FUNCOTATION=[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GG|GG|-|g.chr1:598935_598937delGG|ENST00000634833.2|-|||c.e3-2464CCCG>CG|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|],[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GGG|GGG|-|g.chr1:598935_598937delGGG|ENST00000634833.2|-|||c.e3-2464CCCG>G|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|],[|||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]#[ENSG00000230021|hg38|chr1|598934|598937|COULD_NOT_DETERMINE||NA|CGGG|CGGG|*||ENST00000634337.2|+||||||0.5781637717121588|AAGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634833.2_COULD_NOT_DETERMINE||||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||];InbreedingCoeff=0.3671;MLEAC=3,6,9;MLEAF=0.125,0.25,0.375;MQ=48.38;MQRankSum=-1.96;NEGATIVE_TRAIN_SITE;QD=17;ReadPosRankSum=-0.126;SOR=0.724;VQSLOD=-4.486;culprit=DP GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:5,0,0,0:5:9:.:.:0,9,135,9,135,135,9,135,135,135 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:1,0,0,0:1:3:.:.:0,3,11,3,11,11,3,11,11,11 2|2:0,0,2,0:2:6:1|1:598934_CGGG_C:90,90,90,6,6,0,90,90,6,90:598934 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:3,0,0,0:3:6:.:.:0,6,90,6,90,90,6,90,90,90 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:4,0,0,0:4:0:.:.:0,0,29,0,29,29,0,29,29,29 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:70,70,70,70,70,70,6,6,6,0:598932 1|1:0,4,0,0:4:12:1|1:598934_CGG_C:180,12,0,180,12,180,180,12,180,180:598934 0/0:3,0,0,0:3:3:.:.:0,3,45,3,45,45,3,45,45,45 0|2:3,0,3,0:6:99:0|1:598915_C_T:117,126,252,0,126,117,126,252,126,252:598915 0/0:5,0,0,0:5:12:.:.:0,12,180,12,180,180,12,180,180,180 0|3:2,0,0,4:6:15:0|1:598932_GGC_G:115,120,147,120,147,147,0,27,27,15:598932 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:90,90,90,90,90,90,6,6,6,0:598932 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0
When I convert it to table using:
gatk --java-options "-Xmx30g -XX:ParallelGCThreads=8" VariantsToTable \
-R /data/genome/hg38ucsc/hg38_no_alt.fa \
-V final_annotated.vcf.gz \
-F CHROM \
-F POS \
-F ID \
-F TYPE \
-F REF \
-F ALT \
-GF GT \
-GF AD \
-GF DP \
-GF AC \
-GF AN \
-GF AF \
-F FILTER \
-F FUNCOTATION \
-O final_table.tsv
When I see variants, my understanding is that the FUNCOTATION column contains the annotations, within this column, annotation values are separated by "|" and in some cases, there are a list of annotations in the format [ annotation1_value1| annotation1_value2|annotation1_value3|...],[|annotation2_value1| annotation2_value2|annotation2_value3|...]
The annotation fields, I look them up in the previous step result (Funcotator) in the header of the file final_annotated.vcf.gz after Funcotation fields are:
##INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool. Funcotation fields are: Gencode_43_hugoSymbol|Gencode_43_ncbiBuild|Gencode_43_chromosome|Gencode_43_start|Gencode_43_end|Gencode_43_variantClassification|Gencode_43_secondaryVariantClassification|Gencode_43_variantType|Gencode_43_refAllele|Gencode_43_tumorSeqAllele1|...
I count the number of "|" separated fields in this part of the header, totalling 129 fields.
So when I look into the FUNCOTATION column in my variants table (final_table.tsv) I should see for each line, integer multiples of 129 which I can map to the annotation field descriptors. However, some lines do not obey this and I have no way of mapping the values to the annotation field descriptors. For example, line 153 (just showing the FUNCOTATION column:
[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GG|GG|-|g.chr1:598935_598937delGG|ENST00000634833.2|-|||c.e3-2464CCCG>CG|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|],[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GGG|GGG|-|g.chr1:598935_598937delGGG|ENST00000634833.2|-|||c.e3-2464CCCG>G|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|],[|||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]#[ENSG00000230021|hg38|chr1|598934|598937|COULD_NOT_DETERMINE||NA|CGGG|CGGG|*||ENST00000634337.2|+||||||0.5781637717121588|AAGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634833.2_COULD_NOT_DETERMINE||||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]
The structure is:
[]#[],[]#[],[]#[]
So first, this seems to also use the separator "#" as well as "," but the problem is that the number of fields in each [] is:
[129]#[107],[129]#[107],[107]#[129]
How can I map the values to the annotation field descriptors??