Funcotator gnomAD incoherent number of output fields
0
0
Entering edit mode
6 months ago

Hello

Funcotator germline with gnomAD is producing incoherent number of output fields. I am showing an example for a variant at chr1:598934

GATK version used: 4.5.0.0

I have called germline variants and performed Variant Quality Score Recalibration using GATK as por best practices, I then filter only allowing PASS variants and indexing it with:

bcftools view \
  -f 'PASS,.' \
  -O z \
  --threads 8 \
  -o output_filtered.vcf.gz INDEL_SNPS.applyModel.vcf.gz\
  && tabix -p vcf output_filtered.vcf.gz

Generating a file output_filtered.vcf.gz containing for example this line:

chr1    598934  .       CGGG    CG,C,*  374.02  PASS    AC=2,3,5;AF=0.083,0.125,0.208;AN=24;BaseQRankSum=0.674;DP=45;ExcessHet=0.0031;FS=0;InbreedingCoeff=0.3671;MLEAC=3,6,9;MLEAF=0.125,0.25,0.375;MQ=48.38;MQRankSum=-1.96;NEGATIVE_TRAIN_SITE;QD=17;ReadPosRankSum=-0.126;SOR=0.724;VQSLOD=-4.486;culprit=DP       GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       0/0:5,0,0,0:5:9:.:.:0,9,135,9,135,135,9,135,135,135:.        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       0/0:1,0,0,0:1:3:.:.:0,3,11,3,11,11,3,11,11,11:.      2|2:0,0,2,0:2:6:1|1:598934_CGGG_C:90,90,90,6,6,0,90,90,6,90:598934   ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.    ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0:.       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.   ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       0/0:3,0,0,0:3:6:.:.:0,6,90,6,90,90,6,90,90,90:.      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       0/0:4,0,0,0:4:0:.:.:0,0,29,0,29,29,0,29,29,29:.      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       3|3:0,0,0,2:2:6:1|1:598932_GGC_G:70,70,70,70,70,70,6,6,6,0:598932    1|1:0,4,0,0:4:12:1|1:598934_CGG_C:180,12,0,180,12,180,180,12,180,180:598934  0/0:3,0,0,0:3:3:.:.:0,3,45,3,45,45,3,45,45,45:. 0|2:3,0,3,0:6:99:0|1:598915_C_T:117,126,252,0,126,117,126,252,126,252:598915 0/0:5,0,0,0:5:12:.:.:0,12,180,12,180,180,12,180,180,180:.    0|3:2,0,0,4:6:15:0|1:598932_GGC_G:115,120,147,120,147,147,0,27,27,15:598932 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:90,90,90,90,90,90,6,6,6,0:598932        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.    ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.    ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0:.       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.   ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0:.

I use FUNCOTATOR germline (enabling gnomAD) data sources as specified here (https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial#1.1.3)

I annotate it using FUNCOTATOR and index the file using this command:

gatk --java-options "-Xmx30g -XX:ParallelGCThreads=8" Funcotator \
  --variant output_filtered.vcf.gz \
  -R /data/genome/hg38ucsc/hg38_no_alt.fa \
  --ref-version hg38 \
  --data-sources-path /data/genome/hg38ucsc/annotations/GATK_Funcotator_db/germline/funcotator_dataSources.v1.8.hg38.20230908g/ \
  -O final_annotated.vcf.gz \
  --output-file-format VCF \
&& tabix -p vcf final_annotated.vcf.gz

Which gives me the following line for the variant:

chr1    598934  .       CGGG    CG,C,*  374.02  PASS    AC=2,3,5;AF=0.083,0.125,0.208;AN=24;BaseQRankSum=0.674;DP=45;ExcessHet=0.0031;FS=0;FUNCOTATION=[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GG|GG|-|g.chr1:598935_598937delGG|ENST00000634833.2|-|||c.e3-2464CCCG>CG|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|],[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GGG|GGG|-|g.chr1:598935_598937delGGG|ENST00000634833.2|-|||c.e3-2464CCCG>G|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|],[|||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]#[ENSG00000230021|hg38|chr1|598934|598937|COULD_NOT_DETERMINE||NA|CGGG|CGGG|*||ENST00000634337.2|+||||||0.5781637717121588|AAGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634833.2_COULD_NOT_DETERMINE||||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||];InbreedingCoeff=0.3671;MLEAC=3,6,9;MLEAF=0.125,0.25,0.375;MQ=48.38;MQRankSum=-1.96;NEGATIVE_TRAIN_SITE;QD=17;ReadPosRankSum=-0.126;SOR=0.724;VQSLOD=-4.486;culprit=DP      GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      0/0:5,0,0,0:5:9:.:.:0,9,135,9,135,135,9,135,135,135     ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      0/0:1,0,0,0:1:3:.:.:0,3,11,3,11,11,3,11,11,11   2|2:0,0,2,0:2:6:1|1:598934_CGGG_C:90,90,90,6,6,0,90,90,6,90:598934   ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:3,0,0,0:3:6:.:.:0,6,90,6,90,90,6,90,90,90        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 0/0:4,0,0,0:4:0:.:.:0,0,29,0,29,29,0,29,29,29        ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 3|3:0,0,0,2:2:6:1|1:598932_GGC_G:70,70,70,70,70,70,6,6,6,0:598932    1|1:0,4,0,0:4:12:1|1:598934_CGG_C:180,12,0,180,12,180,180,12,180,180:598934  0/0:3,0,0,0:3:3:.:.:0,3,45,3,45,45,3,45,45,45   0|2:3,0,3,0:6:99:0|1:598915_C_T:117,126,252,0,126,117,126,252,126,252:598915 0/0:5,0,0,0:5:12:.:.:0,12,180,12,180,180,12,180,180,180 0|3:2,0,0,4:6:15:0|1:598932_GGC_G:115,120,147,120,147,147,0,27,27,15:598932  3|3:0,0,0,2:2:6:1|1:598932_GGC_G:90,90,90,90,90,90,6,6,6,0:598932    ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:1,0,0,0:1:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0 ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0      ./.:0,0,0,0:0:0:.:.:0,0,0,0,0,0,0,0,0,0

When I convert it to table using:

gatk --java-options "-Xmx30g -XX:ParallelGCThreads=8" VariantsToTable \
  -R /data/genome/hg38ucsc/hg38_no_alt.fa \
  -V final_annotated.vcf.gz \
  -F CHROM \
  -F POS \
  -F ID \
  -F TYPE \
  -F REF \
  -F ALT \
  -GF GT \
  -GF AD \
  -GF DP \
  -GF AC \
  -GF AN \
  -GF AF \
  -F FILTER \
  -F FUNCOTATION \
  -O final_table.tsv

When I see variants, my understanding is that the FUNCOTATION column contains the annotations, within this column, annotation values are separated by "|" and in some cases, there are a list of annotations in the format [ annotation1_value1| annotation1_value2|annotation1_value3|...],[|annotation2_value1| annotation2_value2|annotation2_value3|...]

The annotation fields, I look them up in the previous step result (Funcotator) in the header of the file final_annotated.vcf.gz after  Funcotation fields are:

##INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool.  Funcotation fields are: Gencode_43_hugoSymbol|Gencode_43_ncbiBuild|Gencode_43_chromosome|Gencode_43_start|Gencode_43_end|Gencode_43_variantClassification|Gencode_43_secondaryVariantClassification|Gencode_43_variantType|Gencode_43_refAllele|Gencode_43_tumorSeqAllele1|...

I count the number of "|" separated fields in this part of the header, totalling 129 fields.

So when I look into the FUNCOTATION column in my variants table (final_table.tsv) I should see for each line, integer multiples of 129 which I can map to the annotation field descriptors. However, some lines do not obey this and I have no way of mapping the values to the annotation field descriptors. For example, line 153 (just showing the FUNCOTATION column:

[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GG|GG|-|g.chr1:598935_598937delGG|ENST00000634833.2|-|||c.e3-2464CCCG>CG|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||1.77576e-01|2.34347e-01|2.30930e-01|2.36943e-01|2.06667e-01|2.04082e-01|2.09150e-01|1.06742e-01|6.25000e-02|1.16438e-01|2.62480e-01|2.75463e-01|2.55556e-01|1.74644e-01|1.35786e-01|1.40000e-01|1.31179e-01|1.79873e-01|1.46222e-01|1.51056e-01|1.44244e-01|1.47727e-01|1.44092e-01|1.45833e-01|0.00000e+00|1.38037e-01|1.31579e-01|1.45161e-01|2.62480e-01|1.62877e-01||1|534314|false|false|rs568786854|],[ENSG00000230021|hg38|chr1|598935|598937|RNA||DEL|GGG|GGG|-|g.chr1:598935_598937delGGG|ENST00000634833.2|-|||c.e3-2464CCCG>G|||0.5781637717121588|AGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634337.2_RNA||||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|]#[|||||||||||||||||||||||||||false|||||||||||||||||||||||||||||||||||||||||||2.84588e-01|4.09338e-01|4.08215e-01|4.10191e-01|2.73333e-01|2.61905e-01|2.84314e-01|1.96629e-01|1.25000e-01|2.12329e-01|2.55233e-01|2.40741e-01|2.62963e-01|2.82344e-01|2.48865e-01|2.43478e-01|2.54753e-01|2.86346e-01|2.30391e-01|2.43662e-01|2.32354e-01|2.28896e-01|2.27866e-01|2.13141e-01|0.00000e+00|2.50000e-01|2.36842e-01|2.64516e-01|4.09338e-01|2.63374e-01||1|534314|false|false|rs568786854|],[|||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]#[ENSG00000230021|hg38|chr1|598934|598937|COULD_NOT_DETERMINE||NA|CGGG|CGGG|*||ENST00000634337.2|+||||||0.5781637717121588|AAGAACACGGCGGGGGGGGGGGCG|ENSG00000230021_ENST00000634833.2_COULD_NOT_DETERMINE||||||||||||||||||||||||||||false||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false||]

The structure is:

[]#[],[]#[],[]#[]

So first, this seems to also use the separator "#" as well as "," but the problem is that the number of fields in each [] is:

[129]#[107],[129]#[107],[107]#[129]

How can I map the values to the annotation field descriptors??

gatk gnomAD Funcotator • 215 views
ADD COMMENT

Login before adding your answer.

Traffic: 1879 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6