dbNSFP in-silico predictor value counts do not equal the number of alternative alleles in VCF file
3 months ago
Lukas ▴ 130

I would love to ask a question regarding unequal counts in the dbNSFP database and data annotated by the hs38DH reference sequence.

The data were reannotated in 2021 against the GRCh38.p13 reference with SnpEff and dbNSFP. They come from a targeted gene panel for schizophrenia.

Variants were called with freebayes v1.3.2-44-gfce9620.

My data analysis could not handle multi-value annotation fields: it effectively picked the maximum value inside each cell, which creates an upward bias (for 10,25.5,7.52 it would treat every variant as 25.5). I decided to improve on that.
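To make the upward bias concrete, here is a minimal sketch of the "take the maximum" behaviour. The helper name `max_of_field` is hypothetical and not part of any tool; it assumes missing values are written as `.`.

```python
def max_of_field(raw):
    """Return the largest numeric value in a comma-separated field, or None."""
    # Skip missing entries, which VCF annotations usually encode as "."
    values = [float(v) for v in raw.split(",") if v not in (".", "")]
    return max(values) if values else None


# The example from the post: every allele at the site would be scored 25.5
print(max_of_field("10,25.5,7.52"))  # 25.5
```

Whatever the individual values are, each allele at the site inherits the highest one, which is why splitting the record per allele looked like the more precise option.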

I ran bcftools norm -m-both -o output.vcf.gz -Oz input.vcf.gz to put every variant on its own line and make the analysis more precise. The command produced multiple errors like these:

Error: wrong number of fields in INFO/dbNSFP_PHRED at chr1:1034085, expected 6, found 4
Error: wrong number of fields in INFO/dbNSFP_1000Gp3_AF at chr1:32695609, expected 2, found 1
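A rough sketch of the kind of check behind these errors: compare the number of comma-separated values in each INFO tag with what the header's Number declaration implies. The function names are hypothetical, Number=G and Number=. are deliberately not checked, and the REF/ALT alleles in the sample record are invented for illustration (only the position and the VEST4 values come from my file).

```python
import re

HEADER_RE = re.compile(r"##INFO=<ID=([^,]+),Number=([^,]+),")


def declared_numbers(header_lines):
    """Map each INFO tag to its declared Number ('A', 'R', '1', '.', ...)."""
    numbers = {}
    for line in header_lines:
        match = HEADER_RE.match(line)
        if match:
            numbers[match.group(1)] = match.group(2)
    return numbers


def expected_count(number, n_alt):
    """Expected value count for a Number declaration, or None if unchecked."""
    if number == "A":
        return n_alt          # one value per ALT allele
    if number == "R":
        return n_alt + 1      # one value per allele, REF included
    if number.isdigit():
        return int(number)
    return None               # '.', 'G': not checked in this sketch


def mismatches(record_line, numbers):
    """List (chrom, pos, tag, expected, found) for tags with wrong counts."""
    fields = record_line.rstrip("\n").split("\t")
    chrom, pos, alt, info = fields[0], fields[1], fields[4], fields[7]
    n_alt = len(alt.split(","))
    problems = []
    for item in info.split(";"):
        if "=" not in item:   # flags carry no values
            continue
        key, value = item.split("=", 1)
        expected = expected_count(numbers.get(key, "."), n_alt)
        found = len(value.split(","))
        if expected and found != expected:
            problems.append((chrom, pos, key, expected, found))
    return problems


header = ['##INFO=<ID=dbNSFP_VEST4_score,Number=A,Type=Float,Description="x">']
# REF and ALT below are made up; the position and values are from the post
record = "chr6\t166858228\t.\tG\tA\t.\t.\tdbNSFP_VEST4_score=0.106,0.252,."
print(mismatches(record, declared_numbers(header)))
# [('chr6', '166858228', 'dbNSFP_VEST4_score', 1, 3)]
```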

After deleting dbNSFP columns one by one, I found that 13 of the 31 in-silico prediction fields have value counts that are inconsistent with the number of alternative alleles. Once I deleted those fields, the file split without problems. The affected fields are:

INFO/dbNSFP_PHRED (CADD_phred values)
INFO/dbNSFP_1000Gp3_AF
INFO/dbNSFP_REVEL_score
INFO/dbNSFP_VEST4_score
INFO/dbNSFP_DANN_score
INFO/dbNSFP_Eigen-phred_coding
INFO/dbNSFP_Eigen-PC-phred_coding
INFO/dbNSFP_LINSIGHT
INFO/dbNSFP_phastCons100way_vertebrate
INFO/dbNSFP_gnomAD_genomes_AF
INFO/dbNSFP_1000Gp3_EUR_AF
INFO/dbNSFP_ExAC_NFE_AF
INFO/dbNSFP_gnomAD_genomes_NFE_AF
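For illustration, dropping such tags from a record's INFO column amounts to something like the sketch below. The function name `drop_info_tags` and the sample INFO string are hypothetical; the sketch only touches the INFO text of a record and leaves header lines alone. On a real file, bcftools annotate with its -x/--remove option should do the same job.

```python
def drop_info_tags(info, tags_to_drop):
    """Return the INFO string without the given tags ('.' if nothing is left)."""
    kept = [
        item
        for item in info.split(";")
        if item.split("=", 1)[0] not in tags_to_drop
    ]
    return ";".join(kept) if kept else "."


# Hypothetical INFO column with one of the problematic tags
info = "DP=20;dbNSFP_PHRED=10,25.5,7.52;dbNSFP_SIFT_pred=T"
print(drop_info_tags(info, {"dbNSFP_PHRED", "dbNSFP_1000Gp3_AF"}))
# DP=20;dbNSFP_SIFT_pred=T
```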

After that, I found out with vcf-validator that this problem affects not just dozens of variants but thousands.

So I tried to reannotate the data with the dbNSFP database GRCh38.p13 v.a4.2 from SnpSift. It didn't help; I got the same message. Then I decided to remove variants that lack genotypes for all samples, considering that a possible indication of CNVs etc., but the error still appears.

Has anyone already solved this problem, where the number of predictive tool results does not match the number of alternative alleles even though the VCF header declares that these numbers should match? If so, did you use a modified VCF or did you create a new one?

In short, I'm asking whether this discrepancy between the data and the header effectively makes the VCF unusable.



appendix:

I even tried deleting all multiallelic positions from my VCF with bcftools view -T ^pos.txt -o output.vcf -Ov input.vcf, to check whether the discrepancies also affect non-multiallelic positions, and then checked the result again with vcf-validator.
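For reference, a pos.txt of this kind can be produced roughly as in the sketch below: collect CHROM and POS for every record whose ALT column contains a comma and write them tab-separated, one position per line. The function name and the sample records are hypothetical.

```python
def multiallelic_positions(vcf_lines):
    """Return (chrom, pos) for every record with more than one ALT allele."""
    positions = []
    for line in vcf_lines:
        if line.startswith("#"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:           # skip malformed or empty lines
            continue
        if "," in fields[4]:          # ALT column lists several alleles
            positions.append((fields[0], fields[1]))
    return positions


# Hypothetical records, only for illustration
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tC\tT,G\t.\t.\t.",
]
for chrom, pos in multiallelic_positions(lines):
    print(f"{chrom}\t{pos}")          # one tab-separated line per position
```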

Example:

INFO field at chr6:166858228 .. INFO tag [dbNSFP_SIFT_pred=T,T,T] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_Polyphen2_HDIV_pred=B,B,.] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_FATHMM_pred=T,T,T] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_LIST_S2_pred=T,T,T] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_Polyphen2_HVAR_pred=B,B,.] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_PROVEAN_pred=N,N,N] expected different number of values (expected 1, found 3),INFO tag [dbNSFP_MutationTaster_pred=P,P] expected different number of values (expected 1, found 2),INFO tag [dbNSFP_VEST4_score=0.106,0.252,.] expected different number of values (expected 1, found 3)
INFO field at chr7:1091742 .. INFO tag [dbNSFP_SIFT_pred=D,D,D,D,D,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_Polyphen2_HDIV_pred=B,.,B,B,B,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_MutationAssessor_pred=N,.,N,N,N,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_FATHMM_pred=T,T,T,T,T,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_LIST_S2_pred=.,T,T,.,.,T] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_Polyphen2_HVAR_pred=B,.,B,B,B,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_PROVEAN_pred=N,D,N,N,N,.] expected different number of values (expected 1, found 6),INFO tag [dbNSFP_MutationTaster_pred=N,N,N,N,N,N,N] expected different number of values (expected 1, found 7),INFO tag [dbNSFP_VEST4_score=0.047,.,0.036,0.037,0.045,0.064] expected different number of values (expected 1, found 6)

So even single-allele variants are affected by the mismatch between the number of values in the dbNSFP annotations and the number of alternative alleles.

question connected with Proper reanotation of genepanel with snpEff and dbNSFP

vcf • 234 views