Prediction tools summary - zero values
1
3
13 months ago
Lukas ▴ 130

Hi,

I am summarising 15 prediction tools for my filtered variants into one overall result to assess pathogenicity. Five of the tools give numerical prediction scores and ten give letter values following the ACMG recommendation.

The 5 numerical prediction tools are REVEL, CADD_phred, Eigen_phred, Eigen_PC_phred and DANN.

In my dataset, "." marks a missing value set by the snpEff annotation. I decided to replace it with U (Unknown) for the ACMG-style values and with 0.0 for the numerical scores.

However, I unintentionally used 0.0, which all five numerical tools interpret as benign, as my "Unknown" value. At first I thought it did not matter, because a score of exactly 0.0 is highly unlikely and I am focusing on pathogenic values, but now I am not sure whether that reasoning is actually valid.
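For illustration, what I did looks roughly like this in pandas (the file and column names here are only placeholders, not my exact code):

    import pandas as pd

    # hypothetical tab-delimited export of the snpEff/dbNSFP annotations
    df = pd.read_csv("annotated_variants.tsv", sep="\t", dtype=str)

    acmg_tools = ["SIFT", "MetaSVM", "MetaLR"]        # letter-valued tools (examples)
    numeric_tools = ["REVEL", "CADD_phred", "DANN"]   # numeric tools (examples)

    # "." means "no prediction" in the annotation
    df[acmg_tools] = df[acmg_tools].replace(".", "U")
    # this is the questionable part: 0.0 overlaps with genuinely benign scores
    df[numeric_tools] = df[numeric_tools].replace(".", "0.0").astype(float)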

Is it statistically acceptable to leave it like that?

statistics • 1.4k views
0

luffy Sorry, but I would love to know your take on the data analysis I mentioned to you a few days ago. If you would consider sharing your thoughts I would be honoured. Even if you don't, I want to thank you for your time and the earlier suggestions. Have a nice day.

0

Hello, Lukas. Would you be so kind as to share the link or ID of your schizophrenia VCF dataset? I am a master's student and I am interested in polygenic diseases for my thesis. Or is this dataset private?

3
13 months ago
luffy ▴ 130

@Lukas, using 15 prediction tools for the filtering step is not recommended. Still, if you want to go ahead with it, apply a cut-off based on each tool's author recommendation (or a general recommendation for that tool) rather than the same cut-off for all the tools. As for your question about the 0.0 values for the numerical tools: since 0.0 represents benign, that is not the right way to encode missing predictions, and all ACMG criteria need to be considered.

0

Thank you for your answer, I really appreciate it. Regarding your recommendation about the 15 prediction tools, I don't think it applies here. My goal is to get a summary value from the 15 tools in order to increase the precision of my variant interpretation. I know that combining tools can be quite problematic, but I have a .vcf file annotated with snpEff and, in my opinion, I don't have much of a choice.

As for the authors' cut-offs, it is actually pretty hard to find any for several of the tools, e.g. Eigen (no author recommendation at all) and REVEL (Ensembl recommends choosing your own cut-off depending on your needs). Also, my VCF has only 95 subjects. So if I applied the recommended REVEL value (0.5) I would keep less than 1% of my data.

On the 0.0 values I see your point, but I don't know what to do instead. Because the annotation comes from snpEff and the dbNSFP database, which is not tied directly to any single tool, I was not able to extract the prediction scores per variant. Instead I get multiple predictions per tool across the alleles, and I didn't see any option other than somehow summarising those values. Do you think it would be better practice to filter out the entries without values for all my selected tools and then combine the rest into one value?

1

@Lukas, I am not sure about the objective of your study; I assumed it is clinical interpretation of variants. If that is the case, then considering all 15 tools is still not recommended. This is based on the excellent work by ClinGen, so kindly go through this paper. It will also answer your question about cut-offs: they have calibrated the tools and provided their own cut-offs.

For your question about multiple predictions per tool across alleles: you could choose the most deleterious (highest) score, if you are still going ahead with the 15-tool summary.
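For example, something along these lines (just a sketch; it assumes the per-allele scores sit in one comma-separated field):

    import numpy as np

    def most_deleterious(field):
        """Highest numeric score from a comma-separated per-allele field,
        e.g. '0.125,0.562' -> 0.562; '.' (no prediction) is ignored."""
        values = [float(v) for v in str(field).split(",") if v not in (".", "")]
        return max(values) if values else np.nan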

In the future it would be better to export the annotation results into a tab-delimited file; these filtering steps can easily be applied using Python or R code.
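For example, once the annotations are in a tab-delimited file, per-tool cut-offs take only a few lines of pandas (the file name and cut-off values below are placeholders; use the calibrated cut-offs from the paper):

    import numpy as np
    import pandas as pd

    # keep "." as missing instead of turning it into 0.0
    df = pd.read_csv("annotations.tsv", sep="\t", na_values=".")

    cutoffs = {"REVEL": 0.5, "CADD_phred": 20, "DANN": 0.96}   # placeholder values

    for tool, cutoff in cutoffs.items():
        df[tool + "_call"] = np.where(df[tool].isna(), "U",
                                      np.where(df[tool] >= cutoff, "P", "B"))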

So if I applied the recommended REVEL value (0.5) I would keep less than 1% of my data.

Could you please elaborate on the filtering steps you are considering?

Hope this helps

0

luffy Thank you for your response.

Regarding the objective of my study: I am a master's student in neurobiology, focused on neurophysiology. The goal of the study is to filter potentially pathogenic variants out of a .vcf file of genetic data from 95 subjects related to schizophrenia, annotated with the snpEff tool. So it is basic research without clinical implications.

Because schizophrenia is a polygenic neurodevelopmental disorder, I am not able to use the ACMG recommendations directly. After I found out that ClinGen does not yet have recommendations for polygenic diseases, I didn't have much choice and had to create something like my own metascore to assess the impact of the selected variants.

The reasoning behind my REVEL assumption is this: the .vcf file is the result of targeted whole-exome or whole-gene sequencing, and with the small number of subjects many variants end up with no prediction values. In the case of REVEL, I have 78578 variants in the WES .vcf overall, but only approximately 4000 variants have values, and 2000 of those are at most 0.258. And because of this REVEL note on Ensembl:

We strongly recommend the actual score is used when assessing a variant and a cut-off appropriate to your requirements is chosen.

I decided on a general procedure for all the numerical prediction tools: for each numeric tool I set a threshold at the 75th percentile of its values, computed after excluding the missing data. This is a compromise intended to increase accuracy for the pathogenic predictions. This was the main point of my question: I am not really sure whether I can even do that from a statistical point of view.
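In code it is basically this (a sketch with toy data; the real column comes from my filtered annotations):

    import numpy as np
    import pandas as pd

    # toy stand-in for my annotation table
    df = pd.DataFrame({"REVEL": ["0.125", ".", "0.562", "0.03", "0.91"]})

    scores = df["REVEL"].replace(".", np.nan).astype(float)

    # threshold = 75th percentile of the scores that actually have a prediction
    threshold = scores.dropna().quantile(0.75)   # 0.649 for this toy data

    # P = at/above threshold, B = below, U = no prediction
    df["REVEL_call"] = np.where(scores.isna(), "U",
                                np.where(scores >= threshold, "P", "B"))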

In the future it would be better to export the annotation results into a tab-delimited file

That is actually impossible here. To the best of my knowledge, filtering with SnpSift requires choosing a delimiter for the per-allele data in each column (for example REVEL data for allele1, allele2 looks like 0.125,0.562, and for a single allele like 0.251), typically a comma. So I am filtering the file and analysing it with pandas, scipy and numpy in a Jupyter notebook, but I still have to collapse those values into one. Because of possible outliers I didn't take the maximum value: I used the mean for the numerical tools, and for the ACMG-style columns I took the most frequent value (for example tools with D|T values like SIFT).
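The collapsing step looks roughly like this (simplified from my notebook; the function names are just mine):

    from collections import Counter
    import numpy as np

    def collapse_numeric(field):
        """Mean of the comma-separated per-allele scores,
        e.g. '0.125,0.562' -> 0.3435; '.' entries are ignored."""
        values = [float(v) for v in str(field).split(",") if v not in (".", "")]
        return np.mean(values) if values else np.nan

    def collapse_letters(field):
        """Most frequent letter call, e.g. 'D,T,D' -> 'D'; '.' counts as 'U'."""
        values = ["U" if v == "." else v for v in str(field).split(",")]
        return Counter(values).most_common(1)[0][0]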

0

The steps of my filtering:

  1. Filter the values needed for downstream analysis out of the main .vcf file: the genomic coordinates with effects, plus the prediction tools CADD_phred, REVEL, DANN, Eigen, Eigen_PC, MetaSVM, MetaLR, LRT, SIFT, FATHMM, PROVEAN, MutationTaster, MutationAssessor, PolyPhen2-HDIV + PolyPhen2-HVAR || goal: pick potentially pathogenic coding missense variants
  2. Replace the "." empty values - tools with ACMG-style letters change to "U"; numerical tools change to 0.0
  3. Collapse multiple values and convert them accordingly - a) split the multi-value score field and replace "." accordingly b) if numeric, use mean(), else use the most frequent value
  4. Convert the values into 3 categories - Pathogenic (P), Benign (B), Unknown (U) - a) numerical values according to each tool's threshold (taken from the tool's data without "." values; CADD_phred = 15, used generally throughout the analysis, others the 75th percentile); b) letter-valued scores converted with a mapping function (e.g. possibly pathogenic and probably pathogenic -> P, pathogenic)
  5. Count the occurrences of U, P and B and take as the result whichever category has an absolute majority of occurrences (see the sketch below).
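
The last step is essentially a majority vote over the 15 per-tool calls. A sketch (column names are examples; falling back to U when nothing reaches an absolute majority is just one possible choice):

    from collections import Counter

    # the per-tool P/B/U columns produced in step 4 (only three shown here)
    call_columns = ["REVEL_call", "CADD_phred_call", "SIFT_call"]

    def summary_call(row):
        """P, B or U if one category has an absolute majority, otherwise U."""
        counts = Counter(row[col] for col in call_columns)
        label, n = counts.most_common(1)[0]
        return label if n > len(call_columns) / 2 else "U"

    # df is the annotation table that already holds the *_call columns
    df["summary"] = df.apply(summary_call, axis=1)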

I know the filtering process may not be very precise, but it is the only way I was able to come up with for now. Any suggestion is appreciated.

