I. Prediction using ML methods
Methods to use: support vector machine (Random Forest?)
- Download variant data from ClinVar, other databases and the literature
- Remove unnecessary columns and rows with missing data; keep c.change, p.change, clinical significance, etc.
- Use sklearn/pandas (OneHotEncoder/get_dummies) to convert the data into binary features
- For the Position Specific Method, take c.change/p.change, remove the c./p. prefix, split it into 3 columns (wild, loc, new) and convert all other characters into strings
- Split data into train and test sets and classify using SVC
- draw confusion matrix and check
- Determine the best C and gamma by cross-validation using GridSearchCV
- draw confusion matrix and check
- Classify with the newly determined C and gamma
- Classify the patients' variants
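The steps in part I could be sketched roughly as below. This is only a minimal sketch under assumptions, not a tested pipeline: the toy variants and their labels are invented by an arbitrary rule (they are not ClinVar records), the helper `split_p_change` and the column names `wild`, `loc`, `new` are hypothetical, and only simple missense notation such as `p.Arg117His` is handled.

```python
import itertools
import re

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

COLUMNS = ["wild", "loc", "new"]
LABELS = ["benign", "pathogenic"]


def split_p_change(p_change):
    """Split e.g. 'p.Arg117His' into ('Arg', '117', 'His'); all parts stay strings."""
    match = re.fullmatch(r"p\.([A-Za-z]{3})(\d+)([A-Za-z]{3})", p_change)
    if match is None:
        raise ValueError(f"unsupported p.change format: {p_change!r}")
    return match.groups()


def to_features(p_changes):
    """Build the three-column (wild, loc, new) table from p.change strings."""
    return pd.DataFrame([split_p_change(p) for p in p_changes], columns=COLUMNS)


# Invented toy data: every ordered residue pair at three positions.
# The labelling rule is arbitrary and only there to give the model a signal.
residues = ["Arg", "Gly", "Leu", "Pro", "Ser", "His"]
p_changes, labels = [], []
for wild, new in itertools.permutations(residues, 2):
    for loc in (10, 20, 30):
        p_changes.append(f"p.{wild}{loc}{new}")
        labels.append("pathogenic" if new == "Pro" or wild == "Gly" else "benign")

X = to_features(p_changes)
y = pd.Series(labels)

# Stratified split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One-hot encode all three string columns, then fit an RBF-kernel SVC.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), COLUMNS)]
    )),
    ("svc", SVC(kernel="rbf")),
])
model.fit(X_train, y_train)
baseline_cm = confusion_matrix(y_test, model.predict(X_test), labels=LABELS)
print(baseline_cm)

# Cross-validated search over C and gamma (the grid values are arbitrary).
search = GridSearchCV(
    model,
    {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)

# GridSearchCV refits the best model on the training set, so predict() uses it.
tuned_cm = confusion_matrix(y_test, search.predict(X_test), labels=LABELS)
print(tuned_cm)

# Hypothetical patient variants, classified with the tuned model.
patient_pred = search.predict(to_features(["p.Gly20Ser", "p.Leu10Pro"]))
print(patient_pred)
```

Putting the encoder inside the pipeline means it is fitted on the training folds only, and `handle_unknown="ignore"` lets the model accept patient variants at positions or residues it has not seen, although such variants then carry no signal for that column.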
II. Prediction of effects of variants with known prediction tools
- Compare with results of the previous step (tools e.g. SIFT, PolyPhen, etc.)
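One possible way to do this comparison is to measure how often the classifier and a tool agree, and correct that for chance with Cohen's kappa. The sketch below is an assumption about how the comparison might be set up: the calls are invented, and mapping SIFT's "deleterious"/"tolerated" categories onto pathogenic/benign is a simplification chosen for illustration.

```python
from collections import Counter

# Assumed mapping from SIFT categories to the two classes used by the classifier.
SIFT_TO_CLASS = {"deleterious": "pathogenic", "tolerated": "benign"}


def cohen_kappa(calls_a, calls_b):
    """Chance-corrected agreement between two lists of categorical calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    counts_a, counts_b = Counter(calls_a), Counter(calls_b)
    # Agreement expected if both raters assigned labels independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to everything
    return (observed - expected) / (1.0 - expected)


# Invented calls for eight variants, for illustration only.
svm_calls = ["pathogenic", "pathogenic", "benign", "benign",
             "pathogenic", "benign", "pathogenic", "benign"]
sift_raw = ["deleterious", "deleterious", "tolerated", "deleterious",
            "deleterious", "tolerated", "tolerated", "tolerated"]
sift_calls = [SIFT_TO_CLASS[c] for c in sift_raw]

agreement = sum(a == b for a, b in zip(svm_calls, sift_calls)) / len(svm_calls)
kappa = cohen_kappa(svm_calls, sift_calls)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

If sklearn is already in use, `sklearn.metrics.cohen_kappa_score` computes the same statistic.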
III. Compare and correlate both results with disease severity
- Compare and correlate the results of the first two steps with the patient's disease severity and draw inferences
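If severity is recorded as an ordinal grade, a rank correlation is one way this step could be done. The sketch below assumes a per-patient pathogenicity score (for example the SVC decision value) and a severity grade from 1 to 3; both lists are invented, and the variable names are hypothetical.

```python
from scipy.stats import spearmanr

# Invented values for six patients, for illustration only.
pathogenicity_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]  # e.g. classifier score
severity_grade = [1, 1, 2, 3, 3, 2]                    # ordinal: 1 mild, 3 severe

# Spearman's rho works on ranks, so it suits ordinal grades and handles ties.
rho, p_value = spearmanr(pathogenicity_score, severity_grade)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```

With only a handful of patients the p-value will be unstable, so the coefficient is better read as descriptive than as a formal test.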
Questions
- Does this workflow make sense?
- Any suggestions/advice/opinions to improve the workflow?
- For small datasets, are SVM and random forest (since it uses decision trees) better than other methods?
- Can I use only pathogenic and likely pathogenic variants, or should I use variants of uncertain significance too for training?
The objective is to predict/correlate the pathogenicity of a variant with the patient's phenotype/severity.
The reason I am predicting rather than using known prediction tools is that I would be using SVM with the Position Specific Method on a rare germline disorder with specific genes that are known to cause the disorder.
Thanks in advance for your time