Question

How to measure the significance of an improvement between two methods during modeling

6 months ago

rheab1230 ▴ 140

Hello everyone

I have two methods for developing models. Model1 is trained on one set of input features (set1), and model2 is trained on a larger set (the set1 features plus extra set2 features).

I then compute the Spearman correlation for each model to compare them, and I see that model2, which has more features, gives a better correlation.

For example: Gene1 using model1 has correlation around 0.56

Gene1 using model2 has correlation around 0.62

So the improvement by adding more set2 features is around 0.06

Now, I want to understand whether this improvement of around 0.06 is significant or whether it is due to random noise/fluctuations.

Can anyone please tell me how I can measure the significance of the improvement for each gene model, i.e. whether the gain from adding the set2 features comes from more informative features rather than from random noise/fluctuations?

Thank you

spearman correlation • 407 views

updated 6 months ago by Jean-Karim Heriche 27k • written 6 months ago by rheab1230 ▴ 140

I think what you're looking for is cross-validation. That said, if you're not limited to strictly separate set 1 and set 2, I would start with the full set of features (set 1 + set 2) and use feature selection methods and feature importance to select the best subset of features.

6 months ago by Jean-Karim Heriche 27k

Hello, I have used cross-validation and feature selection on set1 and set2 to select important features. So we start with the full set (set1 and set2) and do feature selection to get 150 features from set1 and 150 from (set1 + set2).

6 months ago by rheab1230 ▴ 140

Cross-validation gives you an idea of the distribution of your target measure (i.e. you get one estimate for each fold), so you can compare the distributions of these values between the two models.
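
One way to sketch this comparison, assuming both models were scored on the same folds and you have one Spearman correlation per fold for each model, is a paired sign-flip permutation test on the per-fold differences. The function name `paired_sign_flip_test` and the per-fold values below are hypothetical, chosen only for illustration, and this is not necessarily the test the reply had in mind. Because folds share training data, the scores are not fully independent, so the p-value should be treated as approximate.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Exact one-sided paired sign-flip test on per-fold score differences.

    Returns (mean of b - a, p-value) for the alternative hypothesis
    that model b scores higher than model a.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per fold for each model, from the same folds")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)

    # Under the null hypothesis each difference is equally likely to be
    # positive or negative, so enumerate all 2^k sign assignments and count
    # how often the mean difference is at least as large as the observed one.
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if mean(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            at_least_as_large += 1
    return observed, at_least_as_large / total


# Hypothetical per-fold Spearman correlations for one gene (not real data).
model1_folds = [0.55, 0.58, 0.54, 0.57, 0.56]
model2_folds = [0.61, 0.63, 0.60, 0.64, 0.62]

improvement, p_value = paired_sign_flip_test(model1_folds, model2_folds)
print(f"mean improvement = {improvement:.3f}, one-sided p = {p_value:.5f}")
```

With five folds the smallest one-sided p-value this test can return is 1/32, so more folds or repeated cross-validation gives finer resolution. A Wilcoxon signed-rank test or a paired t-test on the same per-fold differences would be alternative choices, and when testing many genes the resulting p-values would need a multiple-testing correction.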

6 months ago by Jean-Karim Heriche 27k