Dear all, I am new to machine learning and am trying to use XGBoost to find feature importances and plot the ROC curve (AUC) for my data, but the classes are imbalanced: 24 control samples versus 153 diseased samples. I tried downsampling the diseased class, but I don't know whether to downsample the whole dataset before splitting it into training and test sets, or after the split. If after, should I downsample the test data or the training data, and why?
I hope someone can explain this to me and point me to some informative tutorials. Regards,
Tutorial:
https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
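On the question of when to downsample: a common practice is to split first and then resample only the training portion, so that the test set keeps the original class ratio and the evaluation reflects the data the model would actually see. Below is a minimal sketch of that order of operations using only the Python standard library. The helper names `stratified_split` and `downsample_majority`, the 25% test fraction, and the seeds are all illustrative assumptions, not something taken from the tutorial.

```python
import random
from collections import Counter

# Labels with the class sizes from the question: 0 = control, 1 = diseased.
labels = [0] * 24 + [1] * 153
indices = list(range(len(labels)))


def stratified_split(indices, labels, test_fraction, seed):
    """Split indices into train/test, keeping class proportions similar in both."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def downsample_majority(train, labels, seed):
    """Randomly drop rows from the training indices until every class has
    as many rows as the smallest class. The test indices are never touched."""
    rng = random.Random(seed)
    by_class = {}
    for i in train:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced


# Step 1: split first. Step 2: downsample the training part only.
train_idx, test_idx = stratified_split(indices, labels, test_fraction=0.25, seed=0)
balanced_train_idx = downsample_majority(train_idx, labels, seed=0)

print(Counter(labels[i] for i in balanced_train_idx))  # class counts used for fitting
print(Counter(labels[i] for i in test_idx))            # class counts used for evaluation
```

With only 24 controls, downsampling leaves very little training data, which is one reason class weighting is often tried as an alternative to discarding rows.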
Thank you very much for this tutorial. I used it, but on the first trial with my data the AUC seemed to show a poor fit, at around 0.78, while when I tried again the AUC seemed to show overfitting, as it was equal to 1. I tried many more times and it is still 1. Do you have any explanation, and how can I overcome this overfitting? Is it a good idea to use another model? If so, what do you recommend? Or should I just be satisfied with the first result of 0.78?
I have no idea what exactly you have done, so it is impossible to give you meaningful advice. Did you use scale_pos_weight? Did you try varying its value? Did you perform parameter tuning? It is possible to get a classifier that legitimately has AUC = 1, as your classes may be very different from each other and therefore easy to classify. So we don't even know with certainty that you are overfitting. If you followed the tutorial, there are guidelines in it against overfitting. What would certainly help you is to collect more data for both classes, but especially for the control cases.
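For reference, a commonly suggested starting value for scale_pos_weight is the number of negative examples divided by the number of positive examples. Here is a small sketch of that arithmetic for the class sizes mentioned in this thread. Which class is labelled positive is an assumption (the thread does not say), so both cases are shown; the function name `suggested_scale_pos_weight` is hypothetical.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Common starting heuristic: number of negative rows / number of positive rows."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


n_control, n_diseased = 24, 153

# Assumption A: diseased is the positive class (label 1), so positives are the majority.
spw_diseased_positive = suggested_scale_pos_weight(n_control, n_diseased)

# Assumption B: control is the positive class (label 1), so positives are the minority.
spw_control_positive = suggested_scale_pos_weight(n_diseased, n_control)

print(round(spw_diseased_positive, 3), round(spw_control_positive, 3))

# The chosen value would then be passed to the model, for example as
# xgboost.XGBClassifier(scale_pos_weight=spw_control_positive).
```

This is only a starting point: the value is usually varied around the heuristic and compared with cross-validation rather than taken as final.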
I followed the tutorial and the results are the same: the AUC is nearly 1. But when I used this tutorial instead: https://www.kaggle.com/saxinou/imbalanced-data-xgboost-tunning , the AUC decreased. The difference between the two is mainly in the XGBoost parameters.
I'd like to ask one more question, please: should the random state used for splitting the data be equal to the random state of the classifier?
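As far as I know, the two random states control independent things: the split seed decides which rows go to the training and test sets, while the classifier seed decides the model's internal randomness (such as row subsampling). They do not need to be equal; each only needs to be fixed if you want reproducible results. The sketch below illustrates this with scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data with the class sizes from this thread. The seed values, the helper `run`, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

SPLIT_SEED = 0   # hypothetical: controls which rows land in train vs. test
MODEL_SEED = 42  # hypothetical: controls the model's internal row subsampling

# Synthetic stand-in data: 24 controls (0) and 153 diseased (1), 5 made-up features.
data_rng = np.random.default_rng(123)
y = np.array([0] * 24 + [1] * 153)
X = data_rng.normal(size=(y.size, 5)) + y[:, None] * 0.5


def run(split_seed, model_seed):
    """Split with one seed, fit with another, return test labels and scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=split_seed
    )
    model = GradientBoostingClassifier(
        n_estimators=20, subsample=0.8, random_state=model_seed
    )
    model.fit(X_train, y_train)
    return y_test, model.predict_proba(X_test)[:, 1]


# Same pair of (different) seeds twice: the whole run is reproducible.
y_test_a, proba_a = run(SPLIT_SEED, MODEL_SEED)
y_test_b, proba_b = run(SPLIT_SEED, MODEL_SEED)
print(np.allclose(proba_a, proba_b))

# Changing only the model seed leaves the split itself unchanged.
y_test_c, proba_c = run(SPLIT_SEED, 7)
print(np.array_equal(y_test_a, y_test_c))
```

Using different seeds for the two steps is fine; what matters for comparing experiments is that the split seed stays fixed while you vary the model.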