I've trained a gradient boosting machine (GBM) model on my data. Evaluated with ROC analysis, the test set produced a decent AUC of 0.73.
My data consists of around 1,700 samples and 45 features. The samples are cancer patients who took a drug, and we record whether they responded (positive) or did not respond (negative); this binary response is the target variable. The data includes multiple cancer types, each with about 200 to 600 samples.
As the next step in my assessment, I built a confusion matrix, and the results were terrible: the model predicted almost all samples as negative, so the sensitivity and PPV were very low.
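To make the setup concrete, here is a schematic version of that step; the object and column names (`test_df`, `pred_prob`, `response`) are placeholders for my real data, and the 0.5 cutoff is just the usual default for turning probabilities into classes:

```r
# Placeholders: test_df is the test set, pred_prob the GBM scores,
# response the true class; 0.5 is the default probability cutoff
pred_class <- factor(ifelse(test_df$pred_prob >= 0.5, "positive", "negative"),
                     levels = c("negative", "positive"))
table(Predicted = pred_class, Actual = test_df$response)

# sensitivity and PPV at that cutoff
caret::confusionMatrix(pred_class, test_df$response, positive = "positive")
```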
Now I've been asked to use the cutpointr package, which I don't fully understand. As far as I can tell, it finds the optimal cutpoint on a ROC curve and then classifies samples as positive or negative according to that cutpoint.
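From skimming the documentation, I think basic usage looks roughly like this; the column names (`pred_prob`, `response`, `cancer_type`) are placeholders for my data, and I'm not sure the metric choice is the right one:

```r
library(cutpointr)

# one row per test-set sample: GBM score, true class, cancer type
cp <- cutpointr(test_df, x = pred_prob, class = response,
                pos_class = "positive", direction = ">=",
                method = maximize_metric, metric = youden)
summary(cp)   # optimal cutpoint plus sensitivity/specificity at that cut
plot(cp)      # ROC curve and the metric as a function of the cutpoint

# there also seems to be a subgroup argument for per-group cutpoints
cp_by_type <- cutpointr(test_df, x = pred_prob, class = response,
                        subgroup = cancer_type, pos_class = "positive",
                        direction = ">=", metric = youden)
summary(cp_by_type)
```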
But that doesn't make much sense to me, because this is the whole idea of a ROC curve: the point closest to the top-left corner is the optimal cut. So why do we need this package? And what do they mean by "find the optimal cut and use it to find the number of positives and negatives"?
And now for the most important question, the real reason I'm using this package: I need to find the number of positive samples in each cancer type, both in the train set and in the test set.
I don't know if I'm asking too much here; I've been told it's just two or three lines of code, but I have no clue where to start. My rough guess is sketched below, but I doubt it's right. Please help!
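Here is that guess, with the same placeholder names as above; I'm not sure whether the true positives or the predicted positives at the optimal cutpoint are wanted:

```r
# true positives per cancer type in each split (placeholder column names)
table(train_df$cancer_type, train_df$response)
table(test_df$cancer_type,  test_df$response)

# or, if the *predicted* positives at the optimal cutpoint are wanted,
# reusing the cutpointr result from above:
test_df$pred_class <- ifelse(test_df$pred_prob >= cp$optimal_cutpoint,
                             "positive", "negative")
table(test_df$cancer_type, test_df$pred_class)
```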
Note: if I need to provide any data samples, please let me know.