I am building several machine learning models to predict an outcome and found slight differences between them. Say the AUC of model A is 0.81 and the AUC of model B is 0.83. Does model B significantly improve predictive power and outperform model A? How large a difference in AUC counts as a significant improvement? Any comments would be greatly appreciated.
If you are comparing among your own models, I don't think it matters whether the improvement is significant. I would use the best model available, regardless of how small its edge over the next best is.
To your specific question: I don't know whether model B is better than A, even though its AUC is higher. AUC essentially tells us whether the predictions are in the correct order. Roughly speaking, it means that members of the 0 class are predicted with smaller probabilities than members of the 1 class. But that allows 0s to have predicted probabilities above 0.5, as long as they come before the 1s, and 1s to have probabilities below 0.5, as long as they come after the 0s. Because of that, classifier B may have a higher AUC than classifier A but lower accuracy. The two usually correlate, but not always.

This is a long way of saying that I suggest looking at multiple measures of classifier quality. In particular, calculate the log-loss for both models. If model B has a higher AUC and a lower log-loss than model A, that strengthens its case as the better model. I would never use plain accuracy as an optimization target during model fitting, but in the end it is worth checking how the two models stack up in that regard as well.
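To make the comparison concrete, here is a minimal sketch in plain Python (toy labels and probabilities invented for illustration, no ML library assumed) that computes both metrics for two hypothetical models. Model B ranks the classes perfectly, so its AUC is higher, yet its timid probabilities give it a worse log-loss than model A:

```python
import math

def roc_auc(y_true, y_score):
    # Rank (Mann-Whitney) form of AUC: the probability that a randomly
    # chosen positive is scored above a randomly chosen negative
    # (ties count as 0.5).
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob):
    # Mean negative log of the probability assigned to the true class.
    eps = 1e-15  # clip to avoid log(0)
    return -sum(math.log(min(max(p if y == 1 else 1 - p, eps), 1 - eps))
                for y, p in zip(y_true, y_prob)) / len(y_true)

y   = [0, 0, 0, 1, 1, 1]                      # true labels
p_a = [0.20, 0.30, 0.60, 0.40, 0.70, 0.80]    # model A: confident, one inversion
p_b = [0.45, 0.48, 0.49, 0.51, 0.52, 0.55]    # model B: perfect order, timid

print(roc_auc(y, p_a), log_loss(y, p_a))  # AUC ≈ 0.889, log-loss ≈ 0.499
print(roc_auc(y, p_b), log_loss(y, p_b))  # AUC = 1.0,  log-loss ≈ 0.642
```

The point of the toy numbers is exactly the one made above: the two measures can disagree, so neither alone settles which model is better.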
AUC as a single measure of classifier quality works best for highly imbalanced sets, where the ratio of the two classes is 3:1 or higher. In such cases it is highly unlikely (though still possible) that an improved AUC would come with lower accuracy or higher log-loss. If the dataset is well balanced, it may be worth trying log-loss as an optimization proxy. Log-loss tells you not only how accurate the classifier is, but also how confident its predictions are. Here is a contrived example: say that all 0s in the dataset are predicted with a probability of 0.49, and all 1s with a probability of 0.51. That classifier would have perfect accuracy and a perfect AUC of 1. Yet it would have a terrible log-loss, around 0.67, barely better than the ln 2 ≈ 0.69 of a coin-flip classifier, because the predictions have almost no separation and are therefore not confident.
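The contrived example can be checked numerically; here is a short sketch in plain Python using the exact values from the example (five examples per class, an arbitrary choice, since the class counts don't affect the result):

```python
import math

y_true = [0] * 5 + [1] * 5
y_prob = [0.49] * 5 + [0.51] * 5  # all 0s at 0.49, all 1s at 0.51

# Accuracy at the usual 0.5 threshold: every prediction lands on the right side.
acc = sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, y_prob)) / len(y_true)

# AUC: every 1 is scored above every 0, so the ranking is perfect.
pos = [p for y, p in zip(y_true, y_prob) if y == 1]
neg = [p for y, p in zip(y_true, y_prob) if y == 0]
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

# Log-loss: mean negative log of the probability given to the true class.
# Every example gets probability 0.51 for its true class, so this is -log(0.51).
ll = -sum(math.log(p if y == 1 else 1 - p)
          for y, p in zip(y_true, y_prob)) / len(y_true)

print(acc, auc, round(ll, 3))  # 1.0 1.0 0.673
```

So the classifier is perfect by accuracy and AUC, while its log-loss sits just below the coin-flip baseline.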
Many thanks for the comprehensive explanation and suggestions. Things are much clearer to me now.