Precision/Recall on Imbalanced Test Data

January 24, 2023
Abstract

In this paper we study the problem of accurately estimating the precision and recall of a binary classifier when the classes are imbalanced and only a limited number of human labels (a test set) are available. A common strategy in this setting is to over-sample the small positive class predicted by the classifier. But how much should we over-sample, and what confidence/credible intervals can we deduce from the resulting sample? We provide formulas for (1) confidence intervals of the adjusted precision/recall after over-sampling and (2) Bayesian credible intervals of the adjusted precision/recall. For precision, the higher the over-sampling rate, the narrower the confidence/credible interval. For recall, there exists an optimal over-sampling ratio that minimizes the width of the confidence/credible interval. We also present experiments on synthetic and real data that demonstrate our method's ability to construct accurate intervals. Finally, we show how our techniques apply to a quality monitoring system: we find the smallest editorial test set for a collection of classifiers such that the estimated precision and recall are within a 5% error rate.
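To make the setting concrete (the paper's own interval formulas appear in the full text and are not reproduced here), the sketch below estimates precision and recall from a stratified test set in which the predicted-positive class is over-sampled, using a Wilson interval for precision and a parametric bootstrap for recall. The function names, the bootstrap, and the numbers in the usage example are illustrative assumptions of this sketch, not the paper's method.

import numpy as np
from scipy import stats

def wilson_interval(k, n, alpha=0.05):
    # Wilson score interval for a binomial proportion k/n.
    z = stats.norm.ppf(1 - alpha / 2)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def stratified_precision_recall(k_pos, n_pos, k_neg, n_neg, pi, alpha=0.05):
    # k_pos true positives among n_pos labels drawn from the
    # predicted-positive stratum; k_neg positives among n_neg labels
    # drawn from the predicted-negative stratum; pi is the fraction of
    # the full (unlabeled) population the classifier predicts positive.
    p_hat = k_pos / n_pos          # precision = P(y = 1 | yhat = 1)
    q_hat = k_neg / n_neg          # miss rate = P(y = 1 | yhat = 0)

    # Precision is estimated directly from the positive stratum, so a
    # plain binomial interval applies; heavier over-sampling means a
    # larger n_pos and a narrower interval, as the abstract notes.
    prec_ci = wilson_interval(k_pos, n_pos, alpha)

    # Recall mixes both strata: R = pi*p / (pi*p + (1 - pi)*q).
    recall_hat = pi * p_hat / (pi * p_hat + (1 - pi) * q_hat)

    # Interval for recall via a simple parametric bootstrap (an
    # assumption of this sketch, not the paper's interval formula).
    rng = np.random.default_rng(0)
    p_sim = rng.binomial(n_pos, p_hat, size=20_000) / n_pos
    q_sim = rng.binomial(n_neg, q_hat, size=20_000) / n_neg
    r_sim = pi * p_sim / (pi * p_sim + (1 - pi) * q_sim + 1e-12)
    rec_ci = tuple(np.quantile(r_sim, [alpha / 2, 1 - alpha / 2]))
    return (p_hat, prec_ci), (recall_hat, rec_ci)

# Hypothetical usage: 400 labels from the 1% predicted-positive slice,
# 600 labels from the predicted-negative remainder.
(prec, prec_ci), (rec, rec_ci) = stratified_precision_recall(
    k_pos=320, n_pos=400, k_neg=6, n_neg=600, pi=0.01)

Varying the split between n_pos and n_neg in such a simulation reproduces the abstract's qualitative claim: the precision interval only tightens with more over-sampling, while the recall interval is narrowest at an intermediate over-sampling ratio.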

Publication Type: Paper
Conference / Journal Name: AISTATS '23

BibTeX


@inproceedings{
    author = {},
    title = {Precision/Recall on Imbalanced Test Data},
    booktitle = {Proceedings of AISTATS '23},
    year = {2023}
}