Estimating the Number of Distinct Items in a Database by Sampling

January 1, 2019
Abstract

Counting the number of distinct items in a dataset is a well known computational problem with numerous applications. Sometimes, exact counting is infeasible, and one must use some approximation method. One approach to approximation is to estimate the number of distinct items from a random sample. This approach is useful, for example, when the dataset is too big, or when only a sample is available, but not the entire data. Moreover, it can considerably speed up the computation. In statistics, this problem is known as the Unseen Species Problem. In this paper, we propose an estimation method for this problem, which is especially suitable for cases where the sample is much smaller than the entire set, and the number of repetitions of each item is relatively small. Our method is simple in comparison to known methods, and gives good enough estimates to make it useful in certain real life datasets that arise in data mining scenarios. We demonstrate our method on real data where the task at hand is to estimate the number of duplicate URLs.

Download
Publication Type
Paper
Conference / Journal Name
CIKM

BibTeX


@inproceedings{
    author = {},
    title = {‌Estimating the Number of Distinct Items in a Database by Sampling‌},
    booktitle = {Proceedings of CIKM‌},
    year = {‌2019‌}
}