Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms

Jayadev Acharya, Ilias Diakonikolas, Chinmay Hegde, Jerry Zheng Li, Ludwig Schmidt

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one-dimensional data distributions by histograms.

A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k-histogram that minimizes the ℓ₂-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε²), runs in sample-linear time O(m), and outputs an O(k)-histogram whose ℓ₂-distance from p is at most O(opt_k) + ε, where opt_k is the minimum ℓ₂-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n.
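To make the problem statement concrete, the following Python sketch builds an empirical distribution from samples, flattens a distribution into a histogram on given intervals, and computes opt_k with the classical exact dynamic program. It is only an illustration under stated assumptions: the dynamic program runs in O(n²k) time and is not the sample-linear algorithm described in the abstract, and all function names (`empirical_distribution`, `flatten`, `l2_distance`, `best_k_histogram`) are hypothetical. The sample size in the usage part takes the constant hidden in O(1/ε²) to be 1, which is also an assumption.

```python
import math
import random
from typing import List, Tuple


def empirical_distribution(samples: List[int], n: int) -> List[float]:
    """Empirical probability vector over {1,...,n} (stored 0-indexed)."""
    counts = [0.0] * n
    for s in samples:
        counts[s - 1] += 1.0
    m = len(samples)
    return [c / m for c in counts]


def flatten(p: List[float], boundaries: List[int]) -> List[float]:
    """Replace p by its mean on each interval [boundaries[i], boundaries[i+1])."""
    q = [0.0] * len(p)
    for lo, hi in zip(boundaries, boundaries[1:]):
        avg = sum(p[lo:hi]) / (hi - lo)
        for i in range(lo, hi):
            q[i] = avg
    return q


def l2_distance(p: List[float], q: List[float]) -> float:
    """Euclidean (l2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_k_histogram(p: List[float], k: int) -> Tuple[List[int], float]:
    """Exact dynamic program (NOT the paper's algorithm), O(n^2 k) time.

    Returns the interval boundaries of a best k-piece histogram and opt_k,
    the minimum l2 distance between p and any k-histogram.
    """
    n = len(p)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(p)")
    # Prefix sums of p and p^2 give the squared error of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(p):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        # Squared l2 error of replacing p[i:j] by its mean (clamped at 0).
        total = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    inf = float("inf")
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for t in range(1, k + 1):
        for j in range(t, n + 1):
            for i in range(t - 1, j):
                c = dp[t - 1][i] + cost(i, j)
                if c < dp[t][j]:
                    dp[t][j] = c
                    back[t][j] = i
    # Walk the back-pointers to recover the interval boundaries.
    boundaries = [n]
    j = n
    for t in range(k, 0, -1):
        j = back[t][j]
        boundaries.append(j)
    boundaries.reverse()
    return boundaries, math.sqrt(dp[k][n])


if __name__ == "__main__":
    # A distribution over n = 8 points that is exactly a 2-histogram.
    p = [0.2, 0.2, 0.2, 0.2, 0.05, 0.05, 0.05, 0.05]
    eps = 0.05
    m = math.ceil(1 / eps ** 2)  # assumed constant 1 in O(1/eps^2)
    random.seed(0)
    samples = random.choices(range(1, len(p) + 1), weights=p, k=m)
    p_hat = empirical_distribution(samples, len(p))
    boundaries, err = best_k_histogram(p_hat, 2)
    print("boundaries:", boundaries)
    print("l2 distance of fitted histogram from p:",
          l2_distance(flatten(p_hat, boundaries), p))
```

The independence from n has a simple intuition at the level of the empirical distribution p̂ built from m samples: E‖p̂ − p‖₂² = (1 − ‖p‖₂²)/m ≤ 1/m, so m on the order of 1/ε² samples already bring p̂ within ℓ₂-distance ε of p in expectation, whatever the universe size. The sketch above then fits a histogram to p̂ by brute force, whereas the abstract's contribution is doing this step in time O(m).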

We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate an improvement of one to two orders of magnitude in empirical running time over previous state-of-the-art algorithms.
Original language: English
Title of host publication: Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Publisher: Association for Computing Machinery (ACM)
Pages: 249-263
Number of pages: 15
ISBN (Electronic): 9781450327572
Publication status: Published, 20 May 2015

Publication series

Name: PODS '15
