Projects per year
Abstract
Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one dimensional data distributions by histograms.
A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k histogram that minimizes the l2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε2), runs in sample-linear time O(m), and outputs an O(k)-histogram whose l2-distance from p is at most O(optk) +ε, where optk is the minimum l2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size.
We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude im rovement in terms of empirical running times over previous state-of-the-art algorithms.
A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k histogram that minimizes the l2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε2), runs in sample-linear time O(m), and outputs an O(k)-histogram whose l2-distance from p is at most O(optk) +ε, where optk is the minimum l2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size.
We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude im rovement in terms of empirical running times over previous state-of-the-art algorithms.
Original language | English |
---|---|
Title of host publication | Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems |
Publisher | Association for Computing Machinery (ACM) |
Pages | 249-263 |
Number of pages | 15 |
ISBN (Electronic) | 9781450327572 |
DOIs | |
Publication status | Published - 20 May 2015 |
Publication series
Name | PODS '15 |
---|
Fingerprint
Dive into the research topics of 'Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms'. Together they form a unique fingerprint.Projects
- 1 Finished
-
Sublinear Algorithms for Approximating Probability Distribution
Diakonikolas, I.
1/09/14 → 31/08/15
Project: Research