Learning Structured Distributions

Abstract
Discovering hidden structure in data is one of the cornerstones of modern
data analysis. Due to the diversity and complexity of modern data sets, this
is a very challenging task, and the role of efficient algorithms is of paramount
importance in this context. The majority of available data sets are in raw
and unstructured form, consisting of example points without corresponding
labels. A large class of unlabeled datasets can be modeled as samples from a
probability distribution over a very large domain. An important goal in the
exploration of these datasets is understanding the underlying distributions.
Estimating distributions from samples is a paradigmatic and fundamental
unsupervised learning problem that has been studied in statistics since the late
nineteenth century, starting with the pioneering work of Karl Pearson [55].
During the past couple of decades, there has been a large body of work in
computer science on this topic with a focus on computational efficiency.
The area of distribution estimation is well-motivated in its own right, and
has seen a recent surge of research activity, in part due to the ubiquity of
structured distributions in the natural and social sciences. Such structural
properties of distributions are sometimes direct consequences of the underlying
application problem, or they are a plausible explanation of the model under
investigation.
In this chapter, we give a survey of both classical and modern techniques
for distribution estimation, with a focus on recent algorithmic ideas developed
in theoretical computer science. These ideas have led to computationally and
statistically efficient algorithms for learning broad families of models. For the
sake of concreteness, we illustrate these ideas with specific examples. Finally,
we highlight outstanding challenges and research directions for future work.
Original language | English
---|---
Title of host publication | Handbook of Big Data
Publisher | CRC Press
Pages | 267-288
Number of pages | 22
Publication status | Published - 2016