## Abstract

Discovering hidden structure in data is one of the cornerstones of modern

data analysis. Due to the diversity and complexity of modern data sets, this

is a very challenging task and the role of efficient algorithms is of paramount

importance in this context. The majority of available data sets are in raw

and unstructured form, consisting of example points without corresponding

labels. A large class of unlabeled datasets can be modeled as samples from a

probability distribution over a very large domain. An important goal in the

exploration of these datasets is understanding the underlying distributions.

Estimating distributions from samples is a paradigmatic and fundamental

unsupervised learning problem that has been studied in statistics since the late

nineteenth century, starting with the pioneering work of Karl Pearson [55].

During the past couple of decades, there has been a large body of work in

computer science on this topic with a focus on computational efficiency.

The area of distribution estimation is well-motivated in its own right, and

has seen a recent surge of research activity, in part due to the ubiquity of

structured distributions in the natural and social sciences. Such structural

properties of distributions are sometimes direct consequences of the underlying

1

2 Learning Structured Distributions

application problem, or they are a plausible explanation of the model under

investigation.

In this chapter, we give a survey of both classical and modern techniques

for distribution estimation, with a focus on recent algorithmic ideas developed

in theoretical computer science. These ideas have led to computationally and

statistically efficient algorithms for learning broad families of models. For the

sake of concreteness, we illustrate these ideas with specific examples. Finally,

we highlight outstanding challenges and research directions for future work.

data analysis. Due to the diversity and complexity of modern data sets, this

is a very challenging task and the role of efficient algorithms is of paramount

importance in this context. The majority of available data sets are in raw

and unstructured form, consisting of example points without corresponding

labels. A large class of unlabeled datasets can be modeled as samples from a

probability distribution over a very large domain. An important goal in the

exploration of these datasets is understanding the underlying distributions.

Estimating distributions from samples is a paradigmatic and fundamental

unsupervised learning problem that has been studied in statistics since the late

nineteenth century, starting with the pioneering work of Karl Pearson [55].

During the past couple of decades, there has been a large body of work in

computer science on this topic with a focus on computational efficiency.

The area of distribution estimation is well-motivated in its own right, and

has seen a recent surge of research activity, in part due to the ubiquity of

structured distributions in the natural and social sciences. Such structural

properties of distributions are sometimes direct consequences of the underlying

1

2 Learning Structured Distributions

application problem, or they are a plausible explanation of the model under

investigation.

In this chapter, we give a survey of both classical and modern techniques

for distribution estimation, with a focus on recent algorithmic ideas developed

in theoretical computer science. These ideas have led to computationally and

statistically efficient algorithms for learning broad families of models. For the

sake of concreteness, we illustrate these ideas with specific examples. Finally,

we highlight outstanding challenges and research directions for future work.

Original language | English |
---|---|

Title of host publication | Handbook of Big Data |

Publisher | CRC Press |

Pages | 267-288 |

Number of pages | 22 |

Publication status | Published - 2016 |