Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, Andreas Krause

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Many large-scale machine learning problems (such as clustering, non-parametric learning, and kernel machines) require selecting a manageable, representative subset out of a massive data set. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set, but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol, GreeDi, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.
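The two-stage protocol described in the abstract can be sketched as follows: partition the data across machines, run the classical greedy algorithm locally on each partition (the "map" stage), then run greedy once more over the union of the local solutions (the "reduce" stage). The sketch below is a minimal illustration, not the authors' implementation; the function names (`greedy`, `greedi`) and the toy set-coverage objective are assumptions for demonstration.

```python
import random

def greedy(candidates, k, f):
    """Classical greedy: repeatedly add the element with the largest marginal gain."""
    selected = []
    candidates = list(candidates)
    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=lambda e: f(selected + [e]) - f(selected))
        selected.append(best)
        candidates.remove(best)
    return selected

def greedi(data, k, m, f):
    """Two-stage sketch of GreeDi: partition the data over m machines,
    run greedy locally on each partition, then run greedy again over
    the union of the local solutions."""
    partitions = [data[i::m] for i in range(m)]
    local_picks = [greedy(p, k, f) for p in partitions]   # stage 1: parallelizable
    merged = [e for picks in local_picks for e in picks]  # at most m*k candidates
    return greedy(merged, k, f)                           # stage 2: central merge

# Toy monotone submodular objective: set coverage (illustrative only).
random.seed(0)
ground = [set(random.sample(range(100), 10)) for _ in range(200)]
coverage = lambda sel: len(set().union(*sel)) if sel else 0

solution = greedi(ground, k=5, m=4, f=coverage)
```

In a real MapReduce deployment, stage 1 runs as independent mappers over data shards and only the `m*k` locally selected elements are shipped to a single reducer, which is what makes the protocol communication-efficient.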
Original language: English
Title of host publication: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.
Number of pages: 9
Publication status: Published - 2013

