Communication-Optimal Distributed Clustering

Jiechao Chen, He Sun, David Woodruff, Qin Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n*s, while in the broadcast model it is n + s. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.
Original languageEnglish
Title of host publicationAdvances in Neural Information Processing Systems 29 (NIPS 2016)
Place of PublicationBarcelona, Spain
PublisherCurran Associates Inc
Pages3727-3735
Number of pages9
Publication statusPublished - 10 Dec 2016
Event30th Annual Conference on Neural Information Processing Systems - Barcelona, Spain
Duration: 5 Dec 201610 Dec 2016
https://nips.cc/Conferences/2016

Publication series

NameAdvances in Neural Information Processing Systems
Volume29

Conference

Conference30th Annual Conference on Neural Information Processing Systems
Abbreviated titleNIPS 2016
Country/TerritorySpain
CityBarcelona
Period5/12/1610/12/16
Internet address

Fingerprint

Dive into the research topics of 'Communication-Optimal Distributed Clustering'. Together they form a unique fingerprint.

Cite this