Trade-Offs in Large-Scale Distributed Tuplewise Estimation And Learning

Robin Vogel, Aurélien Bellet, Stephan Clémençon, Ons Jelassi, Guillaume Papa

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points---such as metric learning, clustering or ranking---do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance between statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps rules the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
Original languageEnglish
Title of host publicationMachine Learning and Knowledge Discovery in Databases
EditorsUlf Brefeld, Elisa Fromont, Andreas Hotho, Arno Knobbe, Marloes Maathuis, Céline Robardet
Place of PublicationCham
PublisherSpringer
Pages229-245
Number of pages17
ISBN (Electronic)978-3-030-46147-8
ISBN (Print)978-3-030-46146-1
DOIs
Publication statusPublished - 30 Apr 2020
EventThe European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2019 - https://ecmlpkdd2019.org/, Würzburg, Germany
Duration: 16 Sept 201920 Sept 2019

Publication series

NameLecture Notes in Computer Science
PublisherSpringer
Volume11907
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

ConferenceThe European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2019
Abbreviated titleECMLPKDD 2019
Country/TerritoryGermany
CityWürzburg
Period16/09/1920/09/19

Keywords / Materials (for Non-textual outputs)

  • Distributed machine learning
  • Distributed data processing
  • U-Statistics
  • Stochastic gradient descent
  • AUC optimization

Fingerprint

Dive into the research topics of 'Trade-Offs in Large-Scale Distributed Tuplewise Estimation And Learning'. Together they form a unique fingerprint.

Cite this