Abstract
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program.
We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
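To make the Adaptation Policy idea concrete, here is a minimal sketch in Python of a policy that reads a monitored metric each step and triggers a control action. All names here (`NoiseScalePolicy`, `on_step`, `cluster.resize`, `StubCluster`) are illustrative assumptions, not KungFu's actual API; the paper and the library define the real interfaces.

```python
# Hypothetical sketch of an Adaptation Policy (AP): it consumes monitored
# training metrics and emits control actions such as cluster rescaling.
# None of these names are KungFu's real API.

class NoiseScalePolicy:
    """Grow the cluster when the gradient noise scale suggests a larger
    effective batch size would still train efficiently."""

    def __init__(self, threshold, max_workers):
        self.threshold = threshold
        self.max_workers = max_workers

    def on_step(self, step, metrics, cluster):
        # `metrics` stands in for values produced by monitoring operators
        # that the system embeds in the dataflow graph.
        noise_scale = metrics["noise_scale"]
        if noise_scale > self.threshold and cluster.size < self.max_workers:
            # Control action: request a rescale; in a real system the
            # communication layer must apply it consistently on all workers.
            cluster.resize(min(cluster.size * 2, self.max_workers))


class StubCluster:
    """Stand-alone stub so this sketch runs without a real cluster."""

    def __init__(self, size):
        self.size = size

    def resize(self, new_size):
        print(f"rescaling cluster: {self.size} -> {new_size} workers")
        self.size = new_size


if __name__ == "__main__":
    policy = NoiseScalePolicy(threshold=1000.0, max_workers=16)
    cluster = StubCluster(size=4)
    # Feed a toy sequence of monitored noise-scale values.
    for step, ns in enumerate([200.0, 800.0, 1500.0, 1200.0]):
        policy.on_step(step, {"noise_scale": ns}, cluster)
```

In this shape, the policy logic stays separate from the training loop: the system invokes the policy with fresh metrics at each step, which mirrors the paper's division between monitoring operators (producing metrics) and control operators (applying actions).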
Original language | English |
---|---|
Title of host publication | 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) |
Publisher | USENIX Association |
Pages | 937-954 |
Number of pages | 18 |
ISBN (Print) | 978-1-939133-19-9 |
Publication status | Published - 4 Nov 2020 |
Event | 14th USENIX Symposium on Operating Systems Design and Implementation, Banff, Canada, 4 Nov 2020 → 6 Nov 2020, https://www.usenix.org/conference/osdi20 |
Symposium
Symposium | 14th USENIX Symposium on Operating Systems Design and Implementation |
---|---|
Abbreviated title | OSDI 2020 |
Country/Territory | Canada |
City | Banff |
Period | 4/11/20 → 6/11/20 |
Internet address | https://www.usenix.org/conference/osdi20 |
Profiles
- Luo Mai
  - School of Informatics - Lecturer in Data Centric Systems
  - Institute for Computing Systems Architecture
  - Computer Systems

  Person: Academic: Research Active