Abstract
To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, which perform long-running stateful computation in which many workers maintain and synchronise model replicas. With transient VMs, existing systems either require a fixed number of reserved VMs or degrade performance when recovering from revoked transient VMs. We believe that future distributed ML systems must be designed from the ground up for transient cloud resources. This paper describes SPOTNIK, a system for training ML models that features a more adaptive design to accommodate transient VMs: (i) SPOTNIK uses an adaptive implementation of the all-reduce collective communication operation. As workers on transient VMs are revoked, SPOTNIK updates its membership and uses the all-reduce ring to recover; and (ii) SPOTNIK supports the adaptation of the synchronisation strategy between workers. This allows a training job to switch between different strategies in response to the revocation of transient VMs. Our experiments show that, after VM revocation, SPOTNIK recovers training within 300 ms for ResNet/ImageNet.
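As a rough illustration of the first mechanism, the sketch below simulates a ring all-reduce that rebuilds its membership after a worker on a revoked VM drops out. It is not SPOTNIK's implementation: the names (`Worker`, `rebuild_ring`, `ring_all_reduce`) and the simplified single-pass reduction are assumptions made for illustration only.

```python
# Minimal sketch (not SPOTNIK's actual code): a ring all-reduce that
# drops revoked workers from its membership and reduces over survivors.
# All names and the one-pass, unchunked reduction are hypothetical.

from typing import Dict, List

class Worker:
    """One model replica; `grads` stands in for its local gradients."""
    def __init__(self, worker_id: int, grads: List[float]):
        self.worker_id = worker_id
        self.grads = grads
        self.alive = True

def rebuild_ring(workers: Dict[int, Worker]) -> List[Worker]:
    """Membership update: drop revoked workers, order survivors into a ring."""
    return [w for _, w in sorted(workers.items()) if w.alive]

def ring_all_reduce(ring: List[Worker]) -> List[float]:
    """Sum gradients around the ring (simplified: one pass, no chunking)."""
    total = [0.0] * len(ring[0].grads)
    for w in ring:                      # each hop adds its local gradients
        total = [t + g for t, g in zip(total, w.grads)]
    for w in ring:                      # broadcast the reduced result back
        w.grads = list(total)
    return total

workers = {i: Worker(i, [float(i)] * 4) for i in range(4)}
workers[2].alive = False                # VM 2 is revoked mid-training
ring = rebuild_ring(workers)            # three survivors form a new ring
print(ring_all_reduce(ring))            # -> [4.0, 4.0, 4.0, 4.0] (0 + 1 + 3)
```

A production all-reduce would additionally chunk tensors across ring segments and overlap communication with computation; the sketch keeps only the membership-recovery idea that the abstract describes.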
Original language | English
---|---|
Title of host publication | HotCloud'20: Proceedings of the 12th USENIX Conference on Hot Topics in Cloud Computing
Publisher | USENIX Association
Number of pages | 8
Publication status | Published - 14 Jul 2020
Event | 12th USENIX Workshop on Hot Topics in Cloud Computing, 13 Jul 2020 → 14 Jul 2020 (https://www.usenix.org/conference/hotcloud20)
Workshop

Workshop | 12th USENIX Workshop on Hot Topics in Cloud Computing
---|---|
Abbreviated title | HotCloud '20
Period | 13/07/20 → 14/07/20
Internet address | https://www.usenix.org/conference/hotcloud20