Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

Marcel Wagenländer, Luo Mai, Li Guo, Peter Pietzuch

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, which perform long-running stateful computation in which many workers maintain and synchronise model replicas. With transient VMs, existing systems either require a fixed number of reserved VMs or degrade performance when recovering from revoked transient VMs. We believe that future distributed ML systems must be designed from the ground up for transient cloud resources. This paper describes SPOTNIK, a system for training ML models that features a more adaptive design to accommodate transient VMs: (i) SPOTNIK uses an adaptive implementation of the all-reduce collective communication operation. As workers on transient VMs are revoked, SPOTNIK updates its membership and uses the all-reduce ring to recover; and (ii) SPOTNIK supports the adaptation of the synchronisation strategy between workers. This allows a training job to switch between different strategies in response to the revocation of transient VMs. Our experiments show that, after VM revocation, SPOTNIK recovers training within 300 ms for ResNet/ImageNet.
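As an illustration of point (i), the following is a minimal single-process sketch of a ring all-reduce that rebuilds its ring from the surviving workers when one is revoked. It is not SPOTNIK's implementation: the names `WorkerRevoked`, `ring_all_reduce` and `all_reduce_with_recovery`, the `alive` set used to model revocation, and the retry-from-scratch recovery policy are all assumptions made for this example. A real system would detect revocation through network failures or provider notices rather than a set lookup.

```python
from typing import Dict, List, Set


class WorkerRevoked(Exception):
    """Hypothetical signal that a ring member is no longer reachable."""


def ring_all_reduce(grads: Dict[str, List[float]], alive: Set[str]) -> Dict[str, List[float]]:
    """Sum the per-worker vectors with a simulated ring all-reduce."""
    ring = sorted(grads)  # ring order: worker i sends to worker (i + 1) % n
    n = len(ring)
    for w in ring:
        if w not in alive:  # stand-in for a failed send/receive
            raise WorkerRevoked(w)
    length = len(grads[ring[0]])
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    bufs = {w: list(grads[w]) for w in ring}

    # Reduce-scatter: after n - 1 steps, worker i holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = []
        for i, w in enumerate(ring):
            lo, hi = bounds[(i - step) % n]
            sends.append(((i + 1) % n, lo, bufs[w][lo:hi]))
        for dst, lo, data in sends:
            buf = bufs[ring[dst]]
            for k, v in enumerate(data):
                buf[lo + k] += v

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i, w in enumerate(ring):
            lo, hi = bounds[(i + 1 - step) % n]
            sends.append(((i + 1) % n, lo, hi, bufs[w][lo:hi]))
        for dst, lo, hi, data in sends:
            bufs[ring[dst]][lo:hi] = data
    return bufs


def all_reduce_with_recovery(grads: Dict[str, List[float]], alive: Set[str]) -> Dict[str, List[float]]:
    """Retry the all-reduce, dropping revoked workers from the membership."""
    members = dict(grads)
    while members:
        try:
            return ring_all_reduce(members, alive)
        except WorkerRevoked as err:
            members.pop(err.args[0])  # shrink the ring to the survivors
    raise RuntimeError("no workers left")


grads = {"w0": [1.0, 2.0, 3.0, 4.0], "w1": [10.0, 20.0, 30.0, 40.0], "w2": [100.0, 200.0, 300.0, 400.0]}
result = all_reduce_with_recovery(grads, alive={"w0", "w2"})  # w1 has been revoked
print(result["w0"])
```

In this sketch the revoked worker's contribution is simply discarded and the step is retried on the smaller ring; whether a training job should instead wait, rescale, or change its synchronisation strategy is the kind of policy choice the abstract's point (ii) refers to.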
Original language: English
Title of host publication: HotCloud'20: Proceedings of the 12th USENIX Conference on Hot Topics in Cloud Computing
Publisher: USENIX Association
Number of pages: 8
Publication status: Published - 14 Jul 2020
Event: 12th USENIX Workshop on Hot Topics in Cloud Computing
Duration: 13 Jul 2020 to 14 Jul 2020
https://www.usenix.org/conference/hotcloud20

Workshop

Workshop: 12th USENIX Workshop on Hot Topics in Cloud Computing
Abbreviated title: HotCloud '20
Period: 13/07/20 to 14/07/20
