Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to exchange only the largest gradients. However, doing so damages the gradient and degrades the model's performance: Transformer models degrade dramatically, while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node's locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
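The abstract describes the method only at a high level. As a rough illustration, the sketch below (plain NumPy, with hypothetical helper names top_k_sparsify and combined_update, and an illustrative top-1% threshold that is not taken from the paper) shows one way a node might merge an aggregated sparse global gradient with its own uncompressed local gradient; the paper's exact combination rule and hyperparameters may differ.

```python
import numpy as np

def top_k_sparsify(grad, k_ratio=0.01):
    # Keep only the k largest-magnitude gradient entries; zero out the rest.
    # k_ratio = 0.01 (top 1%) is an illustrative choice, not taken from the paper.
    flat = grad.ravel()
    k = max(1, int(k_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

def combined_update(local_grad, global_sparse_grad, num_nodes):
    # Hypothetical combination rule: where the aggregated sparse gradient carries
    # information, use the averaged global value; elsewhere, fall back on the
    # node's own uncompressed local gradient.
    mask = global_sparse_grad != 0
    return np.where(mask, global_sparse_grad / num_nodes, local_grad)

# Toy usage: two nodes' gradients for a single parameter tensor.
g0 = np.random.randn(4, 4)
g1 = np.random.randn(4, 4)
# Stand-in for an all-reduce over each node's sparsified gradient.
global_sparse = top_k_sparsify(g0) + top_k_sparsify(g1)
update_on_node0 = combined_update(g0, global_sparse, num_nodes=2)
```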
Original language: English
Title of host publication: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Place of publication: Hong Kong
Publisher: Association for Computational Linguistics
Pages: 3624–3629
Number of pages: 6
ISBN (Print): 978-1-950737-90-1
DOIs
Publication status: Published - 4 Nov 2019
Event: 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing - Hong Kong, Hong Kong
Duration: 3 Nov 2019 – 7 Nov 2019
https://www.emnlp-ijcnlp2019.org/

Conference

Conference: 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Abbreviated title: EMNLP-IJCNLP 2019
Country: Hong Kong
City: Hong Kong
Period: 3/11/19 – 7/11/19
Internet address: https://www.emnlp-ijcnlp2019.org/
