Abstract
One way to reduce network traffic in multinode data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
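The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the averaging of sparse gradients across nodes, and the plain additive combination rule with a `local_weight` parameter are all assumptions made for illustration, since the abstract does not specify how the global and local gradients are combined.

```python
# Minimal pure-Python sketch (assumptions throughout, see lead-in).

def top_k_sparsify(grad, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    if k <= 0:
        return [0.0] * len(grad)
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def global_sparse_gradient(local_grads, k):
    """Average the sparsified gradients of all nodes.

    This stands in for the network exchange: only the k largest
    entries per node would actually be sent.
    """
    n = len(local_grads)
    sparse = [top_k_sparsify(g, k) for g in local_grads]
    return [sum(column) / n for column in zip(*sparse)]


def combine(global_sparse, local_grad, local_weight=1.0):
    """Hypothetical combination rule: global sparse gradient plus the
    node's own dense (uncompressed) gradient, scaled by local_weight."""
    return [gs + local_weight * lg for gs, lg in zip(global_sparse, local_grad)]


# Two simulated nodes with four parameters each.
node_grads = [
    [0.5, -0.1, 0.05, -2.0],
    [0.1, 0.3, -0.4, 1.0],
]
shared = global_sparse_gradient(node_grads, k=1)
update_node0 = combine(shared, node_grads[0])
print(shared)
print(update_node0)
```

With `k=1`, only the last coordinate is exchanged, so the shared gradient alone carries no information about the other three parameters. Adding the local dense gradient gives each node a nonzero signal in every coordinate, which is the intuition behind restoring gradient quality.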
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing |
| Place of Publication | Hong Kong |
| Publisher | Association for Computational Linguistics |
| Pages | 3624–3629 |
| Number of pages | 6 |
| ISBN (Print) | 978-1-950737-90-1 |
| DOIs | |
| Publication status | Published - 4 Nov 2019 |
| Event | 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Hong Kong, Hong Kong. Duration: 3 Nov 2019 → 7 Nov 2019. https://www.emnlp-ijcnlp2019.org/ |
Conference
| Conference | 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing |
|---|---|
| Abbreviated title | EMNLP-IJCNLP 2019 |
| Country/Territory | Hong Kong |
| City | Hong Kong |
| Period | 3/11/19 → 7/11/19 |
| Internet address | https://www.emnlp-ijcnlp2019.org/ |
Publication title: Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training