Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

Nikolay Bogoychev, Marcin Junczys-Dowmunt, Kenneth Heafield, Alham Aji

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

To extract the best possible performance from asynchronous stochastic gradient descent (SGD), one must increase the mini-batch size and scale the learning rate accordingly. To achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous SGD, which degrades model convergence. We introduce local optimizers, which mitigate the stale gradient problem, and together with fine-tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.
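The delayed-update idea described in the abstract amounts to gradient accumulation: a worker sums gradients over several mini-batches and applies them as one update, which multiplies the effective mini-batch size. The sketch below illustrates this on a toy least-squares problem. It is a minimal illustration of the idea only, not the paper's implementation; names such as sgd_with_delayed_updates, accumulation_steps, and compute_gradient are invented for this example.

```python
import numpy as np


def compute_gradient(params, batch):
    """Gradient of a least-squares loss on one mini-batch (x, y).
    Stands in for an NMT model's gradient; illustrative only."""
    x, y = batch
    return 2.0 * x.T @ (x @ params - y) / len(y)


def sgd_with_delayed_updates(params, batches, learning_rate=0.1,
                             accumulation_steps=4):
    """Accumulate gradients over `accumulation_steps` mini-batches,
    then apply a single update, effectively multiplying the batch size."""
    accumulated = np.zeros_like(params)
    for step, batch in enumerate(batches, start=1):
        accumulated += compute_gradient(params, batch)
        if step % accumulation_steps == 0:
            # One parameter update for the whole accumulated batch.
            params = params - learning_rate * accumulated / accumulation_steps
            accumulated = np.zeros_like(params)
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    batches = []
    for _ in range(32):
        x = rng.normal(size=(16, 2))
        y = x @ true_w + 0.01 * rng.normal(size=16)
        batches.append((x, y))
    w = sgd_with_delayed_updates(np.zeros(2), batches)
    print(w)  # moves toward [2.0, -1.0]
```

In an asynchronous setting, each accumulated update is pushed to shared parameters that other workers have meanwhile modified, which is where the stale-gradient problem the paper addresses with local optimizers arises.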
Original language: English
Title of host publication: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Place of publication: Brussels, Belgium
Publisher: Association for Computational Linguistics (ACL)
Pages: 2991-2996
Number of pages: 6
Publication status: Published - Nov 2018
Event: 2018 Conference on Empirical Methods in Natural Language Processing - Square Meeting Center, Brussels, Belgium
Duration: 31 Oct 2018 → 4 Nov 2018
http://emnlp2018.org/

Conference

Conference: 2018 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2018
Country/Territory: Belgium
City: Brussels
Period: 31/10/18 → 4/11/18
Internet address: http://emnlp2018.org/
