Making Asynchronous Stochastic Gradient Descent Work for Transformers

Alham Fikri Aji, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.
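The idea summarized in the abstract, summing several asynchronously computed gradients before applying them as one update, can be illustrated with a small simulation. The following is a minimal sketch only, assuming a toy quadratic objective, round-robin workers holding stale parameter copies, and hypothetical function names, worker counts, and accumulation sizes; it is not the paper's implementation or experimental setup.

import numpy as np


def simulate_accumulated_async_sgd(num_workers=4, accumulate=4, steps=400,
                                   lr=0.05, dim=8, seed=0):
    """Toy simulation: the shared parameters are only updated once the
    sum of `accumulate` asynchronously computed gradients has arrived."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)
    params = np.zeros(dim)

    def grad(x):
        # Gradient of the toy objective 0.5 * ||x - target||^2.
        return x - target

    # Each worker computes gradients against a stale copy of the parameters,
    # refreshed only after it has sent a gradient (mimicking communication delay).
    stale = [params.copy() for _ in range(num_workers)]
    buffer = np.zeros(dim)
    received = 0

    for step in range(steps):
        w = step % num_workers                 # round-robin stand-in for asynchronous arrival order
        buffer += grad(stale[w])               # sum the incoming update instead of applying it immediately
        received += 1
        if received == accumulate:             # apply the summed updates as a single step
            params = params - lr * buffer
            buffer[:] = 0.0
            received = 0
        stale[w] = params.copy()               # worker pulls the current parameters after sending

    return float(np.linalg.norm(params - target))


if __name__ == "__main__":
    print("distance to optimum:", simulate_accumulated_async_sgd())

In this sketch the accumulation buffer plays the role of the blurred line between asynchronous and synchronous updates: gradients still arrive one at a time against stale parameters, but the model only moves after several of them have been summed.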
Original language: English
Title of host publication: Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT 2019)
Place of Publication: Hong Kong
Publisher: Association for Computational Linguistics (ACL)
Pages: 80–89
Number of pages: 10
ISBN (Print): 978-1-950737-83-3
DOIs
Publication status: Published - 4 Nov 2019
Event: The 3rd Workshop on Neural Generation and Translation at EMNLP-IJCNLP 2019 - Hong Kong, Hong Kong
Duration: 4 Nov 2019 – 4 Nov 2019
https://sites.google.com/view/wngt19/home

Workshop

Workshop: The 3rd Workshop on Neural Generation and Translation
Abbreviated title: WNGT 2019
Country: Hong Kong
City: Hong Kong
Period: 4/11/19 – 4/11/19
Internet address: https://sites.google.com/view/wngt19/home

