Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
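The abstract says only that DS-Init lowers parameter variance at initialization. As a rough illustration, the sketch below assumes the scaling takes the form of a Glorot/Xavier uniform bound multiplied by alpha / sqrt(l), where l is the 1-indexed layer depth. The function names, the alpha default, and the pure-Python matrix representation are all hypothetical choices made for this example, not the authors' code.

```python
import math
import random


def ds_init_bound(fan_in, fan_out, layer_depth, alpha=1.0):
    """Return the uniform sampling bound for one weight matrix.

    Assumption: the bound is the Glorot/Xavier uniform bound
    sqrt(6 / (fan_in + fan_out)) scaled by alpha / sqrt(layer_depth),
    so deeper layers start with smaller parameter variance.
    """
    if layer_depth < 1:
        raise ValueError("layer_depth is 1-indexed and must be >= 1")
    xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
    return xavier_bound * alpha / math.sqrt(layer_depth)


def ds_init_matrix(fan_in, fan_out, layer_depth, alpha=1.0, rng=None):
    """Sample a fan_in x fan_out matrix (list of rows) uniformly in [-bound, bound]."""
    rng = rng or random.Random(0)  # fixed seed only to keep the example reproducible
    bound = ds_init_bound(fan_in, fan_out, layer_depth, alpha)
    return [
        [rng.uniform(-bound, bound) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


if __name__ == "__main__":
    # The bound shrinks as the layer index grows.
    for depth in (1, 4, 12):
        print(depth, ds_init_bound(512, 512, depth))
```

Under this assumed form, a layer at depth 4 samples from an interval half as wide as the first layer, so its parameter variance is one quarter of the first layer's.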
Original language: English
Title of host publication: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Publisher: Association for Computational Linguistics (ACL)
Pages: 898–909
Number of pages: 12
ISBN (Print): 978-1-950737-90-1
Publication status: Published - 4 Nov 2019
Event: 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing - Hong Kong, Hong Kong
Duration: 3 Nov 2019 – 7 Nov 2019
https://www.emnlp-ijcnlp2019.org/

Conference

Conference: 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Abbreviated title: EMNLP-IJCNLP 2019
Country: Hong Kong
City: Hong Kong
Period: 3/11/19 – 7/11/19
