Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Maximiliana Behnke, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training. Our experiments on machine translation show that it is possible to remove up to three-quarters of the attention heads from transformer-big during early training with an average -0.1 change in BLEU for Turkish→English. The pruned model is 1.5 times as fast at inference, albeit at the cost of longer training. Our method is complementary to other approaches, such as teacher-student, with an English→German student model gaining an additional 10% speed-up with 75% of its encoder attention removed and a 0.2 BLEU loss.
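
As a rough, hypothetical illustration of the general recipe summarized above (train briefly, score attention heads, prune the least useful ones, then continue training with the pruned structure fixed), the Python sketch below mocks head selection with a placeholder confidence proxy. The function names (head_importance, build_head_mask) and the scoring rule are assumptions made for illustration only, not the criterion or schedule used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def head_importance(attn_probs):
    # Placeholder importance proxy: mean of each head's maximum attention
    # probability. Heads with near-uniform ("unconfident") distributions
    # score low. Illustration only; not the paper's criterion.
    return attn_probs.max(axis=-1).mean(axis=-1)

def build_head_mask(importance, keep_fraction=0.25):
    # Keep the top keep_fraction of heads; at 0.25 this prunes three-quarters.
    n_keep = max(1, int(round(keep_fraction * importance.size)))
    keep = np.argsort(importance)[-n_keep:]
    mask = np.zeros(importance.shape, dtype=bool)
    mask[keep] = True
    return mask

# Toy example: 16 heads with fake attention distributions collected after a
# short warm-up phase of training.
n_heads, n_queries, n_keys = 16, 8, 8
attn = rng.dirichlet(np.full(n_keys, 0.5), size=(n_heads, n_queries))

mask = build_head_mask(head_importance(attn), keep_fraction=0.25)
print(f"kept {mask.sum()} of {n_heads} heads:", np.flatnonzero(mask))
# Training would then continue with the masked heads zeroed out, so they can
# be removed entirely for faster inference.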
Original language: English
Title of host publication: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Publisher: Association for Computational Linguistics (ACL)
Pages: 2664–2674
Number of pages: 11
ISBN (Print): 978-1-952148-60-6
Publication status: Published - 16 Nov 2020
Event: The 2020 Conference on Empirical Methods in Natural Language Processing - Online
Duration: 16 Nov 2020 – 20 Nov 2020
https://2020.emnlp.org/

Conference

Conference: The 2020 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2020
Period: 16/11/20 – 20/11/20
Internet address: https://2020.emnlp.org/
