Efficient Machine Translation with Model Pruning and Quantization

Maximiliana Behnke, Nikolay Bogoychev, Alham Fikri Aji, Kenneth Heafield, Graeme Nail, Qianqian Zhu, Svetlana Tchistiakova, Jelmer van der Linde, Pinzhen Chen, Sidharth Kashyap, Roman Grundkiewicz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware, under both throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU tracks, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers in tensor cores. Some of our submissions optimize for size via 4-bit log quantization and by omitting the lexical shortlist. We extended pruning to more parts of the network, emphasizing component- and block-level pruning, which, unlike coefficient-wise pruning, actually improves speed.
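To illustrate the 4-bit log quantization mentioned above: each weight is rounded to a signed power of two, so a 4-bit code can store one sign bit plus a small exponent, and multiplications reduce to shifts. The sketch below is a generic, minimal version of this idea, not the authors' exact scheme; the anchoring of the exponent range at the largest weight magnitude is an assumption for illustration.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Round weights to signed powers of two (log quantization).

    With 4 bits we assume 1 sign bit and 3 exponent bits, i.e. 8
    exponent levels, anchored at the largest weight magnitude.
    This is a hypothetical sketch, not the paper's implementation.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Anchor the representable exponent range at the maximum magnitude.
    max_exp = np.floor(np.log2(mag.max()))
    levels = 2 ** (bits - 1)  # 8 exponent levels for 4 bits
    # Round each log-magnitude to the nearest representable exponent.
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-12))),
                  max_exp - levels + 1, max_exp)
    return sign * 2.0 ** exp

w = np.array([0.74, -0.31, 0.052, -0.009])
print(log_quantize(w))  # → [ 0.5  -0.25  0.0625  -0.0078125]
```

Because every stored value is ±2^k, the dequantized matrix multiply can be implemented with bit shifts instead of full multiplications, which is what makes the format attractive for small, fast models.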
Original language: English
Title of host publication: Proceedings of the Sixth Conference on Machine Translation
Place of publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics
Number of pages: 6
ISBN (Print): 978-1-954085-94-7
Publication status: Published - 10 Nov 2021
Event: EMNLP 2021 Sixth Conference on Machine Translation (WMT) - Punta Cana, Dominican Republic
Duration: 10 Nov 2021 – 11 Nov 2021
Conference number: 6


Conference: EMNLP 2021 Sixth Conference on Machine Translation (WMT)
Abbreviated title: WMT 21
Country/Territory: Dominican Republic
City: Punta Cana


