Compressing Neural Machine Translation Models with 4-bit Precision

Alham Fikri Aji, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution


Quantization is one way to compress Neural Machine Translation (NMT) models, especially for edge devices. This paper pushes quantization from 8 bits, seen in current work on machine translation, to 4 bits. Instead of fixed-point quantization, we use logarithmic quantization since parameters are skewed towards zero. We then observe that quantizing the bias terms in this way damages quality, so we leave them uncompressed. Bias terms are a tiny fraction of the model, so the impact on the compression rate is minimal. Retraining is necessary to preserve quality, for which we propose an error-feedback mechanism that treats compression errors like noisy gradients. We empirically show that NMT models based on the Transformer or RNN architectures can be compressed down to 4-bit precision without any noticeable quality degradation. Models can be compressed down to binary precision, albeit with lower quality. The RNN architecture appears more robust to compression than the Transformer.
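The two ideas in the abstract can be illustrated together: logarithmic quantization snaps each weight to a signed power of two times a scale, and error feedback carries the resulting compression error into the next update as if it were gradient noise. The sketch below is a minimal NumPy illustration under assumed details (max-magnitude scaling, one sign bit plus exponent codes, and the specific function names); it is not the paper's exact implementation.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Quantize weights to signed powers of two (logarithmic quantization).

    One bit encodes the sign; the remaining bits index an exponent, so a
    4-bit code has 2**3 = 8 magnitude levels. Scaling by the maximum
    magnitude is an assumption for this sketch.
    """
    levels = 2 ** (bits - 1)
    sign = np.sign(w)                    # zero weights stay zero (sign == 0)
    scale = np.abs(w).max()
    # Nearest power-of-two exponent relative to the scale, clipped to range.
    exp = np.round(np.log2(np.maximum(np.abs(w) / scale, 1e-12)))
    exp = np.clip(exp, -(levels - 1), 0)
    return sign * scale * (2.0 ** exp)

def step_with_error_feedback(w, grad, residual, lr=1e-3, bits=4):
    """One retraining step: fold the previous compression error back in,
    compress, and carry the new error forward (a sketch of error feedback)."""
    updated = w - lr * grad + residual
    compressed = log_quantize(updated, bits)
    residual = updated - compressed      # error treated like noisy gradient
    return compressed, residual
```

For example, `log_quantize(np.array([0.5, -0.25, 0.1, 0.0]), bits=4)` maps 0.1 to 0.125 (the nearest power-of-two fraction of the scale 0.5) while exact powers of two pass through unchanged.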
Original language: English
Title of host publication: Proceedings of the Fourth Workshop on Neural Generation and Translation
Place of publication: Seattle
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 8
ISBN (Electronic): 978-1-952148-17-0
Publication status: Published - 10 Jul 2020
Event: The 4th Workshop on Neural Generation and Translation - Online workshop, Seattle, United States
Duration: 10 Jul 2020 - 10 Jul 2020


Workshop: The 4th Workshop on Neural Generation and Translation
Abbreviated title: WNGT 2020
Country/Territory: United States


