Scalable Modified Kneser-Ney Language Model Estimation

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, Philipp Koehn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enable the algorithm to scale to much larger models by using a fixed amount of RAM and a variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show an improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
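As background, the quantity such an estimator computes is the interpolated modified Kneser-Ney probability in Chen and Goodman's formulation; the LaTeX sketch below states that standard form and is not notation taken from this page.

% Interpolated modified Kneser-Ney (Chen and Goodman, 1998): a standard
% statement of the estimate, not KenLM's own notation.
p(w_n \mid w_1^{n-1}) =
  \frac{\max\{c(w_1^n) - D(c(w_1^n)),\, 0\}}{\sum_x c(w_1^{n-1} x)}
  + \gamma(w_1^{n-1})\, p(w_n \mid w_2^{n-1})

% Three discounts, selected by the size of the count being discounted and
% estimated in closed form from the count-of-counts n_k:
Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_1 = 1 - 2Y \frac{n_2}{n_1}, \quad
D_2 = 2 - 3Y \frac{n_3}{n_2}, \quad
D_{3+} = 3 - 4Y \frac{n_4}{n_3}

The backoff weight \gamma(w_1^{n-1}) collects the discounted mass so that each conditional distribution sums to one, and orders below the highest are estimated from adjusted (continuation) counts rather than raw counts.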
Original language: English
Title of host publication: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers
Pages: 690-696
Number of pages: 7
Publication status: Published - 2013
