Stream-based Translation Models for Statistical Machine Translation

Abby Levenberg, Chris Callison-Burch, Miles Osborne

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Typical statistical machine translation systems are trained with static parallel corpora. Here we account for scenarios with a continuous incoming stream of parallel training data. Such scenarios include daily governmental proceedings, sustained output from translation agencies, or crowd-sourced translations. We show that incorporating recent sentence pairs from the stream improves performance compared with a static baseline. Since frequent batch retraining is computationally demanding, we introduce a fast incremental alternative using an online version of the EM algorithm. To bound our memory requirements we use a novel data structure and associated training regime. When compared to frequent batch retraining, our online time- and space-bounded model achieves the same performance with significantly less computational overhead.
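The abstract's key idea, replacing full batch retraining with an online EM pass over newly arrived data, can be illustrated with a stepwise EM update: rather than recomputing expected counts over the whole corpus, old sufficient statistics are interpolated with expectations from each new observation. The sketch below is a toy two-component Bernoulli mixture standing in for the paper's word-alignment model; the model, parameter names, and stepsize schedule are illustrative assumptions, not the authors' actual system.

```python
def stepwise_em_stream(stream, alpha=0.7):
    """Stepwise online EM on a 2-component Bernoulli mixture.

    A toy stand-in for incrementally updating translation-model
    expectations as new sentence pairs arrive from a stream; the
    mixture model and stepsize schedule here are illustrative
    assumptions only.
    """
    # Sufficient statistics per component: [responsibility mass, weighted count of 1s].
    # Asymmetric initialization breaks the symmetry between components.
    s = [[0.5, 0.25], [0.5, 0.40]]
    pi, theta = [0.5, 0.5], [0.5, 0.8]
    for t, x in enumerate(stream):
        # E-step on the single new observation: posterior responsibilities.
        lik = [pi[k] * (theta[k] if x == 1 else 1.0 - theta[k]) for k in range(2)]
        z = sum(lik)
        r = [l / z for l in lik]
        # Stepwise M-step: interpolate old and new sufficient statistics,
        # with a decaying stepsize so early noisy estimates wash out.
        eta = (t + 2) ** -alpha
        for k in range(2):
            s[k][0] = (1.0 - eta) * s[k][0] + eta * r[k]
            s[k][1] = (1.0 - eta) * s[k][1] + eta * r[k] * x
        # Re-derive parameters from the running statistics.
        pi = [s[k][0] for k in range(2)]
        theta = [s[k][1] / s[k][0] for k in range(2)]
    return pi, theta
```

Because each update touches only the running statistics and one new observation, the cost per incoming item is constant, which is the property that makes frequent model refreshes cheap compared with batch retraining.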
Original language: English
Title of host publication: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Place of publication: Los Angeles, California
Publisher: Association for Computational Linguistics
Pages: 394-402
Number of pages: 9
ISBN (Print): 978-1-932432-65-7
Publication status: Published - 1 Jun 2010
