Translation of Unknown Words in Low Resource Languages

Biman Gujral, Huda Khayrallah, Philipp Koehn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We address the problem of unknown words, also known as out-of-vocabulary (OOV) words, in machine translation of low-resource languages. Our technique combines several methods, each inspired by a commonly observed type of OOV. We also design evaluation techniques for measuring the OOV coverage achieved, and we integrate the new translation candidates into a Statistical Machine Translation (SMT) system. Experimental results on Hindi and Uzbek show that our system achieves good coverage of OOV words: our methods produced correct candidates for 50% of Hindi OOVs and 30% of Uzbek OOVs, in scenarios that have 1 and 3 OOVs per sentence. This offers potential for improving translation quality for languages that have limited parallel data available for training.
Original language: English
Title of host publication: Proceedings of AMTA 2016, vol. 1: MT Researchers’ Track
Place of Publication: Austin, Texas, United States
Publisher: Association for Machine Translation in the Americas, AMTA
Pages: 163-176
Number of pages: 14
Publication status: Published - 1 Nov 2016
Event: Twelfth Conference of The Association for Machine Translation in the Americas - Austin, United States
Duration: 28 Oct 2016 to 1 Nov 2016
http://www.amta2016.org/
https://amtaweb.org/amta-2016-proceedings-are-available/

Conference

Conference: Twelfth Conference of The Association for Machine Translation in the Americas
Abbreviated title: AMTA 2016
Country/Territory: United States
City: Austin
Period: 28/10/16 to 1/11/16
