Abstract / Description of output
We address the problem of unknown words, also known as out-of-vocabulary (OOV) words, in machine translation of low-resource languages. Our technique combines several methods, each inspired by the common OOV types observed. We also design evaluation techniques for measuring the OOV coverage achieved and integrate the new translation candidates into a Statistical Machine Translation (SMT) system. Experimental results on Hindi and Uzbek show that our system achieves good coverage of OOV words. We show that our methods produced correct candidates for 50% of Hindi OOVs and 30% of Uzbek OOVs, in scenarios that have 1 and 3 OOVs per sentence. This offers potential for improving translation quality for languages that have limited parallel data available for training.
Original language | English |
---|---|
Title of host publication | Proceedings of AMTA 2016, vol. 1: MT Researchers’ Track |
Place of Publication | Austin, Texas, United States |
Publisher | Association for Machine Translation in the Americas, AMTA |
Pages | 163-176 |
Number of pages | 14 |
Publication status | Published - 1 Nov 2016 |
Event | Twelfth Conference of The Association for Machine Translation in the Americas - Austin, United States; Duration: 28 Oct 2016 → 1 Nov 2016; http://www.amta2016.org/; https://amtaweb.org/amta-2016-proceedings-are-available/ |
Conference
Conference | Twelfth Conference of The Association for Machine Translation in the Americas |
---|---|
Abbreviated title | AMTA 2016 |
Country/Territory | United States |
City | Austin |
Period | 28/10/16 → 1/11/16 |