In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particularly severe. In addition, much of the morphological variation seen in Czech words is not reflected in either the morphology or syntax of a language like English. In this work, we show that using morphological analysis to modify the Czech input can improve a Czech-English machine translation system. We investigate several different methods of incorporating morphological information, and show that a system that combines these methods yields the best results. Our final system achieves a BLEU score of 0.333, compared with 0.270 for the baseline word-to-word system.
|Title of host publication|Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing|
|Place of Publication|Vancouver, British Columbia, Canada|
|Publisher|Association for Computational Linguistics|
|Number of pages|8|
|Publication status|Published - 1 Oct 2005|