A hybrid approach to statistical machine translation between standard and dialectal varieties

Friedrich Neubarth, Barry Haddow, Adolfo Hernandez, Harald Trost

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Statistical machine translation (SMT) for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with little training data by exploiting the relative proximity between the two varieties. In this paper, we describe a specific problem, and its solution, that arises in translation between standard Austrian German and Viennese dialect. A phrase-based approach to SMT cannot deal with complex lexical transformations and syntactic reordering. These are typical cases where rule-based preprocessing of the source data is the preferable option, hence the hybrid character of the resulting system. One such case is the transformation of imperfect verb forms into perfect tense, which involves detection of clause boundaries and identification of clause type. We present an approach that utilizes a full parse of the source sentences and discuss the problems that arise with such an approach. Within the developed SMT system, the models trained on preprocessed data unsurprisingly fare better than those trained on the original data, but unchanged sentences also gain slightly better scores. This shows that including a rule-based layer dealing with systematic non-local transformations increases the overall performance of the system, most probably due to a higher accuracy in the alignment.
Original language: English
Title of host publication: Proceedings of the Sixth Language and Technology Conference
Number of pages: 5
Publication status: Published - 2013
