Abstract
Statistical machine translation (SMT) for dialectal varieties usually suffers from data sparsity, but combining word-level and
character-level models can yield good results even with small training data by exploiting the relative proximity between the two
varieties. In this paper, we describe a specific problem that arises in translation between standard Austrian German
and Viennese dialect, together with its solution. A phrase-based approach to SMT cannot deal with complex lexical transformations and syntactic
reordering. These are typical cases where rule-based preprocessing of the source data is the preferable option, hence the hybrid character of
the resulting system. One such case is the transformation of imperfect verb forms into the perfect tense, which involves detection of
clause boundaries and identification of clause type. We present an approach that utilizes a full parse of the source sentences and
discuss the problems that arise with such an approach. Within the developed SMT system, the models trained on preprocessed data
unsurprisingly fare better than those trained on the original data, but unchanged sentences also gain slightly better scores. This shows
that including a rule-based layer dealing with systematic non-local transformations increases the overall performance of the system,
most probably due to a higher accuracy in the alignment.
Original language | English |
---|---|
Title of host publication | Proceedings of the Sixth Language and Technology Conference |
Number of pages | 5 |
Publication status | Published - 2013 |