We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.
|Title of host publication||Proceedings for The Ninth Conference of the Association for Machine Translation in the Americas|
|Publisher||Association for Machine Translation in the Americas, AMTA|
|Number of pages||9|
|Publication status||Published - 2010|