Automatic evaluation metrics should correlate with human judgement. We collected sixteen ASR-mediated dialogues using a map task scenario. The material was assessed extrinsically (i.e. in context) through measures such as time to task completion, and intrinsically (i.e. out of context) using the word error rate and several variants of it that are computed over smaller units. Extrinsic and intrinsic results did not correlate, either for word error rate or for the metrics based on characters, syllables or phonemes.
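For readers unfamiliar with the intrinsic metrics named in the abstract, the following is a minimal sketch of how word error rate and a character-based variant are commonly computed: the Levenshtein (edit) distance between reference and hypothesis, divided by the number of units in the reference. The function names and the example utterance are illustrative assumptions, not taken from the paper, and the exact variants the authors used may be defined differently. Syllable- and phoneme-based variants would need a syllabifier or pronunciation lexicon to segment the text and are not shown.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of units."""
    # prev[j] holds the distance between the first i-1 reference units
    # and the first j hypothesis units.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence[str], hyp_units: Sequence[str]) -> float:
    """Edit distance normalised by the length of the reference."""
    if not ref_units:
        raise ValueError("reference must contain at least one unit")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Error rate over whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Error rate over single characters (spaces included)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical utterance pair, invented for illustration only.
    reference = "turn left at the old mill"
    hypothesis = "turn left at the oil mill"
    print("WER:", word_error_rate(reference, hypothesis))
    print("CER:", character_error_rate(reference, hypothesis))
```

In this invented example one of six words is wrong, but only two of twenty-five characters differ, so the same recognition error weighs much less at the character level than at the word level.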
|Title of host publication||Advances in Natural Language Processing|
|Editors||Adam Przepiórkowski, Maciej Ogrodniczuk|
|Number of pages||8|
|Publication status||Published - 2014|
|Name||Lecture Notes in Computer Science|