Abstract
Semitic languages pose a problem for Natural Language Processing because most vowels are omitted from written prose, resulting in considerable ambiguity at the word level. However, while reading text, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Previous methods for vowel restoration that involve morphological analysis concentrated on a single language and relied on a parsed corpus, which is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using publicly available versions of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
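The abstract does not spell out the model, so the following is only a minimal sketch of how an HMM could be applied to vowel restoration, not the authors' implementation. It assumes that hidden states are vocalized words, that observations are their vowel-stripped forms, that transitions are add-alpha smoothed bigram counts, and that a state emits its own stripped form with probability 1. The function names (`strip_vowels`, `train`, `log_prob`, `restore`) and the Latin-alphabet toy corpus are hypothetical and stand in for Hebrew or Arabic text with diacritics.

```python
import math
from collections import defaultdict

def strip_vowels(word, vowels="aeiou"):
    # Toy stand-in for removing diacritics from a vocalized word.
    return "".join(ch for ch in word if ch not in vowels)

def train(sentences):
    """Count bigrams over vocalized words and index words by stripped form."""
    bigrams = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for sent in sentences:
        prev = "<s>"
        for w in sent:
            bigrams[prev][w] += 1
            candidates[strip_vowels(w)].add(w)
            prev = w
    return bigrams, candidates

def log_prob(bigrams, prev, w, vocab_size, alpha=1.0):
    # Add-alpha smoothed bigram transition probability (an assumed choice).
    row = bigrams.get(prev, {})
    total = sum(row.values())
    return math.log((row.get(w, 0) + alpha) / (total + alpha * vocab_size))

def restore(observed, bigrams, candidates):
    """Viterbi search over vocalized candidates of each observed word."""
    vocab_size = sum(len(c) for c in candidates.values())
    best = {"<s>": (0.0, [])}  # state -> (log score, best path so far)
    for obs in observed:
        # Unseen forms are passed through unchanged.
        states = candidates.get(obs, {obs})
        new_best = {}
        for s in sorted(states):
            score, path = max(
                ((p_score + log_prob(bigrams, prev, s, vocab_size), p_path)
                 for prev, (p_score, p_path) in best.items()),
                key=lambda t: t[0],
            )
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

corpus = [["the", "cat", "sat"], ["the", "cat", "sat"], ["my", "cot", "broke"]]
bigrams, candidates = train(corpus)
print(restore(["th", "ct", "st"], bigrams, candidates))   # ['the', 'cat', 'sat']
print(restore(["my", "ct", "brk"], bigrams, candidates))  # ['my', 'cot', 'broke']
```

In this toy corpus the stripped form `ct` is ambiguous between `cat` and `cot`, and the preceding word decides which one the decoder picks, which mirrors the role of context described in the abstract.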
| Original language | English |
|---|---|
| Title of host publication | SEMITIC '02 Proceedings of the ACL-02 workshop on Computational approaches to semitic languages |
| Place of Publication | Philadelphia, Pennsylvania, USA |
| Publisher | ACM |
| Pages | 1-7 |
| Number of pages | 7 |
| Publication status | Published - 11 Jul 2002 |
| Event | ACL-02 workshop on Computational approaches to semitic languages, Philadelphia, United States. Duration: 11 Jul 2002 → 11 Jul 2002 |
Conference
| Conference | ACL-02 workshop on Computational approaches to semitic languages |
|---|---|
| Abbreviated title | SEMITIC '02 |
| Country/Territory | United States |
| City | Philadelphia |
| Period | 11/07/02 → 11/07/02 |
| Paper title | An HMM approach to vowel restoration in Arabic and Hebrew |