An HMM approach to vowel restoration in Arabic and Hebrew

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Semitic languages pose a problem for Natural Language Processing since most of the vowels are omitted from written prose, resulting in considerable ambiguity at the word level. However, while reading text, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Methods for vowel restoration in previous work involving morphological analysis concentrated on a single language and relied on a parsed corpus that is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using a publicly available version of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
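
The abstract does not spell out the model's details, so the following is only a minimal sketch of how an HMM could be applied to vowel restoration, not the authors' implementation. It assumes a bigram model in which the hidden states are vocalized words and each observation is the word with its vowels stripped. Latin vowels stand in for Arabic or Hebrew diacritics, and the toy corpus, the add-alpha smoothing, and all function names (`strip_vowels`, `train`, `restore`) are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: Latin vowels as a toy stand-in for diacritics


def strip_vowels(word):
    """Map a vocalized word to its observed, unvocalized skeleton."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(sentences):
    """Count bigrams over vocalized words and index words by skeleton."""
    bigrams = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word in sentence:
            bigrams[prev][word] += 1
            candidates[strip_vowels(word)].add(word)
            prev = word
    return bigrams, candidates


def log_prob(bigrams, prev, word, vocab_size, alpha=0.1):
    """Add-alpha smoothed log P(word | prev); the smoothing is an assumption."""
    row = bigrams.get(prev, {})
    total = sum(row.values())
    return math.log((row.get(word, 0) + alpha) / (total + alpha * vocab_size))


def restore(skeletons, bigrams, candidates):
    """Viterbi search for the most likely vocalized sequence.

    Each state emits its own skeleton with probability 1, so only the
    candidates that match the observed skeleton need to be scored.
    """
    vocab_size = max(1, sum(len(words) for words in candidates.values()))
    best = {"<s>": (0.0, [])}  # state -> (log score, best path ending there)
    for skeleton in skeletons:
        # An unseen skeleton is passed through unvocalized.
        states = candidates.get(skeleton, {skeleton})
        new_best = {}
        for state in sorted(states):
            score, path = max(
                ((s + log_prob(bigrams, prev, state, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy corpus: "cat", "cot", and "cut" share the skeleton "ct".
TOY_CORPUS = [
    ["the", "cat", "sat"],
    ["the", "cot", "broke"],
    ["his", "cut", "hurt"],
    ["the", "cat", "ran"],
]
bigrams, candidates = train(TOY_CORPUS)
print(restore(["th", "ct", "st"], bigrams, candidates))
print(restore(["th", "ct", "brk"], bigrams, candidates))
```

In this toy setup the ambiguous skeleton `ct` is resolved by its neighbours: the following word can change which vocalization of `ct` scores highest, which is the kind of contextual disambiguation the abstract attributes to native readers.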
Original language: English
Title of host publication: SEMITIC '02 Proceedings of the ACL-02 workshop on Computational approaches to semitic languages
Place of Publication: Philadelphia, Pennsylvania, USA
Publisher: ACM
Pages: 1-7
Number of pages: 7
Publication status: Published - 11 Jul 2002
Event: ACL-02 workshop on Computational approaches to semitic languages - Philadelphia, United States
Duration: 11 Jul 2002 → 11 Jul 2002

Conference

Conference: ACL-02 workshop on Computational approaches to semitic languages
Abbreviated title: SEMITIC '02
Country/Territory: United States
City: Philadelphia
Period: 11/07/02 → 11/07/02
