Letter-to-Sound Pronunciation Prediction Using Conditional Random Fields

Dong Wang, Simon King

Research output: Contribution to journal › Article › peer-review

Abstract

Pronunciation prediction, or letter-to-sound (LTS) conversion, is an essential task for speech synthesis, open-vocabulary spoken term detection and other applications dealing with novel words. Most current approaches (at least for English) employ data-driven methods to learn and represent pronunciation "rules" using statistical models such as decision trees, hidden Markov models (HMMs) or joint-multigram models (JMMs). The LTS task remains challenging, particularly for languages with a complex relationship between spelling and pronunciation, such as English. In this paper, we propose to use a conditional random field (CRF) to perform LTS because it avoids having to model a distribution over observations and can perform global inference, suggesting that it may be more suitable for LTS than decision trees, HMMs or JMMs. One challenge in applying CRFs to LTS is that the phoneme and grapheme sequences of a word are generally of different lengths, which makes CRF training difficult. To solve this problem, we employed a joint-multigram model to generate aligned training exemplars. Experiments conducted with the AMI05 dictionary demonstrate that a CRF significantly outperforms other models, especially if n-best lists of predictions are generated.
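The key preprocessing step described in the abstract is to align grapheme and phoneme sequences so that each word becomes an equal-length sequence of paired chunks, at which point LTS reduces to ordinary sequence labeling. The toy sketch below illustrates this framing only: the alignments are hypothetical, and a simple per-chunk majority-vote predictor stands in for the paper's CRF (which would additionally model label dependencies and perform global inference).

```python
from collections import Counter, defaultdict

# Toy aligned exemplars, of the kind a joint-multigram aligner might produce:
# each word is a sequence of (grapheme chunk, phoneme chunk) pairs, so the
# grapheme and phoneme sequences now have equal length.
# (These alignments and phoneme symbols are illustrative, not from the paper.)
aligned_exemplars = [
    [("ph", "f"), ("o", "ow"), ("ne", "n")],   # "phone"
    [("c", "k"), ("a", "ae"), ("t", "t")],     # "cat"
    [("c", "k"), ("o", "aa"), ("t", "t")],     # "cot"
]

# Once aligned, LTS is sequence labeling: predict one phoneme-chunk label per
# grapheme chunk. Count label frequencies per grapheme chunk.
counts = defaultdict(Counter)
for word in aligned_exemplars:
    for g, p in word:
        counts[g][p] += 1

def predict(grapheme_chunks):
    """Majority-vote phoneme chunk per grapheme chunk (unigram stand-in for a CRF)."""
    return [counts[g].most_common(1)[0][0] if g in counts else "?"
            for g in grapheme_chunks]

print(predict(["c", "a", "t"]))  # → ['k', 'ae', 't']
```

A real CRF replaces the independent per-chunk decision with globally normalized inference over the whole label sequence, which is the property the abstract argues makes it preferable to decision trees, HMMs or JMMs.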
Original language: English
Pages (from-to): 122-125
Number of pages: 4
Journal: IEEE Signal Processing Letters
Volume: 18
Issue number: 2
DOIs
Publication status: Published - 1 Feb 2011
