Term-dependent Confidence Normalization for Out-of-Vocabulary Spoken Term Detection

Dong Wang, Javier Tejedor, Simon King, Joe Frankel

Research output: Contribution to journal › Article › peer-review


Spoken Term Detection (STD) is a fundamental component of spoken information retrieval systems. A key task of an STD system is to accept reliable detections and reject false alarms on the basis of confidence measures. The detection posterior probability, usually computed from lattices, is a widely used confidence measure. A potential problem with this measure is that the confidence scores of detections of all search terms are treated uniformly, regardless of how much the terms differ in phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms, which tend to exhibit high intra-term diversity. Because the same confidence score may convey different confidence levels for different terms, a term-dependent decision strategy is desirable, for example the term-specific threshold (TST) approach. In this work, we propose a term-dependent normalisation technique that compensates for term diversity in confidence estimation. In particular, we propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measurement and from which the TST approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems, one based on phonemes and one based on words. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.
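As an illustration of what a term-dependent decision can look like, the sketch below computes one threshold per search term instead of applying a single global threshold. It is not the normalisation technique proposed in the article. It assumes the commonly used threshold derived from the NIST term-weighted value, in which the expected number of true occurrences of a term is estimated by summing the posteriors of its detections; the function names, the `beta` value of 999.9, and the one-trial-per-second convention are assumptions that do not come from the abstract.

```python
from collections import defaultdict


def term_specific_thresholds(detections, speech_seconds, beta=999.9):
    """Return one decision threshold per term (illustrative sketch only).

    detections: iterable of (term, posterior) pairs, posterior in [0, 1].
    speech_seconds: duration of the searched audio (assumed trial count).
    beta: false-alarm weight (assumed value, not taken from the abstract).
    """
    # Estimate the expected number of true hits per term by summing posteriors.
    expected_true = defaultdict(float)
    for term, posterior in detections:
        expected_true[term] += posterior

    # Accept a detection when its expected gain as a hit outweighs its
    # expected cost as a false alarm; solving for the posterior gives:
    return {
        term: beta * n_true / (speech_seconds + (beta - 1.0) * n_true)
        for term, n_true in expected_true.items()
    }


def decide(detections, thresholds):
    """Label each detection True (accept) or False (reject) per term."""
    return [(term, p, p >= thresholds[term]) for term, p in detections]


# Hypothetical detections: a frequent-looking term and a rare-looking one.
detections = [("meeting", 0.5), ("meeting", 0.5), ("frankel", 0.1)]
thresholds = term_specific_thresholds(detections, speech_seconds=3600.0)
decisions = decide(detections, thresholds)
```

Under these assumptions, a term with a small expected count receives a lower threshold than a term with a large one, so the same posterior score can lead to different decisions for different terms. The bias that the abstract attributes to lattice-based confidence would enter through the summed posteriors used as the count estimate.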
Original language: English
Pages (from-to): 358-375
Number of pages: 17
Journal: Journal of Computer Science and Technology
Issue number: 2
Publication status: Published - Mar 2012
