Using Adaptation to Improve Speech Transcription Alignment in Noisy and Reverberant Environments

Yoshitaka Mamiya, Adriana Stan, Junichi Yamagishi, Peter Bell, Oliver Watts, Robert Clark, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data in order to increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8-talker babble noise and white noise, each in various combinations and SNRs. Results show that the MAP-based segmentation's performance is strongly influenced by the noise type, as well as the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the aligned data percentage for the majority of the studied scenarios.

Index Terms: speech alignment, speech segmentation, adaptive training, CMLLR, MAP, VAD
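The CMLLR adaptation described in the abstract estimates an affine feature-space transform x' = Ax + b that maximises the likelihood of the (possibly noisy or reverberant) features under the existing acoustic model. The sketch below illustrates the idea in a deliberately simplified setting: a diagonal transform estimated against a single diagonal-Gaussian target, for which a closed-form solution exists. Real CMLLR estimates a full affine matrix iteratively against all Gaussians of an HMM-GMM system (e.g. via HTK's HAdapt); all names and statistics here are illustrative, not from the paper.

```python
import numpy as np

def estimate_diag_cmllr(feats, model_mean, model_var):
    """Estimate a diagonal feature-space (CMLLR-style) transform x' = a*x + b.

    Toy case: for a single diagonal-Gaussian target the ML solution is
    closed-form -- scale each dimension so its variance matches the
    model's, then shift the mean. Full CMLLR solves for a dense affine
    transform row by row over all model Gaussians.
    """
    a = np.sqrt(model_var / feats.var(axis=0))
    b = model_mean - a * feats.mean(axis=0)
    return a, b

def apply_transform(feats, a, b):
    # Affine map applied per frame (rows) and per dimension (columns).
    return a * feats + b

# Hypothetical demo: 13-dim features with a mean/variance mismatch,
# standing in for noisy or reverberant recordings.
rng = np.random.default_rng(0)
model_mean, model_var = np.zeros(13), np.ones(13)       # acoustic-model stats
noisy = rng.normal(loc=2.0, scale=3.0, size=(500, 13))  # mismatched features

a, b = estimate_diag_cmllr(noisy, model_mean, model_var)
adapted = apply_transform(noisy, a, b)
# adapted now matches the model's per-dimension mean and variance,
# so the unadapted acoustic model scores it more confidently.
```

The attraction of a feature-space transform like this is that the acoustic model itself stays untouched: the same grapheme HMMs can be reused across recording conditions, with only one small transform estimated per session.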
Original language: English
Title of host publication: Proc. 8th ISCA Speech Synthesis Workshop
Number of pages: 6
Publication status: Published - Aug 2013


  • adaptive training, CMLLR, MAP, speech alignment, speech segmentation, VAD

