Asynchronous factorisation of speaker and background with feature transforms in speech recognition

O. Saz, T. Hain

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper presents a novel approach to separating the effects of speaker and background conditions by applying feature-transform-based adaptation for Automatic Speech Recognition (ASR). So far, factorisation has been shown to yield improvements in the case of utterance-synchronous environments. In this paper we show successful separation of conditions that are asynchronous with speech, such as background music. Our work takes account of the asynchronous nature of the background by estimating condition-specific Constrained Maximum Likelihood Linear Regression (CMLLR) transforms. In addition, speaker adaptation is performed, allowing speaker and background effects to be factorised. Equally, background transforms are used asynchronously in the decoding process, using a modified Hidden Markov Model (HMM) topology which applies the optimal transform for each frame. Experimental results are presented on the WSJCAM0 corpus of British English speech, modified to contain controlled sections of background music. This addition of music degrades the baseline Word Error Rate (WER) from 10.1% to 26.4%. While synchronous factorisation with CMLLR transforms provides a 28% relative improvement in WER over the baseline, our asynchronous approach increases this reduction to 33%.
Original language: English
Title of host publication: Proceedings of Interspeech 2013
Publisher: ISCA
Publication status: Published - 1 Aug 2013
