Speaker-aware training of LSTM-RNNs for acoustic modelling

Tian Tan, Yanmin Qian, Dong Yu, Souvik Kundu, Liang Lu, Xiong Xiao, Khe Chai SIM, Yu Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forword neural networks (DNNs) in acoustic modelling. However, speaker adaption for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study the LSTM-RNN speaker-aware training that incorporates the speaker information during model training to normalise the speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: I-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorize the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate the speaker-aware and phone-aware joint training under the framework of multi-task learning. In AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline, which uses FMLLR features.
Original languageEnglish
Title of host publication2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Number of pages5
ISBN (Print)978-1-4799-9988-0
Publication statusPublished - Mar 2016
Event41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - China, Shanghai, China
Duration: 20 Mar 201625 Mar 2016


Conference41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Abbreviated titleICASSP 2016
Internet address


Dive into the research topics of 'Speaker-aware training of LSTM-RNNs for acoustic modelling'. Together they form a unique fingerprint.

Cite this