Edinburgh Research Explorer

Embeddings for DNN speaker adaptive training

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access permissions: Open

Documents: https://ieeexplore.ieee.org/document/9004028
Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 479-486
Number of pages: 8
ISBN (Electronic): 978-1-7281-0306-8
ISBN (Print): 978-1-7281-0307-5
DOIs
Publication status: Published - 20 Feb 2020
Event: IEEE Automatic Speech Recognition and Understanding Workshop 2019 - Sentosa, Singapore
Duration: 14 Dec 2019 - 18 Dec 2019
http://asru2019.org/wp/

Conference

Conference: IEEE Automatic Speech Recognition and Understanding Workshop 2019
Abbreviated title: ASRU 2019
Country: Singapore
City: Sentosa
Period: 14/12/19 - 18/12/19
Internet address: http://asru2019.org/wp/

Abstract

In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT), focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations, and find that with a good training strategy, a multi-layer adaptation network applied to all hidden layers is no more effective than a single linear layer acting on the embeddings to transform the input features. In the second part of our work, we evaluate different embeddings (i-vectors, x-vectors and deep CNN embeddings) in an additional speaker recognition task in order to gain insight into what should characterize an embedding for DNN-SAT. We find that the speaker recognition performance of a given representation is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than speaker identity alone was the most important characteristic of the embeddings for effective DNN-SAT ASR. Our best models achieved relative WER gains of 4% and 9% over a DNN baseline using speaker-level cepstral mean normalisation (CMN) and over a fully speaker-independent model, respectively.
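
A minimal sketch of the DNN-SAT idea described in the abstract, assuming a PyTorch setup: a single linear layer maps a speaker embedding (e.g. an i-vector or x-vector) to a shift applied to the input features, while the acoustic model weights stay shared across speakers. The class and dimension names (AdaptedAcousticModel, feat_dim, emb_dim) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptedAcousticModel(nn.Module):
    """Shared acoustic model preceded by an embedding-driven feature transform."""
    def __init__(self, feat_dim=40, emb_dim=100, hidden_dim=512, n_targets=2000):
        super().__init__()
        # Adaptation network: one linear layer from the speaker embedding to a
        # per-speaker shift of the input features.
        self.adapt = nn.Linear(emb_dim, feat_dim)
        # Speaker-independent (shared) part of the DNN.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_targets),
        )

    def forward(self, feats, spk_emb):
        # feats: (batch, frames, feat_dim); spk_emb: (batch, emb_dim)
        shift = self.adapt(spk_emb).unsqueeze(1)   # (batch, 1, feat_dim)
        return self.shared(feats + shift)          # logits over acoustic targets

# Toy usage: random tensors stand in for filterbank features and embeddings.
model = AdaptedAcousticModel()
feats = torch.randn(4, 200, 40)     # 4 utterances, 200 frames, 40-dim features
spk_emb = torch.randn(4, 100)       # one 100-dim embedding per utterance
logits = model(feats, spk_emb)      # shape: (4, 200, 2000)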

Research areas

  • speaker embeddings, utterance summary vectors, speaker adaptive training

ID: 118997013