Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes

Shinji Takaki, Yoshikazu Nishimura, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A new speaker-adaptation technique for deep neural network (DNN)-based speech synthesis – which requires only speech data without orthographic transcriptions – is proposed. This technique is based on a DNN-based speech-synthesis model that takes speaker, gender, and age into consideration as additional inputs and outputs acoustic parameters of corresponding voices from text in order to construct a multi-speaker model and perform speaker adaptation. It uses a new input code that represents acoustic similarity to each of the training speakers as a probability. The new input code, called a “speaker-similarity vector,” is obtained by concatenating posterior probabilities calculated from each model of the training speakers. GMM-UBM or i-vector/PLDA, which are widely used in text-independent speaker verification, are used to represent the speaker models, since they can be used without text information. Text and the speaker-similarity vectors of the training speakers are used as input to first train a multi-speaker speech-synthesis model, which outputs acoustic parameters of the training speakers. A new speaker-similarity vector is then estimated by using a small amount of speech data uttered by an unknown target speaker on the basis of the separately trained speaker models. It is expected that inputting the estimated speaker-similarity vector into the multi-speaker speech-synthesis model can generate synthetic speech that resembles the target speaker’s voice. In objective and subjective experiments, adaptation performance of the proposed technique was evaluated using not only studio-quality adaptation data but also low-quality (i.e., noisy and reverberant) data. The results of the experiments indicate that the proposed technique makes it possible to rapidly construct a voice for the target speaker in DNN-based speech synthesis.
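The core of the speaker-similarity vector is a softmax-style normalization of per-speaker scores into posterior probabilities, which are then concatenated into an input code for the synthesis DNN. Below is a minimal, hypothetical sketch of that step, assuming the per-speaker scores (e.g., average frame log-likelihoods under each training speaker's GMM, or PLDA scores) have already been computed; the function name and score values are illustrative, not from the paper.

```python
import numpy as np

def speaker_similarity_vector(loglikes):
    """Turn per-training-speaker log-likelihood scores of an adaptation
    utterance into a posterior-probability input code.

    Assumes a uniform prior over training speakers; the softmax is
    computed with a max-shift for numerical stability.
    """
    loglikes = np.asarray(loglikes, dtype=float)
    shifted = loglikes - loglikes.max()   # avoid overflow in exp()
    post = np.exp(shifted)
    return post / post.sum()              # posteriors sum to 1

# Hypothetical scores for three training speakers; speaker 2 is the
# closest acoustic match, so its posterior dominates the code.
code = speaker_similarity_vector([-41.2, -39.8, -44.5])
```

The resulting vector is fed to the multi-speaker synthesis model alongside the linguistic features derived from text; for an unseen target speaker, the same computation on a small amount of untranscribed speech yields the adapted input code.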
Original language: English
Title of host publication: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018
Place of publication: Honolulu, Hawaii, USA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 649-658
Number of pages: 10
ISBN (Electronic): 978-9-8814-7685-2, 978-988-14768-6-9
ISBN (Print): 978-1-7281-0243-6
Publication status: Published - 7 Mar 2019
Event: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018 - Honolulu, United States
Duration: 12 Nov 2018 – 15 Nov 2018
https://apsipa2018.org/

Publication series

Name
Publisher: IEEE
ISSN (Print): 2640-009X
ISSN (Electronic): 2640-0103

Conference

Conference: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018
Abbreviated title: APSIPA ASC 2018
Country/Territory: United States
City: Honolulu
Period: 12/11/18 – 15/11/18
