Edinburgh Research Explorer

Adapting and Controlling DNN-Based Speech Synthesis Using Input Codes

Research output: Chapter in Book/Report/Conference proceeding: Conference contribution

Open Access permissions: Open

Documents: http://ieeexplore.ieee.org/document/7953089/
Original language: English
Title of host publication: The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2017
Publisher: IEEE
Pages: 1905-1909
Number of pages: 5
ISBN (Electronic): 978-1-5090-4117-6
State: Published - 19 Jun 2017
Event: 42nd IEEE International Conference on Acoustics, Speech and Signal Processing - New Orleans, United States
Duration: 5 Mar 2017 to 9 Mar 2017
http://www.ieee-icassp2017.org/

Conference

Conference: 42nd IEEE International Conference on Acoustics, Speech and Signal Processing
Abbreviated title: ICASSP 2017
Country: United States
City: New Orleans
Period: 5/03/17 to 9/03/17

Abstract

Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that, in parallel to conventional text input, also take speaker, gender, and age codes as inputs in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders, with ages ranging from their teens to their eighties, we performed three experiments: 1) we used a subset of speakers to construct a DNN-based multi-speaker acoustic model with speaker codes; 2) we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material; and 3) we manually manipulated input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
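The abstract describes three uses of an input code vector. The following is a minimal NumPy sketch of those ideas. Several assumptions are ours and not the paper's: the network is a tiny one-hidden-layer model whose random weights stand in for a trained multi-speaker acoustic model, the feature dimensions are made up, and the data is synthetic. The sketch shows three things. First, a code vector is fed alongside the linguistic features of every frame. Second, a new speaker's code is estimated by backpropagation while the network weights stay frozen. Third, a code is altered by linear blending with another code. This blending is only a hypothetical illustration of manipulation, and the paper's actual gender and age encodings are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
LING_DIM, CODE_DIM, HIDDEN, ACOUSTIC_DIM = 20, 8, 32, 10

# Random stand-in for a *trained* multi-speaker acoustic model (assumption).
# Splitting the first layer into W1x and W1c is equivalent to applying one
# weight matrix to the concatenated input [linguistic features; code].
W1x = rng.normal(scale=0.3, size=(HIDDEN, LING_DIM))
W1c = rng.normal(scale=0.3, size=(HIDDEN, CODE_DIM))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.3, size=(ACOUSTIC_DIM, HIDDEN))
b2 = np.zeros(ACOUSTIC_DIM)


def forward(X, code):
    """Map linguistic features X (N x LING_DIM) to acoustic frames, with one
    code vector (CODE_DIM,) shared by every frame of the utterance."""
    H = np.tanh(X @ W1x.T + code @ W1c.T + b1)
    return H @ W2.T + b2, H


def loss(X, Y, code):
    """Mean (over frames) of half the squared error between prediction and target."""
    Yhat, _ = forward(X, code)
    return 0.5 * np.mean(np.sum((Yhat - Y) ** 2, axis=1))


def estimate_code(X, Y, steps=500, lr=0.1):
    """Speaker adaptation: weights stay frozen; only the code vector is
    updated by plain gradient descent on the loss above."""
    code = np.zeros(CODE_DIM)
    for _ in range(steps):
        Yhat, H = forward(X, code)
        dY = (Yhat - Y) / len(X)          # dLoss/dYhat
        dZ = (dY @ W2) * (1.0 - H ** 2)   # backprop through tanh
        grad = (dZ @ W1c).sum(axis=0)     # code is shared, so sum over frames
        code -= lr * grad
    return code


# Synthetic "new speaker" adaptation data, generated from a hidden code.
X_adapt = rng.normal(size=(50, LING_DIM))
true_code = rng.normal(scale=0.5, size=CODE_DIM)
Y_adapt, _ = forward(X_adapt, true_code)

est_code = estimate_code(X_adapt, Y_adapt)
print("loss with zero code:     ", loss(X_adapt, Y_adapt, np.zeros(CODE_DIM)))
print("loss with estimated code:", loss(X_adapt, Y_adapt, est_code))

# Training speakers would typically use one-hot speaker codes such as this one.
speaker_code = np.eye(CODE_DIM)[3]

# Hypothetical manipulation: blend the adapted code toward another code.
alpha = 0.5
blended = (1 - alpha) * est_code + alpha * speaker_code
Y_blend, _ = forward(X_adapt, blended)
```

Keeping the weights fixed and optimising only the low-dimensional code is what lets adaptation work from a small amount of data in this sketch, because only `CODE_DIM` parameters are estimated per new speaker.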

ID: 30908774