Exploiting Alternatives for Text-To-Speech Synthesis: From Machine to Human

Nicolas Obin, Christophe Veaux, Pierre Lanchantin

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Abstract

The absence of alternatives/variants is a dramatic limitation of text-to-speech (TTS) synthesis compared to the variety of human speech. This chapter introduces the use of speech alternatives/variants to improve TTS synthesis systems. Speech alternatives denote the variety of ways in which a speaker can pronounce a sentence, depending on linguistic constraints, the speaker's specific strategies, speaking style, and pragmatic constraints. During training, the symbolic and acoustic characteristics of a unit-selection speech synthesis system are statistically modelled with context-dependent parametric models (Gaussian mixture models (GMMs)/hidden Markov models (HMMs)). During synthesis, symbolic and acoustic alternatives are exploited using a Generalized Viterbi Algorithm (GVA) to determine the sequence of speech units used for the synthesis. Objective and subjective evaluations provide evidence that the use of speech alternatives significantly improves speech synthesis compared with conventional speech synthesis systems. Speech alternatives can also be used to vary the synthesized speech for a given text. The proposed method can easily be extended to HMM-based speech synthesis.
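To make the idea of a "generalized" Viterbi search concrete, the sketch below implements a K-best (list) Viterbi over a lattice of candidate speech units: instead of keeping a single best partial path per candidate, it keeps the K best, so the search returns several ranked unit sequences (alternatives) rather than one. This is an illustrative assumption, not the chapter's algorithm: the lattice layout, the names `k_best_viterbi`, `target_costs` and `concat_cost`, and the use of plain additive target and concatenation costs are hypothetical, whereas the chapter scores symbolic and acoustic alternatives with GMM/HMM models.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

# (accumulated cost, index of the chosen candidate at each position)
Path = Tuple[float, Tuple[int, ...]]


def k_best_viterbi(
    target_costs: Sequence[Sequence[float]],
    concat_cost: Callable[[int, int, int], float],
    k: int = 3,
) -> List[Path]:
    """Return up to k lowest-cost unit sequences through a candidate lattice.

    Hypothetical inputs for illustration:
      target_costs[t][i]  cost of using candidate unit i at position t
      concat_cost(t, i, j) cost of joining candidate i at t-1 to candidate j at t
    """
    # At position 0, each candidate starts exactly one partial path.
    beams: List[List[Path]] = [[(c, (i,))] for i, c in enumerate(target_costs[0])]
    for t in range(1, len(target_costs)):
        new_beams: List[List[Path]] = []
        for j, tc in enumerate(target_costs[t]):
            # Extend every surviving path of every predecessor candidate i to j.
            extended = [
                (cost + concat_cost(t, i, j) + tc, seq + (j,))
                for i, beam in enumerate(beams)
                for cost, seq in beam
            ]
            # Keep the k best partial paths ending in candidate j
            # (a standard Viterbi search would keep only one).
            new_beams.append(heapq.nsmallest(k, extended))
        beams = new_beams
    # Merge the per-candidate lists and keep the k best complete paths.
    return heapq.nsmallest(k, (p for beam in beams for p in beam))


if __name__ == "__main__":
    # Toy lattice: two positions, two candidate units each.
    target = [[0.0, 1.0], [0.5, 0.2]]

    def concat(t: int, i: int, j: int) -> float:
        # Toy join cost: free if the same candidate index continues.
        return 0.0 if i == j else 1.0

    print(k_best_viterbi(target, concat, k=2))
    # → [(0.5, (0, 0)), (1.2, (0, 1))]
```

With `k=1` the function reduces to ordinary Viterbi unit selection; larger `k` exposes ranked alternatives, which is one simple way to vary the output for the same text by choosing a path other than the first.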
Original language: English
Title of host publication: Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis
Subtitle of host publication: Part III
Editors: Keikichi Hirose, Jianhua Tao
Publisher: Springer Berlin Heidelberg
Pages: 189-202
Number of pages: 14
ISBN (Electronic): 978-3-662-45258-5
ISBN (Print): 978-3-662-45257-8
Publication status: Published, 26 Feb 2015

Publication series

Name: Prosody, Phonology and Phonetics
Publisher: Springer Berlin Heidelberg
ISSN (Print): 2197-8700
ISSN (Electronic): 2197-8719
