An analysis on the effects of speaker embedding choice in non auto-regressive TTS

Adriana Stan, Johannah O'Mahony

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

In this paper we introduce a first attempt on understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse if jointly learning the representations, and initialising them from pretrained models determine any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network’s core speech abstraction (i.e. zero conditioned) in terms of speaker identity and representation learning. We show that, regardless of the used set of embeddings and learning strategy, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training procedures adopted thus far.
Original language: English
Title of host publication: Proceedings of the 12th ISCA Speech Synthesis Workshop
Subtitle of host publication: (SSW2023)
Editors: Gérard Bailly, Thomas Hueber, Damien Lolive, Nicolas Obin, Olivier Perrotin
Place of Publication: Grenoble
Publisher: ISCA
Pages: 134-138
Number of pages: 5
Publication status: Published - 28 Aug 2023
Event: 12th ISCA Speech Synthesis Workshop - Grenoble, France
Duration: 26 Aug 2023 to 28 Aug 2023
https://ssw2023.org

Publication series

Name: Proceedings of the ISCA Workshop
Publisher: ISCA
ISSN (Print): 1680-8908

Conference

Conference: 12th ISCA Speech Synthesis Workshop
Abbreviated title: SSW
Country/Territory: France
City: Grenoble
Period: 26/08/23 to 28/08/23

Keywords / Materials (for Non-textual outputs)

  • speech synthesis
  • speaker embeddings
  • multi-speaker TTS
  • speaker disentanglement
  • speaker verification
  • non-autoregressive TTS
  • factorised TTS
