An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets

Pilar Oplustil Gallegos*, Jennifer Williams, Joanna Rownicka, Simon King

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
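As an illustration of the general idea only (not the authors' exact pipeline), the sketch below clusters per-speaker acoustic representations with k-means and keeps the speakers assigned to one chosen cluster. The embedding source, the number of clusters, and the rule for which cluster to keep are all assumptions made for the example.

```python
# Illustrative sketch, not the paper's method: cluster per-speaker acoustic
# representations and keep the speakers from one manually chosen cluster.
# Assumes embeddings are precomputed (e.g. averaged per speaker); the number
# of clusters and the kept cluster index are hypothetical choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def select_speaker_subset(speaker_ids, speaker_embeddings, n_clusters=5, keep_cluster=0):
    """Cluster per-speaker embeddings and return the speaker IDs in one cluster.

    speaker_ids: list of N speaker identifiers.
    speaker_embeddings: (N, D) array with one acoustic representation per speaker.
    """
    X = StandardScaler().fit_transform(speaker_embeddings)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return [sid for sid, lab in zip(speaker_ids, labels) if lab == keep_cluster]


# Hypothetical usage: 200 speakers with 256-dimensional averaged embeddings.
rng = np.random.default_rng(0)
ids = [f"spk_{i:03d}" for i in range(200)]
emb = rng.normal(size=(200, 256))
subset = select_speaker_subset(ids, emb, n_clusters=5, keep_cluster=2)
print(len(subset), "speakers selected")
```

In practice the kept cluster would be chosen by inspecting or listening to its speakers, or by training a TTS model per cluster and comparing synthesis quality; this selection step is an assumption here, not something specified in the abstract.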
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Pages: 1758-1762
Number of pages: 5
Volume: 2020-October
DOIs
Publication status: Published - 29 Oct 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 - 29 Oct 2020

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X

Conference

Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory: China
City: Shanghai
Period: 25/10/20 - 29/10/20

Keywords / Materials (for Non-textual outputs)

  • clustering
  • data
  • multi-speaker
  • sequence-to-sequence models
  • speaker representation
  • speech synthesis
