A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis

A. Gallardo-Antolín, J.M. Montero, S. King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection, which improve the precision and overall quality of the segmentation.
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Subtitle of host publication: Interspeech 2014
Publisher: ISCA
Pages: 2370-2374
Publication status: Published - 2014

Keywords / Materials (for Non-textual outputs)

  • diarization
  • audio segmentation
  • expressive text-to-speech
  • media recordings
