Invert-Classify: Recovering Discrete Prosody Inputs for Text-To-Speech

Nicholas Sanders, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Modeling prosody in Text-to-Speech (TTS) is challenging due to ambiguous orthography and the high cost of annotating prosodic events. This study focuses on the modeling of contrastive focus: the emphasis of a word to contrast it with presuppositions held by an interlocutor. Contrastive focus can be modeled in TTS using binary, symbolic word-level inputs in a supervised setting. To address the absence of annotated data, we propose the Invert-Classify method, which leverages a frozen TTS model and unlabeled parallel text-speech data to recover missing contrastive focus inputs. Our approach achieves a binary F-score of up to 0.71 for contrastive focus annotation recovery while using only 5-10% of the annotated training data. Furthermore, subjective listening tests show that training on additional data labeled via Invert-Classify improves overall synthesis quality while providing good control and plausible-sounding contrastive focus.
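
The abstract describes the method only at a high level, so the following is a minimal, self-contained sketch of how an invert-then-classify step could look. It is written in PyTorch, and every name in it (FrozenTTS, invert_classify, the toy acoustic mapping, the L1 loss, and all hyperparameters) is an illustrative assumption rather than the authors' implementation: the binary word-level focus inputs are relaxed to continuous values, optimised by gradient descent so the frozen model reconstructs a reference mel-spectrogram, and finally thresholded back to binary labels.

import torch
import torch.nn as nn

class FrozenTTS(nn.Module):
    """Hypothetical stand-in for a pretrained TTS acoustic model that maps a
    binary focus indicator per word to a mel-spectrogram. A real model would
    be loaded from a checkpoint; this toy mapping only illustrates the idea."""
    def __init__(self, mel_dim: int = 80, frames_per_word: int = 10):
        super().__init__()
        self.proj = nn.Linear(1, mel_dim)   # toy mapping: focus value -> mel frame
        self.frames_per_word = frames_per_word

    def forward(self, focus: torch.Tensor) -> torch.Tensor:
        # focus: (n_words,) in [0, 1]; output: (n_words * frames_per_word, mel_dim)
        per_word = self.proj(focus.unsqueeze(-1))
        return per_word.repeat_interleave(self.frames_per_word, dim=0)

def invert_classify(model, ref_mel, n_words, steps=200, lr=0.1, threshold=0.5):
    """Invert: optimise a continuous relaxation of the binary focus inputs so
    the frozen model reconstructs the reference speech. Classify: threshold
    the recovered values back to binary labels."""
    for p in model.parameters():
        p.requires_grad_(False)                     # keep the TTS model frozen
    logits = torch.zeros(n_words, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mel = model(torch.sigmoid(logits))          # relaxed binary inputs
        loss = nn.functional.l1_loss(mel, ref_mel)  # reconstruction loss (assumed)
        loss.backward()
        opt.step()
    return (torch.sigmoid(logits) > threshold).long()

# Toy usage: pretend word 3 of 5 carried contrastive focus in the reference.
torch.manual_seed(0)
model = FrozenTTS()
with torch.no_grad():
    ref_mel = model(torch.tensor([0., 0., 1., 0., 0.]))
print(invert_classify(model, ref_mel, n_words=5))   # expected: tensor([0, 0, 1, 0, 0])

The hard threshold in the final step is only the simplest choice; the "classify" stage could equally be a learned classifier over the recovered continuous values.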
Original language: English
Title of host publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1-7
Number of pages: 7
ISBN (Print): 979-8-3503-0690-3
Publication status: Published - 19 Jan 2024
Event: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) - Taipei, Taiwan
Duration: 16 Dec 2023 - 20 Dec 2023

Conference

Conference: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Period: 16/12/23 - 20/12/23
