Abstract
Modeling prosody in Text-to-Speech (TTS) is challenging because orthography is ambiguous and annotating prosodic events is costly. This study focuses on modeling contrastive focus: the emphasis of a word to contrast it with presuppositions held by an interlocutor. Contrastive focus can be modeled in TTS using binary, symbolic word-level inputs in a supervised setting. To address the absence of annotated data, we propose the Invert-Classify method, which leverages a frozen TTS model and unlabeled parallel text-speech data to recover missing contrastive focus inputs. Our approach achieves a binary F-score of up to 0.71 for contrastive focus annotation recovery while using only 5-10% of the annotated training data. Furthermore, subjective listening tests show that training on additional data labeled via Invert-Classify improves overall synthesis quality, while also providing good control and plausible-sounding contrastive focus.
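The abstract describes inverting a frozen TTS model against unlabeled parallel text-speech data to recover binary word-level focus inputs. The sketch below illustrates one plausible reading of that idea, assuming a gradient-based inversion of relaxed (continuous) focus inputs followed by thresholding; the `ToyTTS` module, tensor shapes, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the Invert-Classify idea: given a frozen TTS model
# that conditions on per-word binary focus inputs, recover missing focus
# labels for an unlabeled (text, speech) pair by optimizing relaxed focus
# inputs to match the recorded speech, then binarizing ("classify").

import torch
import torch.nn as nn

class ToyTTS(nn.Module):
    """Stand-in for the frozen TTS model: maps word embeddings plus a
    per-word focus feature to a mel-like output. Purely illustrative."""
    def __init__(self, n_words: int, d: int = 32, n_mels: int = 80):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d)
        self.focus_proj = nn.Linear(1, d)
        self.decoder = nn.Linear(d, n_mels)

    def forward(self, word_ids: torch.Tensor, focus: torch.Tensor):
        h = self.word_emb(word_ids) + self.focus_proj(focus.unsqueeze(-1))
        return self.decoder(h)  # (batch, n_words_in_utterance, n_mels)

def invert_classify(tts: nn.Module, word_ids, target_mel,
                    steps: int = 200, lr: float = 0.1, thresh: float = 0.5):
    # Freeze the TTS model; only the relaxed focus inputs are optimized.
    for p in tts.parameters():
        p.requires_grad_(False)
    logits = torch.zeros(word_ids.shape, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        focus = torch.sigmoid(logits)        # relax binary input to (0, 1)
        loss = nn.functional.mse_loss(tts(word_ids, focus), target_mel)
        loss.backward()
        opt.step()
    # "Classify" step: threshold recovered values back to binary labels.
    return (torch.sigmoid(logits) > thresh).long()

if __name__ == "__main__":
    torch.manual_seed(0)
    tts = ToyTTS(n_words=100)
    words = torch.randint(0, 100, (1, 6))
    true_focus = torch.tensor([[0., 0., 1., 0., 0., 0.]])
    with torch.no_grad():
        mel = tts(words, true_focus)         # stands in for recorded speech
    print(invert_classify(tts, words, mel))  # recovered binary focus labels
```

In this reading, the recovered binary labels could then serve as pseudo-annotations for supervised training, which is consistent with the abstract's report that training on additional data labeled via Invert-Classify improves synthesis quality.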
| Original language | English |
| --- | --- |
| Title of host publication | 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
| Publisher | Institute of Electrical and Electronics Engineers |
| Pages | 1-7 |
| Number of pages | 7 |
| ISBN (Print) | 979-8-3503-0690-3 |
| DOIs | |
| Publication status | Published - 19 Jan 2024 |
| Event | 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16 Dec 2023 → 20 Dec 2023 |