Abstract
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. However, multilingual TTS systems remain limited to resource-rich languages due to the lack of large paired corpora of text and studio-quality audio. Moreover, TTS systems are typically built from a single speaker's voice, but there is growing interest in synthesizing voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that conditions synthesis on quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our framework combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. The proposed model generalizes zero-shot not only to unseen speakers but also to unseen languages. We conducted comprehensive subjective and objective evaluations through a series of experiments. Our model proved effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the effectiveness of our method on two hypothetically low-resource languages. The results are promising: the proposed approach can synthesize audio that is intelligible and highly similar to the target speaker's voice, even without any training data for the new, unseen language.
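The abstract's key idea is conditioning synthesis on quantized latent speech representations. As a minimal illustrative sketch (not the authors' code), the snippet below shows how continuous frame-level features from a self-supervised speech model can be mapped to a sequence of discrete units via nearest-neighbour lookup in a codebook. The feature frames and codebook here are random stand-ins; a real pipeline would extract frames from a pretrained model such as HuBERT or wav2vec 2.0 and use a codebook learned by k-means.

```python
import numpy as np

def quantize_features(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code vector,
    # computed via broadcasting: shape (n_frames, n_codes).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One discrete unit ID per frame.
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))    # stand-in for 50 SSL feature frames
codebook = rng.normal(size=(8, 16))   # stand-in for a learned 8-entry codebook
units = quantize_features(frames, codebook)
print(units.shape)  # (50,) — a discrete unit sequence a TTS decoder could target
```

The resulting unit sequence is speaker- and language-agnostic enough to serve as an intermediate target, which is what makes zero-shot transfer to unseen speakers and languages plausible.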
| Original language | English |
|---|---|
| Pages (from-to) | 4036-4051 |
| Journal | IEEE/ACM Transactions on Audio, Speech and Language Processing |
| Volume | 32 |
| Early online date | 6 Sept 2024 |
| DOIs | |
| Publication status | Published - 2024 |
Keywords
- text-to-speech
- multilingual
- self-supervised representations
- low-resource
- zero-shot
Projects
Speech Generation for Indigenous Language Education
National Research Council Canada
3/02/23 → 2/12/25
Project: Research