Abstract
Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed “autovocoder” reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
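The sketch below illustrates why the inverse STFT counts as a simple, fast operation: when the complex spectrum (magnitude and phase) is available, a windowed overlap-add recovers the waveform almost exactly, with no learned vocoder. It is not the autovocoder itself. The learned encoder and decoder are omitted, and the function names, `n_fft=1024`, `hop=256`, the Hann window, and the synthetic test tone are all assumptions chosen for illustration. A mel-spectrogram discards phase and compresses the frequency axis, so it cannot be inverted this way.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Complex STFT of a 1-D signal, with reflect padding at both ends."""
    window = np.hanning(n_fft)
    padded = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def istft(spec, n_fft=1024, hop=256, length=None):
    """Inverse STFT by windowed overlap-add with squared-window normalisation."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)  # guard against division by zero at the edges
    out = out[n_fft // 2 :]        # drop the padding added by stft()
    return out[:length]


# Hypothetical test signal: one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = 0.5 * np.sin(2 * np.pi * 220.0 * t)

spec = stft(x)
y = istft(spec, length=len(x))
max_error = np.max(np.abs(x - y))
```

Here `max_error` measures how closely the round trip reproduces the input. Every step in `istft` is an inverse FFT, a multiplication, or an addition, so the same operation can be expressed in an automatic-differentiation framework and trained through, which is the property the abstract relies on.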
| Original language | English |
|---|---|
| Title of host publication | ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | Institute of Electrical and Electronics Engineers |
| Number of pages | 5 |
| ISBN (Electronic) | 9781728163277 |
| ISBN (Print) | 9781728163284 |
| Publication status | Published - 5 May 2023 |
| Event | 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023. https://2023.ieeeicassp.org/ |
Publication series
| Name | International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
|---|---|
| Publisher | IEEE |
| ISSN (Print) | 1520-6149 |
| ISSN (Electronic) | 2379-190X |
Conference
| Conference | 2023 IEEE International Conference on Acoustics, Speech and Signal Processing |
|---|---|
| Abbreviated title | ICASSP |
| Country/Territory | Greece |
| City | Rhodes Island |
| Period | 4/06/23 → 10/06/23 |
| Internet address | https://2023.ieeeicassp.org/ |