Abstract / Description of output
We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, voice conversion, and speech enhancement, the STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict the representations from inputs; however, generated spectra typically lack the fine structures close to the true data. To overcome these limitations and reconstruct spectra having finer structures, we propose a generative adversarial network (GAN)-based postfilter that is implicitly optimized to match the true feature distribution in adversarial learning. The challenge with this postfilter is that a GAN cannot be easily trained for very high-dimensional data such as the STFT. Therefore, we introduce a divide-and-concatenate strategy. We first divide the spectrograms into multiple frequency bands with overlap, train the GAN-based postfilter for the individual bands, and finally connect the bands with overlap. We applied our proposed postfilter to a DNN-based speech-synthesis task. The results show that our proposed postfilter can be used to reduce the gap between synthesized and target spectra, even in the highdimensional STFT domain.
Original language | English |
---|---|
Title of host publication | Proceedings Interspeech 2017 |
Publisher | International Speech Communication Association |
Pages | 3389-3393 |
Number of pages | 5 |
DOIs | |
Publication status | Published - 20 Aug 2017 |
Event | Interspeech 2017 - Stockholm, Sweden Duration: 20 Aug 2017 → 24 Aug 2017 http://www.interspeech2017.org/ |
Publication series
Name | Interspeech |
---|---|
Publisher | International Speech Communcation Association |
ISSN (Electronic) | 1990-9772 |
Conference
Conference | Interspeech 2017 |
---|---|
Country/Territory | Sweden |
City | Stockholm |
Period | 20/08/17 → 24/08/17 |
Internet address |