A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

Shinji Takaki, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In state-of-the-art statistical parametric speech synthesis systems, a speech analysis module, e.g., STRAIGHT spectral analysis, is generally used to obtain accurate and stable spectral envelopes, and low-dimensional acoustic features extracted from the obtained spectral envelopes are then used for training acoustic models. However, the spectral envelope estimation algorithm used in such a speech analysis module includes various processing steps derived from human knowledge. In this paper, we present our investigation of deep auto-encoder based, non-linear, data-driven and unsupervised low-dimensional feature extraction using FFT spectral envelopes for statistical parametric speech synthesis. Experimental results showed that using deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes in a text-to-speech synthesis system is indeed a promising approach.
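
The idea summarized in the abstract can be illustrated with a minimal sketch: compute log-magnitude FFT spectra frame by frame, then train a small auto-encoder whose bottleneck activations serve as low-dimensional features. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic test signal, the frame length and hop, the layer sizes (64 and 16), the tanh activations, the plain gradient-descent training, and the helper names `log_spectra`, `forward`, `encode` and `train_autoencoder`.

```python
import numpy as np

def log_spectra(signal, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window, return log-magnitude FFT spectra."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def forward(x, weights, biases):
    """Return the activations of every layer; hidden layers use tanh, output is linear."""
    acts = [x]
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = acts[-1] @ w + b
        acts.append(z if i == last else np.tanh(z))
    return acts

def encode(x, weights, biases, n_enc):
    """Run only the first n_enc (tanh) layers to get the bottleneck features."""
    a = x
    for w, b in zip(weights[:n_enc], biases[:n_enc]):
        a = np.tanh(a @ w + b)
    return a

def train_autoencoder(x, sizes=(64, 16), epochs=300, lr=0.01, seed=0):
    """Train a symmetric auto-encoder by full-batch gradient descent on the MSE.

    Layer sizes, learning rate and epoch count are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    dims = [x.shape[1], *sizes, *reversed(sizes[:-1]), x.shape[1]]
    weights = [rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
               for m, n in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(n) for n in dims[1:]]
    history = []
    for _ in range(epochs):
        acts = forward(x, weights, biases)
        err = acts[-1] - x
        history.append(float(np.mean(err ** 2)))
        grad = 2.0 * err / err.size              # dLoss/dOutput
        for i in reversed(range(len(weights))):
            grad_w = acts[i].T @ grad
            grad_b = grad.sum(axis=0)
            grad = grad @ weights[i].T           # propagate before updating
            if i > 0:
                grad *= 1.0 - acts[i] ** 2       # tanh derivative
            weights[i] -= lr * grad_w
            biases[i] -= lr * grad_b
    return weights, biases, history

# Synthetic stand-in for speech: two tones with slow amplitude modulation plus noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
signal = (np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
          + 0.3 * np.sin(2 * np.pi * 880 * t) + 0.05 * rng.normal(size=t.size))

x = log_spectra(signal)
x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)   # per-bin standardization
weights, biases, history = train_autoencoder(x)
codes = encode(x, weights, biases, n_enc=2)          # low-dimensional features
print(x.shape, codes.shape, history[0], history[-1])
```

In this sketch the 16-dimensional `codes` play the role that low-dimensional acoustic features would play when training an acoustic model. A real system would use recorded speech, a deeper network and a stronger optimizer, none of which are specified in the abstract.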
Original language: English
Title of host publication: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 5535-5539
Number of pages: 5
ISBN (Print): 978-1-4799-9988-0
Publication status: Published - Mar 2016
Event: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China
Duration: 20 Mar 2016 to 25 Mar 2016
https://www2.securecms.com/ICASSP2016/Default.asp

Conference

Conference: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Abbreviated title: ICASSP 2016
Country/Territory: China
City: Shanghai
Period: 20/03/16 to 25/03/16