Edinburgh Research Explorer

Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Original language: English
Title of host publication: Interspeech 2017
Number of pages: 5
Publication status: Published - 20 Aug 2017
Event: Interspeech 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
http://www.interspeech2017.org/

Conference

Conference: Interspeech 2017
Country: Sweden
City: Stockholm
Period: 20/08/17 to 24/08/17

Abstract

We propose a simple new representation for the FFT spectrum tailored to statistical parametric speech synthesis. It consists of four feature streams that describe magnitude, phase and fundamental frequency using real numbers. The proposed feature extraction method does not attempt to decompose the speech structure (e.g., into source+filter or harmonics+noise). By avoiding the simplifications inherent in decomposition, we can dramatically reduce the “phasiness” and “buzziness” typical of most vocoders. The method uses simple and computationally cheap operations and can operate at a lower frame rate than the 200 frames-per-second typical in many systems. It avoids heuristics and methods requiring approximate or iterative solutions, including phase unwrapping.
Two DNN-based acoustic models were built, from male and female speech data, using the Merlin toolkit. Subjective comparisons were made with a state-of-the-art baseline that uses the STRAIGHT vocoder. In all variants tested, and for both male and female voices, the proposed method substantially outperformed the baseline. We provide source code to enable our complete system to be replicated.
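
The abstract does not spell out the feature definitions, so the following is only an illustrative sketch of the general idea of describing a complex FFT spectrum with real numbers and without phase unwrapping. It assumes phase is carried as the unit-normalised real and imaginary parts of each bin (the cosine and sine of the phase angle), alongside log magnitude. The function names `dft`, `encode` and `decode`, the `floor` parameter and the test frame are all hypothetical and are not taken from the authors' released code; the fundamental frequency stream is omitted.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of a real frame, non-negative frequency bins only."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def encode(spectrum, floor=1e-12):
    """Assumed representation: log magnitude plus cos/sin of the phase per bin."""
    log_mag, cos_ph, sin_ph = [], [], []
    for z in spectrum:
        mag = max(abs(z), floor)  # floor avoids log(0) and division by zero
        log_mag.append(math.log(mag))
        cos_ph.append(z.real / mag)  # bounded in [-1, 1], no unwrapping needed
        sin_ph.append(z.imag / mag)
    return log_mag, cos_ph, sin_ph


def decode(log_mag, cos_ph, sin_ph):
    """Rebuild complex bins; renormalise (cos, sin) in case a model distorted them."""
    out = []
    for lm, c, s in zip(log_mag, cos_ph, sin_ph):
        norm = math.hypot(c, s) or 1.0
        out.append(math.exp(lm) * complex(c / norm, s / norm))
    return out


# Hypothetical test frame: two sinusoids over 32 samples.
frame = [
    math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.cos(2 * math.pi * 7 * t / 32)
    for t in range(32)
]
spectrum = dft(frame)
streams = encode(spectrum)
rebuilt = decode(*streams)
max_error = max(abs(a - b) for a, b in zip(spectrum, rebuilt))
```

Because the cosine and sine values are bounded and continuous across the wrap-around point of the phase angle, a representation of this kind can be predicted by a regression model without any unwrapping step. Whether this matches the paper's exact streams should be checked against the authors' source code.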

ID: 37321648