Edinburgh Research Explorer

Multiple-average-voice-based speech synthesis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access permissions

Open

Documents

    Rights statement: © Lanchantin, P., Gales, M. J. F., King, S., & Yamagishi, J. (2014). Multiple-average-voice-based speech synthesis. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 285-289). [6853603] Institute of Electrical and Electronics Engineers Inc. DOI: 10.1109/ICASSP.2014.6853603

    Accepted author manuscript, 355 KB, PDF-document

Original language: English
Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 285-289
Number of pages: 5
ISBN (Print): 9781479928927
DOIs: 10.1109/ICASSP.2014.6853603
Publication status: Published - 4 May 2014
Event: ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Florence, Italy
Duration: 4 May 2014 to 9 May 2014

Conference

Conference: ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country: Italy
City: Florence
Period: 4/05/14 to 9/05/14

Abstract

This paper describes a novel approach to speaker adaptation of statistical parametric speech synthesis systems, based on the interpolation of a set of average voice models (AVMs). Recent results have shown that the quality and naturalness of adapted voices depend on the distance from the average voice model used for speaker adaptation. This suggests using several AVMs, trained on carefully chosen speaker clusters, from which a more suitable AVM can be selected or interpolated during adaptation. In the proposed approach, a set of AVMs, called a multiple-AVM, is trained on distinct clusters of speakers; the clusters are initialised according to metadata, and speakers are iteratively re-assigned during the estimation process. During adaptation, each AVM in the multiple-AVM is first adapted towards the target speaker. The adapted means from the AVMs are then interpolated to yield the final speaker-adapted mean for synthesis. Speaker adaptation experiments on a corpus of British speakers with various regional accents show that the quality and naturalness of synthetic speech from the adapted voices are significantly higher than with a single factor-independent AVM selected according to the target speaker's characteristics.
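
The interpolation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interpolate_means`, the plain-list representation of mean vectors, and the fixed interpolation weights are all assumptions made for illustration. The abstract does not say how the weights are estimated or whether they are constrained, so the sketch simply takes them as given and forms a weighted sum of the per-AVM adapted means.

```python
from typing import List, Sequence


def interpolate_means(
    adapted_means: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted sum of per-AVM adapted mean vectors.

    adapted_means[k] is the mean vector obtained after adapting the k-th
    average voice model towards the target speaker (hypothetical input).
    weights[k] is the interpolation weight for that model; how the weights
    are obtained is outside the scope of this sketch.
    """
    if not adapted_means:
        raise ValueError("at least one adapted mean vector is required")
    if len(adapted_means) != len(weights):
        raise ValueError("need exactly one weight per adapted mean vector")

    dim = len(adapted_means[0])
    if any(len(mean) != dim for mean in adapted_means):
        raise ValueError("all mean vectors must have the same dimension")

    # Interpolate each dimension independently across the adapted models.
    return [
        sum(w * mean[d] for w, mean in zip(weights, adapted_means))
        for d in range(dim)
    ]


# Two hypothetical adapted means in a 2-dimensional feature space.
final_mean = interpolate_means([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75])
print(final_mean)  # [2.5, 3.5]
```

In a real system this operation would be applied to every Gaussian mean of the synthesis model, and the weights would typically be estimated from the target speaker's adaptation data rather than fixed by hand.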

    Research areas

  • cluster adaptive training, HMM-Based speech synthesis, multiple average voice model, speaker adaptation

ID: 19841631