Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

T. Kinnunen, E. S. Chng, Haizhou Li, Zhizheng Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion.

Index Terms: Voice conversion, F0 transformation, GMM, histogram equalization, text-independence
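As a rough illustration of two building blocks named in the abstract, the sketch below shows frame-level mean/variance normalization and a generic histogram equalization of F0 values. It is not the authors' implementation: the function names are hypothetical, the use of the log-F0 domain and the convention that unvoiced frames are stored as 0 are assumptions, and the equalization is plain empirical quantile matching rather than the specific procedure of the paper. The tri-frame alignment and the GMM conversion with delta features are not covered here.

```python
import math
import statistics
from bisect import bisect_right


def voiced_log_f0(f0_hz):
    """Return log-F0 of voiced frames only (assumption: unvoiced frames are 0)."""
    return [math.log(f) for f in f0_hz if f > 0]


def mean_variance_convert(f0_hz, src_train, tgt_train):
    """Frame-level mean/variance normalization, here done in the log-F0 domain.

    Assumes the source training data has non-zero F0 variance.
    """
    src, tgt = voiced_log_f0(src_train), voiced_log_f0(tgt_train)
    mu_s, sd_s = statistics.fmean(src), statistics.pstdev(src)
    mu_t, sd_t = statistics.fmean(tgt), statistics.pstdev(tgt)
    out = []
    for f in f0_hz:
        if f <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
        else:
            # Shift and scale each frame independently of its neighbours.
            out.append(math.exp(mu_t + (math.log(f) - mu_s) * sd_t / sd_s))
    return out


def histogram_equalize(f0_hz, ref_train):
    """Map each voiced value to the reference value at the same empirical quantile."""
    voiced = sorted(f for f in f0_hz if f > 0)
    ref = sorted(f for f in ref_train if f > 0)
    out = []
    for f in f0_hz:
        if f <= 0:
            out.append(0.0)
            continue
        rank = bisect_right(voiced, f)  # 1..len(voiced)
        # Ceiling division picks the reference sample at the matching quantile.
        idx = (rank * len(ref) + len(voiced) - 1) // len(voiced) - 1
        out.append(ref[idx])
    return out
```

Both functions operate frame by frame and need only unaligned F0 samples from each speaker, which is what makes this kind of conversion text-independent. Quantile matching restores the spread of the reference distribution, which is the intuition behind using histogram equalization against an oversmoothed output.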

Original language: English
Title of host publication: Interspeech 2010
Pages: 2-5
Number of pages: 3
Publication status: Published - 2010
