Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder

Jinhong Lu, Hiroshi Shimodaira

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas in the literature it is common to use spectral features such as MFCC as basic input features, together with additional features such as energy and F0. We show that, rather than combining different features derived from waveforms, it is more effective to use the waveforms directly to predict the corresponding head motion. The challenge with the waveform-based approach is that waveforms contain a large amount of information irrelevant to predicting head motion, which hinders the training of neural networks. To overcome this problem, we propose a canonical-correlation-constrained autoencoder (CCCAE), whose hidden layers are trained not only to minimise the reconstruction error but also to maximise the canonical correlation with head motion. Compared with an MFCC-based system, the proposed system shows comparable performance in objective evaluation and better performance in subjective evaluation.
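The abstract's key idea is an autoencoder whose bottleneck is pushed towards embeddings that correlate canonically with head motion. As a rough illustration only (the paper's actual network architecture, feature dimensions, and training procedure are not given here), the sketch below computes the sum of canonical correlations between an embedding matrix and head-motion features via a whitened cross-covariance, and a hypothetical combined objective that subtracts this correlation term from the reconstruction error; all function and variable names are invented for this sketch.

```python
import numpy as np

def canonical_correlation_sum(Z, Y, reg=1e-4):
    """Sum of canonical correlations between embeddings Z (N x dz)
    and head-motion features Y (N x dy). A small ridge term keeps
    the covariance matrices invertible."""
    Z = Z - Z.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = Z.shape[0]
    Szz = Z.T @ Z / (n - 1) + reg * np.eye(Z.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Szy = Z.T @ Y / (n - 1)
    # Whiten the cross-covariance with Cholesky factors; the singular
    # values of the whitened matrix are the canonical correlations.
    Lz = np.linalg.inv(np.linalg.cholesky(Szz))
    Ly = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Lz @ Szy @ Ly.T
    return np.linalg.svd(T, compute_uv=False).sum()

def cccae_objective(X, X_hat, Z, Y, lam=1.0):
    """Hypothetical combined loss: autoencoder reconstruction MSE
    minus lam times the canonical correlation between the embedding Z
    and head motion Y, so that maximising correlation lowers the loss."""
    mse = np.mean((X - X_hat) ** 2)
    return mse - lam * canonical_correlation_sum(Z, Y)

# Toy check: an embedding that is a linear map of head motion should
# achieve canonical correlations near 1 in every dimension.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 2))      # stand-in head-motion features
Z = Y @ rng.standard_normal((2, 3))    # perfectly correlated embedding
print(canonical_correlation_sum(Z, Y)) # close to 2.0 (two correlations ~1)
```

In a real training loop this objective would be differentiated through the network (e.g. with an automatic-differentiation framework) rather than evaluated on fixed matrices as above.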
Original language: English
Title of host publication: INTERSPEECH 2020
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 25 Oct 2020
Event: Interspeech 2020 - Virtual Conference, China
Duration: 25 Oct 2020 - 29 Oct 2020

Publication series

ISSN (Print): 1990-9772


Conference: Interspeech 2020
Abbreviated title: INTERSPEECH 2020
City: Virtual Conference


  • software agents
  • head motion
  • neural networks
  • speech-driven

