Analysing temporal sensitivity of VQ-VAE Sub-Phone Codebooks

Jason Fong, Jennifer Williams, Simon King

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract / Description of output

In this work we present an analysis of temporal sensitivity of VQ-VAE sub-phone token sequences. Previous work has demonstrated that VQ-VAE systems learn a type of sub-phone representation. However, a detailed examination of the representations themselves is currently lacking. We address this gap by exploring linguistic unit reorganisation. Our experiments show that sub-phone codebook sequences are temporally correlated enough to identify VQ codes that correspond to distinct linguistic units. We found that it is possible to extract VQ codes and re-arrange these linguistic units in a meaningful way (i.e. changing the word-order of a sentence). This work puts us one step closer to understanding how to modify pronunciations at a fine granularity, such as below the phone-level unit.
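
As a rough illustration of the re-arrangement idea described above, the sketch below reorders a sequence of VQ code indices word by word. It is not the authors' implementation: the function name, the toy code indices and the word boundaries are all hypothetical, and it assumes that frame-level word boundaries are already available (for example, from a forced aligner).

```python
from typing import List, Tuple


def reorder_word_spans(
    codes: List[int],
    spans: List[Tuple[int, int]],
    new_order: List[int],
) -> List[int]:
    """Concatenate the code sub-sequence of each word in the requested order.

    codes     -- VQ codebook indices, one per encoder frame (hypothetical data)
    spans     -- (start, end) frame indices per word, end exclusive (assumed known)
    new_order -- permutation of word indices giving the new word order
    """
    if sorted(new_order) != list(range(len(spans))):
        raise ValueError("new_order must be a permutation of the span indices")
    reordered: List[int] = []
    for idx in new_order:
        start, end = spans[idx]
        reordered.extend(codes[start:end])
    return reordered


# Toy example: three "words", each covering three frames of VQ codes.
codes = [7, 7, 3, 12, 12, 5, 9, 9, 1]
spans = [(0, 3), (3, 6), (6, 9)]

# Reverse the word order.
swapped = reorder_word_spans(codes, spans, [2, 1, 0])
print(swapped)  # [9, 9, 1, 12, 12, 5, 7, 7, 3]
```

In a setup like the one the abstract describes, the reordered index sequence would presumably be passed back through the VQ-VAE decoder to synthesise the sentence with the new word order; that step depends on the specific model and is not shown here.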
Original language: English
Title of host publication: Proc. 11th ISCA Speech Synthesis Workshop
Publication status: Published - 28 Aug 2021
Event: The 11th ISCA Speech Synthesis Workshop (SSW11) - Gárdony, Hungary
Duration: 26 Aug 2021 to 28 Aug 2021
Conference number: 11


Conference: The 11th ISCA Speech Synthesis Workshop (SSW11)
Abbreviated title: SSW11

Keywords / Materials (for Non-textual outputs)

  • VQ-VAE
  • speech synthesis
  • representation learning

