Abstract / Description of output
In this work we present an analysis of the temporal sensitivity of VQ-VAE sub-phone token sequences. Previous work has demonstrated that VQ-VAE systems learn a type of sub-phone representation. However, a detailed examination of the representations themselves is currently lacking. We address this gap by exploring linguistic unit reorganisation. Our experiments show that sub-phone codebook sequences are temporally correlated enough to identify VQ codes that correspond to distinct linguistic units. We find that it is possible to extract VQ codes and re-arrange these linguistic units in a meaningful way (e.g. changing the word order of a sentence). This work puts us one step closer to understanding how to modify pronunciations at a fine granularity, such as below the phone level.
Original language | English |
---|---|
Title of host publication | Proc. 11th ISCA Speech Synthesis Workshop |
Pages | 27--231 |
Publication status | Published - 28 Aug 2021 |
Event | The 11th ISCA Speech Synthesis Workshop (SSW11), Gárdony, Hungary. Duration: 26 Aug 2021 → 28 Aug 2021. Conference number: 11. https://ssw11.hte.hu |
Conference
Conference | The 11th ISCA Speech Synthesis Workshop (SSW11) |
---|---|
Abbreviated title | SSW11 |
Country/Territory | Hungary |
City | Gárdony |
Period | 26/08/21 → 28/08/21 |
Internet address | https://ssw11.hte.hu |
Keywords / Materials (for Non-textual outputs)
- VQ-VAE
- speech synthesis
- representation learning