Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation

Christiaan Jacobs, Yevgen Matusevych, Herman Kamper

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an unseen zero-resource language. We consider how a recent contrastive learning loss can be used in both the purely unsupervised and multilingual transfer settings. Firstly, we show that terms from an unsupervised term discovery system can be used for contrastive self-supervision, resulting in improvements over previous unsupervised monolingual AWE models. Secondly, we consider how multilingual AWE models can be adapted to a specific zero-resource language using discovered terms. We find that self-supervised contrastive adaptation outperforms adapted multilingual correspondence autoencoder and Siamese AWE models, giving the best overall results in a word discrimination task on six zero-resource languages.
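To illustrate the idea of contrastive self-supervision over AWEs, the sketch below implements a generic InfoNCE-style contrastive loss with NumPy. This is a minimal illustration, not the paper's exact formulation: the pairing of anchors with positives (e.g. segments matched by an unsupervised term discovery system), the temperature value, and the function name are all assumptions for the example.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over fixed-dimensional embeddings.

    anchors, positives: (N, D) arrays of AWEs; row i of each forms a
    positive pair (e.g. two discovered segments of the same word type),
    and all other rows in the batch act as negatives.
    """
    # Cosine similarity: L2-normalise, then take dot products.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) / temperature          # (N, N) scaled similarities

    # Softmax cross-entropy where the correct "class" for anchor i
    # is its own positive (the diagonal of the similarity matrix).
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimising this loss pulls embeddings of matched segments together while pushing apart embeddings of other segments in the batch, which is what drives the word discrimination improvements described in the abstract.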
Original language: English
Title of host publication: 2021 IEEE Spoken Language Technology Workshop (SLT)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 8
ISBN (Electronic): 978-1-7281-7066-4
ISBN (Print): 978-1-7281-7067-1
Publication status: Published - 25 Mar 2021
Event: IEEE Spoken Language Technology Workshop
Duration: 19 Jan 2021 – 22 Jan 2021


Conference: IEEE Spoken Language Technology Workshop
Abbreviated title: SLT 2021

Keywords / Materials (for Non-textual outputs)

  • Acoustic word embeddings
  • unsupervised speech processing
  • transfer learning
  • self-supervised learning


