Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Acoustic word embeddings are typically created by training a pooling function using pairs of word-like units. For unsupervised systems, these are mined using k-nearest neighbor (KNN) search, which is slow. Recently, mean-pooled representations from a pre-trained self-supervised English model were suggested as a promising alternative, but their performance on target languages was not fully competitive. Here, we explore improvements to both approaches: we use continued pre-training to adapt the self-supervised model to the target language, and we use a multilingual phone recognizer (MPR) to mine phone n-gram pairs for training the pooling function. Evaluating on four languages, we show that both methods outperform a recent approach on word discrimination. Moreover, the MPR method is orders of magnitude faster than KNN, and is highly data efficient. We also show a small improvement from performing learned pooling on top of the continued pre-trained representations.
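For readers unfamiliar with the mean-pooling baseline the abstract builds on, the sketch below shows how a fixed-size acoustic word embedding can be formed by mean-pooling frame-level features from a pre-trained self-supervised speech encoder over a word segment. This is a minimal illustration, not the authors' pipeline: the model name (facebook/wav2vec2-base), the 50 frames-per-second rate, and the helper function are assumptions made for the example.

```python
# Minimal sketch of a mean-pooled acoustic word embedding; illustrative only,
# not the paper's exact system. Model name and frame rate are assumptions.
import torch
from transformers import AutoFeatureExtractor, AutoModel

MODEL_NAME = "facebook/wav2vec2-base"  # assumed self-supervised encoder
FRAME_RATE = 50.0                      # wav2vec 2.0-style encoders emit ~50 frames/s

extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

def acoustic_word_embedding(waveform, start_s, end_s):
    """Mean-pool encoder frames inside a word segment (waveform: 1-D numpy
    array sampled at 16 kHz; start_s/end_s: word boundaries in seconds)."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state[0]  # (num_frames, hidden_dim)
    lo = int(start_s * FRAME_RATE)
    hi = max(int(end_s * FRAME_RATE), lo + 1)            # keep at least one frame
    return frames[lo:hi].mean(dim=0)                     # (hidden_dim,)

# Word discrimination is then scored by comparing embeddings, e.g.
# torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0): same-word pairs
# should score higher than different-word pairs.
```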
Original language: English
Title of host publication: Proc. INTERSPEECH 2023
Publisher: International Speech Communication Association
Pages: 406-410
Number of pages: 5
Publication status: Published - 20 Aug 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech
ISSN (Print): 1990-9772

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 - 24/08/23
Internet address: https://www.interspeech2023.org/

Keywords / Materials (for Non-textual outputs)

  • acoustic word embeddings
  • semi-supervised learning
  • continued pre-training
  • low-resource languages
  • unwritten languages
