Analyzing acoustic word embeddings from pre-trained self-supervised speech models

Ramon Sanabria Teixidor, Hao Tang, Sharon Goldwater

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53 (as well as Wav2Vec 2.0 trained on English).
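
The mean-pooling idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame-level features are random placeholders standing in for the output of a pre-trained model, the 768-dimensional feature size and the segment lengths are assumptions, and cosine distance is an assumed choice for comparing two embeddings. The names `mean_pool` and `cosine_distance` are hypothetical helpers.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average a (num_frames, dim) feature matrix over time into one dim-sized vector."""
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, dim) array")
    return frames.mean(axis=0)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embeddings (assumed comparison metric)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder features: in practice these would be frame-level representations
# extracted by a self-supervised model for two spoken word segments.
rng = np.random.default_rng(0)
segment_a = rng.normal(size=(37, 768))  # 37 frames, assumed feature size 768
segment_b = rng.normal(size=(52, 768))  # 52 frames, same feature size

# Segments of different lengths map to embeddings of the same size.
awe_a = mean_pool(segment_a)
awe_b = mean_pool(segment_b)
distance = cosine_distance(awe_a, awe_b)
```

Because pooling removes the time axis, word segments of any duration end up in the same vector space and can be compared directly, which is what a word discrimination evaluation requires.
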
Original language: English
Title of host publication: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: Institute of Electrical and Electronics Engineers
Number of pages: 5
ISBN (Electronic): 9781728163277
ISBN (Print): 9781728163284
Publication status: Published - 5 May 2023
Event: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing - Rhodes Island, Greece
Duration: 4 Jun 2023 to 10 Jun 2023
https://2023.ieeeicassp.org/

Publication series

Name: International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Publisher: IEEE
ISSN (Print): 1520-6149
ISSN (Electronic): 2379-190X

Conference

Conference: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing
Abbreviated title: ICASSP
Country/Territory: Greece
City: Rhodes Island
Period: 4/06/23 to 10/06/23
