Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models

Léa-Marie Lam-Yee-Mui, Lucas Ondel Yang, Ondřej Klejch

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper investigates how a hybrid automatic speech recognition model trained on 10 hours of transcribed data can be improved with 200 hours of untranscribed data in low-resource languages. First, we compare baseline methods of cross-lingual transfer with MFCC features and features extracted with the multilingual self-supervised model XLSR-53. Subsequently, we compare two approaches that can leverage the untranscribed data: semi-supervised training with LF-MMI and continued self-supervised pre-training of XLSR-53. Our results on well-resourced English broadcast data derived from MGB show that the two methods achieve 18% and 27% relative improvement over the baseline, respectively. On the low-resource South African Soap Opera dataset, the relative improvement with semi-supervised training is only 3% due to the inherently weak language model. However, continued pre-training achieves an 8.6% relative improvement because it does not rely on any external information.
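As a rough illustration of the feature-extraction step the abstract contrasts with MFCCs, the sketch below pulls frame-level representations out of XLSR-53. It assumes the publicly released Hugging Face checkpoint facebook/wav2vec2-large-xlsr-53 and an illustrative audio file name; the paper does not prescribe this toolchain.

```python
# Minimal sketch: extracting XLSR-53 features as a drop-in alternative to MFCCs.
# Model ID and file path are illustrative assumptions, not taken from the paper.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-large-xlsr-53"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID)
model.eval()

# Load one utterance, downmix to mono, and resample to the 16 kHz rate
# that XLSR-53 expects.
waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sample_rate, 16_000)

# Normalise and batch the raw waveform, then run the encoder without gradients.
inputs = feature_extractor(
    waveform.numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)

# One 1024-dimensional vector per ~20 ms frame of audio.
features = outputs.last_hidden_state.squeeze(0)
print(features.shape)  # e.g. torch.Size([num_frames, 1024])
```

In a hybrid pipeline such as the one described, these vectors would replace MFCCs as input to the acoustic model; continued pre-training, by contrast, would first update the encoder weights on the untranscribed audio before this extraction step.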
Original language: English
Title of host publication: Proc. INTERSPEECH 2023
Publisher: International Speech Communication Association
Pages: 87-91
Number of pages: 5
DOIs
Publication status: Published - 20 Aug 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 – 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 – 24/08/23
Internet address: https://www.interspeech2023.org/

Keywords

  • low-resource automatic speech recognition
  • self-supervised training
  • semi-supervised training
