Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We present a method for cross-lingually training an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, which we follow with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR which does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When used to generate pseudo-labels for semi-supervised training, we obtain WERs that range from 32.5% to just 1.9% absolute worse than the equivalent fully supervised models trained on the same data.
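To make the abstract's pipeline concrete, the sketch below illustrates the general idea of decipherment from unpaired data: given recognised phone sequences and target-language text that are *not* aligned, learn a symbol mapping and use it to produce pseudo-labels. This is only a toy frequency-rank-matching stand-in, not the paper's actual decipherment algorithm; all function and variable names here are hypothetical.

```python
from collections import Counter

def toy_decipher(phone_seqs, text_seqs):
    """Toy decipherment: map each recognised phone symbol to a target
    symbol by matching unigram frequency ranks across the two unpaired
    corpora. Illustrative only -- the paper's algorithm is different."""
    phone_rank = [s for s, _ in Counter(
        p for seq in phone_seqs for p in seq).most_common()]
    text_rank = [s for s, _ in Counter(
        c for seq in text_seqs for c in seq).most_common()]
    return dict(zip(phone_rank, text_rank))

# Unpaired data: output of a universal phone recogniser, and
# target-language text (hypothetical toy symbols).
phones = [["A", "B", "A", "C"], ["B", "A", "A"]]
texts = [["a", "b", "a", "c"], ["b", "a", "a"]]

mapping = toy_decipher(phones, texts)
# Deciphered sequences can then serve as pseudo-labels for
# flat-start semi-supervised acoustic model training.
pseudo_labels = [[mapping[p] for p in seq] for seq in phones]
```

In the paper itself the decipherment is learned from as little as 20 minutes of target-language speech; this sketch only shows where such a mapping sits in the pseudo-labelling pipeline.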
Original language: English
Title of host publication: Proceedings of Interspeech 2022
Editors: Hanseok Ko, John H. L. Hansen
Publisher: ISCA
Pages: 2288-2292
Number of pages: 5
Publication status: Published - 18 Sept 2022
Event: Interspeech 2022 - Incheon, Korea, Republic of
Duration: 18 Sept 2022 - 22 Sept 2022
Conference number: 23
https://interspeech2022.org/

Conference

Conference: Interspeech 2022
Country/Territory: Korea, Republic of
City: Incheon
Period: 18/09/22 - 22/09/22

Keywords / Materials (for Non-textual outputs)

  • automatic speech recognition
  • cross-lingual transfer
  • decipherment
  • semi-supervised training
