Silent versus modal multi-speaker speech recognition from ultrasound and video

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
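
The convex-hull estimate of articulatory space mentioned in the abstract can be illustrated with a minimal sketch. It assumes each tongue spline is a list of (x, y) points in a shared coordinate frame, pools the points of all splines, and reports the area of their convex hull. The function names, the pooling step, and the toy coordinates are assumptions made for illustration; this record does not describe the authors' actual tooling or spline format.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def articulatory_space_area(splines: List[List[Point]]) -> float:
    """Hypothetical helper: hull area of all spline points pooled together."""
    pooled = [pt for spline in splines for pt in spline]
    return polygon_area(convex_hull(pooled))


# Toy data (not from the paper): two splines spanning a 2 x 2 region.
toy_splines = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)],
    [(0.0, 2.0), (1.0, 1.5), (2.0, 2.0)],
]
print(articulatory_space_area(toy_splines))
```

Under this reading, a speaker whose silent-speech splines yield a smaller hull area than their modal-speech splines would be covering a smaller articulatory space, which is the kind of comparison the abstract reports.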
Original language: English
Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher: International Speech Communication Association
Pages: 641-645
Number of pages: 5
Publication status: E-pub ahead of print - 3 Sept 2021
Event: Interspeech 2021: The 22nd Annual Conference of the International Speech Communication Association - Brno, Czech Republic
Duration: 30 Aug 2021 to 3 Sept 2021
Conference number: 22
https://www.interspeech2021.org

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2021
Country/Territory: Czech Republic
City: Brno
Period: 30/08/21 to 3/09/21

Keywords / Materials (for Non-textual outputs)

  • silent speech interfaces
  • silent speech
  • ultrasound tongue imaging
  • video lip imaging
  • articulatory speech recognition
