Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR

Zeyu Zhao, Peter Bell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

End-to-end (E2E) Automatic Speech Recognition (ASR) has gained popularity in recent years, with most research focusing on designing novel neural network architectures, speech representations, and loss functions. However, the importance of topology in E2E ASR has been largely neglected. There are many aspects of topology to consider; in this paper, we focus on the relationship between topologies’ minimum traversal time and output frame rate, the number of distinct states for each output unit, and the flexibility of alignments admitted. We examine several different topologies on two datasets: WSJ and Librispeech. Our experiments reveal that different frame rates have varying optimal topologies and that the commonly used Connectionist Temporal Classification (CTC) topology is not always optimal. Our findings suggest that the choice of topology is an important consideration in the design of E2E ASR systems.
Original language: English
Title of host publication: Proc. INTERSPEECH 2023
Publisher: International Speech Communication Association
Pages: 4903-4907
Number of pages: 5
Publication status: Published - 20 Aug 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 - 24/08/23

Keywords / Materials (for Non-textual outputs)

  • automatic speech recognition
  • end-to-end ASR
  • differentiable WFST
