Abstract / Description of output
End-to-end (E2E) Automatic Speech Recognition (ASR) has gained popularity in recent years, with most research focusing on designing novel neural network architectures, speech representations, and loss functions. However, the importance of topology in E2E ASR has been largely neglected. There are many aspects of topology to consider; in this paper, we focus on the relationship between topologies’ minimum traversal time and output frame rate, the number of distinct states for each output unit, and the flexibility of alignments admitted. We examine several different topologies on two datasets: WSJ and Librispeech. Our experiments reveal that different frame rates have varying optimal topologies and that the commonly used Connectionist Temporal Classification (CTC) topology is not always optimal. Our findings suggest that the choice of topology is an important consideration in the design of E2E ASR systems.
Original language | English |
---|---|
Title of host publication | Proc. INTERSPEECH 2023 |
Publisher | International Speech Communication Association |
Pages | 4903-4907 |
Number of pages | 5 |
Publication status | Published - 20 Aug 2023 |
Event | Interspeech 2023 - Dublin, Ireland; Duration: 20 Aug 2023 → 24 Aug 2023; Conference number: 24; https://www.interspeech2023.org/ |
Publication series
Name | Interspeech |
---|---|
ISSN (Electronic) | 1990-9772 |
Conference
Conference | Interspeech 2023 |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 20/08/23 → 24/08/23 |
Keywords / Materials (for Non-textual outputs)
- automatic speech recognition
- end-to-end ASR
- differentiable WFST