Comparison and analysis of new curriculum criteria for end-to-end ASR

Georgios Karakasidis*, Mikko Kurimo*, Peter Bell*, Tamás Grósz*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

Traditionally, teaching a human and training a Machine Learning (ML) model are quite different processes, but in both cases organized, structured learning can enable faster and better understanding of the underlying concepts. For example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards more complex structures such as words and sentences. Motivated by this observation, researchers have started to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty, resembles the notion of the curriculum in education, the methodology became known as Curriculum Learning (CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition systems, specifically focusing on the so-called end-to-end models. These models consist of a single, large-scale neural network that performs the recognition task, in contrast to the traditional way of having several specialized components focusing on different subtasks (e.g., acoustic and language modelling). We demonstrate that end-to-end models can achieve better performance if they are provided with an organized training set consisting of examples that exhibit an increasing level of difficulty. To impose structure on the training set and to define the notion of an easy example, we explored multiple solutions that use either external, static scoring methods or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions that control how much data is presented to the network during each training epoch. Our proposed curriculum learning strategies were tested on the task of speech recognition on two data sets, one containing spontaneous Finnish speech where volunteers were asked to speak about a given topic, and one containing planned English speech. Empirical results showed that a good curriculum strategy can yield performance improvements and speed up convergence. After a given number of epochs, our best strategy achieved a 5.6% and 3.4% decrease in terms of test set word error rate for the Finnish and English data sets, respectively.
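The abstract describes two ingredients of a curriculum: a difficulty score that orders the training examples, and a pacing function that controls how much of the (easiest) data the network sees at each epoch. The sketch below is a minimal, hypothetical illustration of that idea in Python; the names (linear_pacing, curriculum_subset, score_fn) and the linear schedule are assumptions for illustration only, not the scoring criteria or pacing functions actually proposed in the paper.

    import random

    def linear_pacing(epoch, total_epochs, start_frac=0.2):
        # Fraction of the easiest training data visible at a given epoch;
        # grows linearly from start_frac to 1.0 (an assumed schedule).
        frac = start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1)
        return min(1.0, frac)

    def curriculum_subset(examples, score_fn, epoch, total_epochs):
        # Rank examples from easy to hard using an external, static score
        # (e.g. utterance duration) or a score fed back from the model itself.
        ranked = sorted(examples, key=score_fn)
        k = int(len(ranked) * linear_pacing(epoch, total_epochs))
        subset = ranked[:max(1, k)]
        random.shuffle(subset)  # shuffle within the currently allowed subset
        return subset

    # Usage (illustrative): examples could be (audio_path, transcript, duration)
    # tuples with score_fn = lambda ex: ex[2], so the model starts on short
    # utterances and is gradually exposed to the full training set.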
Original language: English
Article number: 103113
Journal: Speech Communication
Early online date: 31 Jul 2024
DOIs
Publication status: Published - Sept 2024

Keywords

  • curriculum learning
  • speech recognition
  • ASR
  • end-to-end
  • deep learning
