Regarding the existence of the internal language model in CTC-based E2E ASR

Zeyu Zhao, Peter Bell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Some End-to-End (E2E) Automatic Speech Recognition (ASR) models, such as Attention-based Encoder-Decoder (AED) and Recurrent Neural Network Transducer (RNN-T), are known to have components that effectively act as internal language models (ILM), implicitly modelling the prior probability of the output sequence. However, the existence of an ILM in pure Connectionist Temporal Classification (CTC) ASR systems remains debated. In this paper, we investigate the existence and strength of an ILM in CTC systems. Since CTC posterior probabilities cannot be analytically factorised, we propose a novel empirical method to probe the ILM. After validating our method on a hybrid DNN model with various external language models, we apply it to CTC models trained under different conditions, examining the effects of training data, modelling units, and training or pre-training methods. Our results show no strong evidence of an ILM in CTC-based ASR systems, even with the largest training dataset in our experiments. However, we make the surprising finding that when a CTC encoder is jointly trained with an AED loss, an ILM emerges, even when only the CTC component is used in decoding.
Original language: English
Title of host publication: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway, NJ, USA
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1-5
Number of pages: 5
ISBN (Electronic): 9798350368741
ISBN (Print): 9798350368758
Publication status: Published - 7 Mar 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing - Hyderabad International Convention Centre, Hyderabad, India
Duration: 6 Apr 2025 to 11 Apr 2025
https://2025.ieeeicassp.org/

Publication series

Name: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
Publisher: Institute of Electrical and Electronics Engineers
ISSN (Print): 1520-6149
ISSN (Electronic): 2379-190X

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
Abbreviated title: ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6/04/25 to 11/04/25

Keywords / Materials (for Non-textual outputs)

  • automatic speech recognition
  • connectionist temporal classification
  • internal language model
