Abstract
Recently, there has been increasing interest in end-to-end speech recognition using neural networks, with no reliance on hidden Markov models (HMMs) for sequence modelling as in the standard hybrid framework. The recurrent neural network (RNN) encoder-decoder is such a model, performing sequence-to-sequence mapping without any predefined alignment. This model first transforms the input sequence into a fixed-length vector representation, from which the decoder recovers the output sequence. In this paper, we extend our previous work on this model for large vocabulary end-to-end speech recognition. We first present a more effective stochastic gradient descent (SGD) learning rate schedule that can significantly improve the recognition accuracy. We then extend the decoder with long memory by introducing another recurrent layer that performs implicit language modelling. Finally, we demonstrate that using multiple recurrent layers in the encoder can reduce the word error rate. Our experiments were carried out on the Switchboard corpus using a training set of around 300 hours of transcribed audio data, and we achieved significantly higher recognition accuracy, thereby reducing the gap to the hybrid baseline.
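The abstract describes the core mechanism: an encoder RNN compresses the input sequence into a single fixed-length vector, and a decoder RNN unrolls from that vector to produce the output sequence. The following is a minimal illustrative sketch of that idea; all dimensions, weight initialisations, and function names here are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Illustrative vanilla-RNN encoder-decoder (no attention), following the
# fixed-length-vector scheme described in the abstract. All sizes are
# assumed toy values, not the paper's configuration.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x, h, Wx, Wh, b):
    """One RNN step: new hidden state from input frame x and previous h."""
    return np.tanh(x @ Wx + h @ Wh + b)

def encode(xs, Wx, Wh, b):
    """Run the encoder over the input; the final hidden state is the
    fixed-length vector representation of the whole sequence."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = rnn_step(x, h, Wx, Wh, b)
    return h

def decode(c, steps, Wh, Wy, b):
    """Unroll the decoder from the context vector c, emitting one output
    distribution per step."""
    h = c
    ys = []
    for _ in range(steps):
        h = np.tanh(h @ Wh + b)
        ys.append(softmax(h @ Wy))
    return np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 13, 32, 10, 5          # assumed toy dimensions
Wx = rng.normal(size=(d_in, d_h)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1
Wy = rng.normal(size=(d_h, d_out)) * 0.1
b = np.zeros(d_h)

xs = rng.normal(size=(T, d_in))              # e.g. T acoustic feature frames
c = encode(xs, Wx, Wh, b)                    # fixed-length representation
ys = decode(c, steps=4, Wh=Wh, Wy=Wy, b=b)   # one distribution per output step
print(c.shape, ys.shape)                     # (32,) (4, 10)
```

The paper's extensions (an extra recurrent layer in the decoder for implicit language modelling, and multiple recurrent layers in the encoder) would stack additional `rnn_step`-style layers on top of this skeleton.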
| Original language | English |
|---|---|
| Title of host publication | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | Institute of Electrical and Electronics Engineers |
| Pages | 5060-5064 |
| Number of pages | 5 |
| ISBN (Electronic) | 978-1-4799-9988-0 |
| ISBN (Print) | 978-1-4799-9987-3 |
| DOIs | |
| Publication status | Published - Mar 2016 |
| Event | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China. Duration: 20 Mar 2016 → 25 Mar 2016. https://www2.securecms.com/ICASSP2016/Default.asp |
Conference
| Conference | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 |
|---|---|
| Abbreviated title | ICASSP 2016 |
| Country/Territory | China |
| City | Shanghai |
| Period | 20/03/16 → 25/03/16 |
| Internet address | https://www2.securecms.com/ICASSP2016/Default.asp |
Projects
- Natural Speech Technology
  Renals, S. (Principal Investigator) & King, S. (Co-investigator)
  1/05/11 → 31/07/16
  Project: Research (Finished)