Edinburgh Research Explorer

Simplifying very deep convolutional neural network architectures for robust speech recognition

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Original language: English
Title of host publication: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 9
ISBN (Electronic): 978-1-5090-4788-8
ISBN (Print): 978-1-5090-4789-5
Publication status: Published - 25 Jan 2018
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop - Okinawa, Japan
Duration: 16 Dec 2017 - 20 Dec 2017


Conference: 2017 IEEE Automatic Speech Recognition and Understanding Workshop
Abbreviated title: ASRU 2017


Very deep convolutional neural networks (VDCNNs) have been successfully used in computer vision. More recently, VDCNNs have been applied to speech recognition, using architectures adopted from computer vision. In this paper, we experimentally analyse the role of the components in VDCNN architectures for robust speech recognition. We propose a number of simplified VDCNN architectures, taking into account the use of fully-connected layers and down-sampling approaches. We investigate three ways to down-sample feature maps: max-pooling, average-pooling, and convolution with increased stride. Our proposed model, consisting solely of convolutional (conv) layers and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared with other VDCNN architectures typically used in speech recognition. We have also extended our experiments to the MGB-3 task of multi-genre broadcast recognition using BBC TV recordings. The MGB-3 results indicate that the same architecture also achieves the best result among our VDCNNs on this task.
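The three down-sampling approaches compared in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single-channel feature map, a 2×2 window with stride 2, and an illustrative (untrained) kernel for the strided convolution; in the paper's all-convolutional model the equivalent weights would be learned.

```python
import numpy as np

def _blocks(x):
    """Split an (H, W) feature map into non-overlapping 2x2 blocks."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2)

def max_pool(x):
    """Down-sample by taking the maximum of each 2x2 block."""
    return _blocks(x).max(axis=(1, 3))

def avg_pool(x):
    """Down-sample by averaging each 2x2 block."""
    return _blocks(x).mean(axis=(1, 3))

def strided_conv(x, kernel):
    """Down-sample with a 2x2 convolution applied at stride 2.
    The kernel here is supplied by hand for illustration; in a
    network it would be a trained parameter."""
    return np.einsum('iajb,ab->ij', _blocks(x), kernel)
```

Note that a strided convolution with a uniform kernel of 0.25 in each position reduces exactly to average pooling; the appeal of the strided-conv variant is that the network can instead learn the down-sampling weights.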


