Abstract
End-to-end models achieve impressive speech recognition results on clean datasets but perform worse on noisy ones. To address this, we propose transfer learning from a clean dataset (WSJ) to a noisy dataset (CHiME-4) for connectionist temporal classification (CTC) models. We argue that the clean classifier (the upper layers of a neural network trained on clean data) can force the feature extractor (the lower layers) to learn the underlying noise-invariant patterns in the noisy dataset. While training on the noisy dataset, the clean classifier is either frozen or trained with a small learning rate, whereas the feature extractor is trained with no learning rate re-scaling. The proposed method gives up to 15.5% relative character error rate (CER) reduction compared to models trained only on CHiME-4. Furthermore, we use the test sets of Aurora-4 to evaluate on unseen noisy conditions. Our method achieves significantly lower CERs (11.3% relative on average) on all 14 Aurora-4 test sets than the conventional transfer learning method (no learning rate re-scaling for any layer), indicating that our method enables the model to learn noise-invariant features.
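The layer-wise learning rate scheme described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the layer names, the number of layers treated as the classifier, the base learning rate, and the scale factor are all hypothetical values chosen for illustration. The helper `relative_reduction` shows the usual formula behind a "relative CER reduction" figure, again with made-up inputs rather than results from the paper.

```python
from typing import Dict, List


def layer_learning_rates(
    layer_names: List[str],
    num_classifier_layers: int,
    base_lr: float,
    classifier_scale: float,
) -> Dict[str, float]:
    """Assign a learning rate to each layer, ordered from input to output.

    The last `num_classifier_layers` layers stand in for the "clean
    classifier": their rate is base_lr * classifier_scale, so a scale of
    0.0 freezes them. All earlier layers form the feature extractor and
    keep base_lr unchanged (no re-scaling).
    """
    if not 0 <= num_classifier_layers <= len(layer_names):
        raise ValueError("num_classifier_layers is out of range")
    if classifier_scale < 0:
        raise ValueError("classifier_scale must be non-negative")
    split = len(layer_names) - num_classifier_layers
    rates: Dict[str, float] = {}
    for index, name in enumerate(layer_names):
        scale = classifier_scale if index >= split else 1.0
        rates[name] = base_lr * scale
    return rates


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Hypothetical 4-layer network; the top 2 layers act as the classifier.
    layers = ["layer1", "layer2", "layer3", "layer4"]
    frozen = layer_learning_rates(layers, 2, base_lr=1e-3, classifier_scale=0.0)
    slow = layer_learning_rates(layers, 2, base_lr=1e-3, classifier_scale=0.1)
    print("frozen classifier:", frozen)
    print("small classifier learning rate:", slow)
    # Hypothetical CER values, not taken from the paper.
    print("relative reduction (%):", relative_reduction(20.0, 17.0))
```

In a deep learning framework, the same idea is typically expressed through per-parameter-group learning rates in the optimizer; setting the classifier scale to 1.0 in this sketch corresponds to the conventional transfer learning baseline, in which no layer is re-scaled.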
| Original language | English |
|---|---|
| Title of host publication | ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | Institute of Electrical and Electronics Engineers |
| Pages | 7024-7028 |
| Number of pages | 5 |
| ISBN (Electronic) | 978-1-5090-6631-5 |
| ISBN (Print) | 978-1-5090-6632-2 |
| Publication status | Published - 9 Apr 2020 |
| Event | 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain. Duration: 4 May 2020 → 8 May 2020. Conference number: 45 |
Publication series
| Name | |
|---|---|
| Publisher | IEEE |
| ISSN (Print) | 1520-6149 |
| ISSN (Electronic) | 2379-190X |
Conference
| Conference | 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing |
|---|---|
| Abbreviated title | ICASSP 2020 |
| Country/Territory | Spain |
| City | Barcelona |
| Period | 4/05/20 → 8/05/20 |
Keywords / Materials (for Non-textual outputs)
- end-to-end
- robust speech recognition
- transfer learning