A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge

Daniel Renshaw, Herman Kamper, Aren Jansen, Sharon Goldwater

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The success of supervised deep neural networks (DNNs) in speech recognition cannot be transferred to zero-resource languages where the requisite transcriptions are unavailable. We investigate unsupervised neural-network-based methods for learning frame-level representations. Good frame representations eliminate differences in accent, gender, channel characteristics, and other factors to model subword units for within- and across-speaker phonetic discrimination. We enhance the correspondence autoencoder (cAE) and show that it can transform Mel Frequency Cepstral Coefficients (MFCCs) into more effective frame representations given a set of matched word pairs from an unsupervised term discovery (UTD) system. The cAE combines the feature extraction power of autoencoders with the weak supervision signal from UTD pairs to better approximate the extrinsic task’s objective during training. We use the Zero Resource Speech Challenge’s minimal triphone pair ABX discrimination task to evaluate our methods. Optimizing a cAE architecture on English and applying it to a zero-resource language, Xitsonga, we obtain a relative error rate reduction of 35% compared to the original MFCCs. We also show that Xitsonga frame representations extracted from the bottleneck layer of a supervised DNN trained on English can be further enhanced by the cAE, yielding a relative error rate reduction of 39%.
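The core idea of the cAE described above is that, instead of reconstructing its own input like a standard autoencoder, the network is trained to reconstruct the DTW-aligned partner frame from a discovered word pair, so the hidden layer learns to discard speaker- and channel-specific detail. The following is a minimal sketch of one such training step, assuming a single tanh hidden layer, 39-dimensional MFCC input frames, squared-error loss, and plain SGD; the dimensions, architecture depth, and hyperparameters here are illustrative and not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 39, 13  # e.g. 39-dim MFCCs, small bottleneck (illustrative sizes)

# One-hidden-layer network: input frame -> hidden representation -> reconstruction.
W1 = rng.normal(0, 0.1, (d_hid, d_in))
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_in, d_hid))
b2 = np.zeros(d_in)

def forward(x):
    """Return the hidden frame representation and the reconstruction."""
    h = np.tanh(W1 @ x + b1)
    return h, W2 @ h + b2

def cae_step(x, y, lr=0.01):
    """One SGD step on a correspondence pair: predict y (the DTW-aligned
    partner frame) from x, rather than reconstructing x itself."""
    global W1, b1, W2, b2
    h, y_hat = forward(x)
    err = y_hat - y                  # gradient of 0.5*||y_hat - y||^2 w.r.t. y_hat
    dW2 = np.outer(err, h)
    db2 = err
    dh = (W2.T @ err) * (1 - h**2)   # backprop through tanh
    dW1 = np.outer(dh, x)
    db1 = dh
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return 0.5 * np.sum(err**2)

# Toy correspondence pair: two noisy realisations of the same underlying frame,
# standing in for aligned frames from two UTD-discovered instances of a word.
base = rng.normal(size=d_in)
x = base + 0.1 * rng.normal(size=d_in)
y = base + 0.1 * rng.normal(size=d_in)
losses = [cae_step(x, y) for _ in range(200)]
```

After training on many such pairs, the hidden activation `h` would be used as the new frame representation fed to the ABX evaluation; in this sketch the loss on the toy pair simply decreases over the 200 steps.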
Original language: English
Title of host publication: INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association
Publisher: International Speech Communication Association
Pages: 3199-3203
Number of pages: 5
Publication status: Published - Sep 2015
