Phonetic segmentation of speech using STEP and t-SNE

Adriana Stan, Cassia Valentini-Botinhao, Mircea Giurgiu, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper introduces a first attempt to perform phoneme-level segmentation of speech based on a perceptual representation, the Spectro Temporal Excitation Pattern (STEP), and a dimensionality reduction technique, the t-Distributed Stochastic Neighbour Embedding (t-SNE). The method searches for the true phonetic boundaries in the vicinity of those produced by an HMM-based segmentation. It looks for perceptually salient spectral changes that occur at these phonetic transitions, and exploits t-SNE's ability to capture both the local and the global structure of the data. The method is intended for use with any language and is therefore not tailored to a particular dataset or language. Results show that this simple approach improves the segmentation accuracy of unvoiced phonemes by 4% within a 5 ms margin and by 5% within a 10 ms margin. For the voiced phonemes, however, accuracy drops slightly.
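
The idea of searching near an HMM boundary for a salient spectral change can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain per-frame feature vectors in place of STEP features, and it swaps the t-SNE embedding for a simple Euclidean distance between consecutive frames. The names `refine_boundary`, `frames`, `hmm_boundary`, and `radius` are hypothetical.

```python
from math import sqrt


def refine_boundary(frames, hmm_boundary, radius):
    """Return the frame index near `hmm_boundary` with the largest change.

    frames       -- list of equal-length feature vectors, one per frame
                    (assumption: stand-ins for STEP features)
    hmm_boundary -- frame index proposed by a prior (e.g. HMM) segmentation
    radius       -- number of frames to search on each side of the proposal

    The returned index is the first frame of the new segment. If no change
    is found inside the window, the original proposal is returned.
    """
    lo = max(1, hmm_boundary - radius)
    hi = min(len(frames) - 1, hmm_boundary + radius)

    best_idx, best_change = hmm_boundary, 0.0
    for i in range(lo, hi + 1):
        # Euclidean distance between frame i and the frame before it.
        change = sqrt(sum((a - b) ** 2 for a, b in zip(frames[i], frames[i - 1])))
        if change > best_change:
            best_idx, best_change = i, change
    return best_idx


# Toy data: six "silent" frames followed by six "active" frames.
frames = [[0.0, 0.0]] * 6 + [[1.0, 1.0]] * 6

# The prior segmentation places the boundary two frames too early.
print(refine_boundary(frames, hmm_boundary=4, radius=3))  # prints 6
```

Restricting the search to a window around the prior boundary keeps the refinement local, which matches the abstract's description of looking "in the vicinity" of the HMM output; the window size here is an arbitrary choice for the toy data.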

Original language: English
Title of host publication: 2015 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
Editors: Corneliu Burileanu, Corneliu Rusu, Horia-Nicolai Teodorescu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781467375603
Publication status: Published - 3 Dec 2015
Event: 8th International Conference on Speech Technology and Human-Computer Dialogue, Bucharest, Romania
Duration: 14 Oct 2015 to 17 Oct 2015
https://sped.pub.ro/archive/sped2015/index-1.html

Conference

Conference: 8th International Conference on Speech Technology and Human-Computer Dialogue
Abbreviated title: SpeD 2015
Country/Territory: Romania
City: Bucharest
Period: 14/10/15 to 17/10/15

Keywords

  • HMM acoustic model
  • k-Means
  • phonetic segmentation
  • STEP
  • t-SNE
