Source-filter Separation of Speech Signal in the Phase Domain

  • Loweimi, E. (Speaker)
  • Jon Barker (Speaker)
  • Thomas Hain (Speaker)

Activity: Academic talk or presentation types: Oral presentation


Deconvolution of the speech excitation (source) and vocal tract (filter) components through log-magnitude spectral processing is well-established and has led to the well-known cepstral features used in a multitude of speech processing tasks. This paper presents a novel source-filter decomposition based on processing in the phase domain. The phase spectrum is rarely used in mainstream speech processing. There are three major reasons for its neglect. First, it is generally considered to contain little perceptual information. Second, the phase wrapping phenomenon renders the shape of the phase spectrum chaotic and noise-like. The wrapped spectrum lacks the meaningful trends or extremum points that are helpful when developing a model. Third, it has been shown that the speech phase spectrum is only informative when the speech signal is decomposed into long frames (e.g., 500 ms). This is problematic because using long frames violates the quasi-stationarity assumption that is the key motivation for frame-based speech signal processing.
Recent studies provide new evidence for the perceptual importance of the phase spectrum. These studies generally incorporate information from the phase spectrum into an existing magnitude-spectrum-based enhancement algorithm and, on observing some improvement in the intelligibility or quality of the output signal, conclude that phase is of some perceptual significance. So far, however, no model has been provided that shows how information is encoded in the phase spectrum. This paper provides such an account in the form of a novel phase-based source-filter model.
We show that separation between source and filter in the log-magnitude spectra is not perfect, leading to partial loss of vocal tract information. It is demonstrated that the same task can be better performed by trend and fluctuation analysis of the phase spectrum of the minimum-phase component of speech, which can be computed via the Hilbert transform. Trend and fluctuation can be separated through low-pass filtering of the phase spectrum, exploiting the additivity of the vocal tract and source responses in the phase domain. This results in separated signals which have a clear relation to the vocal tract and excitation components. The effectiveness of this approach to speech modelling is tested using a speech recognition task. The vocal tract component extracted in this way is employed as the basis of a feature extraction algorithm for speech recognition on the Aurora-2 database. The recognition results show up to 8.5% absolute improvement, averaged over the 0-20 dB SNR conditions, in comparison with MFCC features.
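The processing chain described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the FFT size, and the Hann-shaped smoothing kernel are all assumptions made for illustration, since the abstract only states that the minimum-phase phase is obtained via the Hilbert transform and that the trend is extracted by low-pass filtering. The sketch uses the standard cepstral folding method, which is equivalent to applying the Hilbert transform to the log-magnitude spectrum.

```python
import numpy as np


def minimum_phase_spectrum_phase(frame, n_fft=1024, floor=1e-10):
    """Phase spectrum of the minimum-phase component of one frame.

    Hypothetical helper: uses cepstral folding, which is equivalent to
    taking the Hilbert transform of the log-magnitude spectrum.
    n_fft is assumed to be even.
    """
    mag = np.abs(np.fft.fft(frame, n_fft))
    log_mag = np.log(np.maximum(mag, floor))  # floor avoids log(0)
    cep = np.fft.ifft(log_mag).real           # real cepstrum (even sequence)

    # Fold the anti-causal half onto the causal half.
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0

    log_min = np.fft.fft(cep * fold)          # log of minimum-phase spectrum
    # The imaginary part is the phase, already free of wrapping.
    return log_min.imag[: n_fft // 2 + 1]


def split_trend_fluctuation(phase, width=31):
    """Split a phase spectrum into trend and fluctuation along frequency.

    Assumption: the low-pass filter is a normalised Hann window of odd
    length `width`; the abstract does not specify the filter.
    """
    kernel = np.hanning(width)
    kernel /= kernel.sum()
    pad = width // 2
    padded = np.pad(phase, pad, mode="reflect")  # limit edge effects
    trend = np.convolve(padded, kernel, mode="valid")
    fluctuation = phase - trend
    return trend, fluctuation


# Usage on a synthetic frame (random data, for shape checking only).
rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hamming(400)
phase = minimum_phase_spectrum_phase(frame)
trend, fluctuation = split_trend_fluctuation(phase)
```

Under the additivity property stated above, the trend would be associated with the vocal tract component and the fluctuation with the excitation component. The kernel width controls where that split falls and would need tuning on real speech.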
Period: 3 Jul 2015
Event title: Fifth Speech Conference of UK and Ireland
Event type: Conference
Location: Norwich, United Kingdom
Degree of Recognition: International