Edinburgh Research Explorer

Prof Simon King

Personal Chair of Speech Processing


Phone: 0131 651 1725

Willingness to take Ph.D. students: Yes

Research Interests

As an engineer, the way I learn how something works is to take it apart and put it back together again. The fundamental questions around which most of my research revolves are: What are the basic building blocks of speech? What is speech made of? How can we take speech apart and put it back together again? Will that tell us how it works? To answer these questions, I am working in a number of areas. Often, the research methodology involves solving a practical problem or building a useful application, such as a speech recogniser, synthesiser or a speech search engine.

In speech recognition, I have investigated new acoustic models, such as Linear Dynamical Models, factorial-HMMs and other graphical models that can represent speech not as 'beads on a string' but as streams of interacting factors. I have devised ways to automatically find an inventory of suitable sub-word units to model, and have worked on other alternatives to phonetic units, such as graphemes. This work has been applied to automatic speech recognition and to Spoken Term Detection, which is an underpinning technique required to realise speech search engines. One long-standing interest is the use of phonological/acoustic/articulatory features and articulatory measurement data as a tool to develop models of speech.

In speech synthesis, I work on both unit selection methods and HMM-based / DNN-based speech synthesis. In both of these areas, the definition of the unit of speech is crucial. Both typically use context-dependent phonemes or diphones. In this context, we can gain some insight into the basic building blocks of speech by asking: what contextual features must we model? In unit selection, this means learning the target cost; in HMM- or DNN-based speech synthesis, it relates to the clustering of acoustically similar units or the representation of linguistic context.

I am increasingly interested in perceptual measures in speech synthesis, not just for evaluation of the final output, but within the synthesis process itself. In unit selection, perceptual measures should be used to determine equivalent units or contexts, because acoustic similarity and perceptual interchangeability are not the same thing. In HMM-based speech synthesis, the training criterion should be perceptual: perhaps minimum generation error gives us a way to use such a criterion? How can the requirements of acoustic modelling fit with this idea of perceptual equivalence?

In both recognition and synthesis, I work on multilingual systems as an additional way to look at the basic units of speech. Is there a universal set of building blocks for speech, and can we build systems that use common model structures or unit inventories for multiple languages?

Building large complex systems, as is often the case in speech technology, requires a team. I have therefore built up and sustained a team of postdoctoral Research Fellows, supported by research grants, with whom I pursue these research interests.

So far, I have helped to define three new research areas. In the late 1990s, I pioneered the use of articulatory feature representations of speech for use in automatic speech recognition. This was an early attempt at deconstructing speech into a factorial representation. The article "Detection of phonological features in continuous speech using neural networks" remains one of the key references in this field and has been cited 69 times to date. The results reported in this article are still the 'ones to beat'.

Following on from the work on articulatory features, I started to work within the dynamic Bayesian network formalism, spending time at the University of Washington to collaborate with the leading groups in this area; my work led to me being invited to co-lead a summer workshop on this topic at Johns Hopkins University in 2006.

For many years, I was one of only a very small number of people around the world who worked in both automatic speech recognition and speech synthesis. Within the last few years, the number of people 'crossing over' from recognition to synthesis has exploded. I am a pioneer in this newly emerging field, "unified models for speech recognition and synthesis", which is already delivering exciting results. I was the Project Director and co-ordinator for EMIME, a high-profile, six-partner, 3 million euro project funded by EC FP7 that explored this area. I was then the Project Director and co-ordinator for the project Simple4All, which made significant advances in unsupervised learning for speech synthesis. This was followed by the five-year Natural Speech Technology project (a programme grant), in which we made advances in machine learning for the speech synthesis front end and in Deep Neural Networks. That work is now being directly applied in the SCRIPT project, working with the BBC World Service to deliver speech synthesis techniques that scale well across languages.

Visiting and Research Positions

  • 2010– Full professor, University of Edinburgh
  • 2015–16 Invited Visiting Professor, Universidad de Chile
  • 2015 Invited Visiting Professor, Aalto University, Helsinki, Finland
  • 2014 Invited Visiting Professor, Universidad Politecnica Madrid, Spain
  • 2013–14 Invited Visiting Professor, Universidad de Chile
  • 2012 Invited Visiting Professor, Universiti Teknologi Malaysia
  • 2010–2017 Visiting Associate Professor, Nagoya Institute of Technology, Japan
  • 2009 Invited Visiting Professor, Universiti Teknologi Malaysia
  • 2008 Invited Visitor, Nagoya Institute of Technology, Japan
  • 2008 Invited Visitor, Universidad de Chile
  • 2007–2010 Reader, University of Edinburgh
  • 2006 Invited senior researcher at Johns Hopkins University, USA
  • 2005–09 Advanced Research Fellow (Engineering and Physical Sciences Research Council)
  • 2004 Visiting researcher at University of Washington, USA
  • 2000–2007 Lecturer, University of Edinburgh
  • 1997–2000 Research Assistant / Research Fellow, Centre for Speech Technology Research, University of Edinburgh
  • 1996 Research Assistant, University of Bonn, Germany


Teaching

Past: Speech Processing 1 and Speech Processing 2; Speech Synthesis and Automatic Speech Recognition; Advanced Phonetics; Annual Good Academic Practice session for all incoming Linguistics Masters students.

Current: M.Sc. SLP programme; Speech Processing (undergraduate and postgraduate); Speech Synthesis (undergraduate and postgraduate)


Administrative Roles

Director of the Centre for Speech Technology Research

The Centre for Speech Technology Research (CSTR) is a world-leading research centre in the field of speech technology, including speech synthesis and automatic speech recognition. I became Director of CSTR in February 2011. The two main responsibilities of the Director are leadership of the group and sustaining grant income. CSTR varies in size, generally being around 20-35 people, and enjoys an enviable reputation as a welcoming, open and collaborative group. Maintaining this culture over the years, as both the student and Research Fellow members of CSTR gradually change, is a key part of our success. My approach to this, and to leadership in general, is simply one of leading by example.

The financial responsibilities of the Director of CSTR include maintaining a smooth funding profile, in order to support a stable group of 10-15 postdoctoral Research Fellows, all of whom are on soft money. CSTR also supports a number of administrative staff, who are mainly paid from soft money as well. The responsibility for bringing in research grant income is shared between Prof. Steve Renals and me: between us, we bring in the majority of funding to CSTR.

M.Sc. programme redesign, relaunch and direction

In 2001, I redesigned and relaunched the M.Sc. in Speech and Language Processing. At the time I took over this programme, it had become outdated and was examined solely on coursework. Many courses were taught solely for this M.Sc. I completely overhauled the course design, bringing it into line with the existing modular structure of the Informatics M.Sc. programme. This enabled the sharing of modules across the two programmes, decreased the teaching effort required from Linguistics, increased the number of option modules available to students, and brought Informatics M.Sc. students into Linguistics-taught modules.


Qualifications

  • 1998 University of Edinburgh, Ph.D.
  • 1993 University of Cambridge, M.Phil. (taught course)
  • 1992 University of Cambridge, B.A., First Class Honours

Research students

  • 2003 Joe Frankel
  • 2004 James Horlock
  • 2004 Jithendra Vepa
  • 2005 Olga Goubanova
  • 2005 Yoshinori Shiga
  • 2006 Alexander Gutkin
  • 2008 Fiona Couper Kenney
  • 2010 Dong Wang
  • 2010 Peter Bell
  • 2011 Partha Lal
  • 2012 Oliver Watts
  • 2013 Moses Ekpenyong (University of Uyo, Nigeria)
  • 2013 Cassia Valentini Botinhao
  • 2016 Mark Sinclair
  • 2016 Thomas Merritt
  • current Rasmus Dall
  • current Srikanth Ronanki
  • current Felipe Espic
  • current Avashna Govender
