Person tracking via audio and video fusion

Eleonora D'Arca, Neil M Robertson, James Hopgood

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this paper we present a joint audio-video (AV) tracker that tracks the active source between two freely moving persons who speak in turn, simulating a meeting scenario with fewer constraints. Our tracker differs from existing work in that it requires only a small number of sensors, works when the speaker is not close to the sensors, and relies on simple yet efficient inference techniques in AV processing. The system uses audio and video measurements of the target position on the ground plane to strengthen the single-modality predictions, which would be weak on their own because occlusions, clutter, reverberation and speech pauses occur in the test environment. In particular, the inter-microphone signal delays and the target image locations are input to single-modality Bayesian filters, whose proposed likelihoods are multiplied in a Kalman filter to give the final joint AV estimate. Despite the low complexity of the system, results show that the multi-modal tracker does not fail, tolerating video occlusion and intermittent speech (to within 50 cm accuracy) in a non-meeting scenario. The system is evaluated on both single-modality and multi-modality tracking, and the performance improvement given by AV fusion is discussed and quantified, i.e. a 24% improvement on the audio tracker accuracy.
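The fusion step described in the abstract, where single-modality likelihoods are multiplied to give a joint estimate, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each modality reports a Gaussian estimate of the ground-plane position (a 2-D mean and a 2x2 covariance), and it computes the normalised product of the two Gaussians, which is the static form of a Kalman measurement update. The function names (`inv2`, `add2`, `matvec2`, `fuse_gaussians`) and all numbers are hypothetical.

```python
# Illustrative sketch only: product of two 2-D Gaussian position estimates.
# Names and numbers are assumptions, not taken from the paper.

def inv2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular 2x2 matrix")
    return ((d / det, -b / det), (-c / det, a / det))


def add2(m, n):
    """Element-wise sum of two 2x2 matrices."""
    return tuple(tuple(m[i][j] + n[i][j] for j in range(2)) for i in range(2))


def matvec2(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return tuple(m[i][0] * v[0] + m[i][1] * v[1] for i in range(2))


def fuse_gaussians(mean_a, cov_a, mean_v, cov_v):
    """Fuse an audio and a video Gaussian estimate by multiplying them.

    The product of two Gaussians is (up to scale) a Gaussian whose inverse
    covariance is the sum of the inverse covariances, and whose mean is the
    inverse-covariance-weighted average of the two means.
    """
    info_a = inv2(cov_a)  # audio information matrix
    info_v = inv2(cov_v)  # video information matrix
    cov = inv2(add2(info_a, info_v))
    ia = matvec2(info_a, mean_a)
    iv = matvec2(info_v, mean_v)
    mean = matvec2(cov, (ia[0] + iv[0], ia[1] + iv[1]))
    return mean, cov


if __name__ == "__main__":
    # Hypothetical ground-plane estimates in metres: audio is assumed noisier.
    audio_mean, audio_cov = (2.0, 3.0), ((0.25, 0.0), (0.0, 0.25))
    video_mean, video_cov = (2.4, 3.2), ((0.05, 0.0), (0.0, 0.05))
    fused_mean, fused_cov = fuse_gaussians(audio_mean, audio_cov,
                                           video_mean, video_cov)
    print("fused mean:", fused_mean)
    print("fused covariance:", fused_cov)
```

In this sketch the fused mean lies closer to whichever modality has the smaller covariance, and the fused covariance is smaller than either input. That is one way to see why a joint estimate can stay usable when one modality degrades, for example during a video occlusion or a speech pause, provided the degraded modality reports a suitably large covariance.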
Original language: English
Title of host publication: 9th IET Data Fusion and Target Tracking Conference
Subtitle of host publication: Algorithms & Applications
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
ISBN (Electronic): 978-1-84919-624-6
Publication status: Published - 2 Aug 2012
Event: 9th IET Data Fusion & Target Tracking Conference: Algorithms & Applications - London, United Kingdom
Duration: 16 May 2012 to 17 May 2012


Conference: 9th IET Data Fusion & Target Tracking Conference
Country/Territory: United Kingdom