Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arakadiusz Stopczynski, Cordelia Schmid, Zhinghua Xi, Caroline Pantofaru

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
Original languageEnglish
Title of host publicationICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Pages4492-4496
Number of pages5
ISBN (Electronic)978-1-5090-6631-5
ISBN (Print)978-1-5090-6632-2
DOIs
Publication statusPublished - 9 Apr 2020
Event2020 IEEE International Conference on Acoustics, Speech, and Signal Processing - Barcelona, Spain
Duration: 4 May 20208 May 2020
Conference number: 45

Publication series

NameIEEE
ISSN (Print)1520-6149
ISSN (Electronic)2379-190X

Conference

Conference2020 IEEE International Conference on Acoustics, Speech, and Signal Processing
Abbreviated titleICASSP 2020
CountrySpain
CityBarcelona
Period4/05/208/05/20

Keywords

  • multimodal
  • audio-visual
  • active speaker detection
  • dataset

Fingerprint Dive into the research topics of 'Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection'. Together they form a unique fingerprint.

Cite this