Cluster Encoding for Modelling Temporal Variation in Video

N. Rostamzadeh, J. R. R. Uijlings, I. Mironica, M. K. Abadi, B. Ionescu, N. Sebe

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Classical Bag-of-Words methods represent videos by modeling the variation of local visual descriptors throughout the video. In doing so, they mix variation in time and space indiscriminately, even though these dimensions are fundamentally different. Therefore, in this paper we present a novel method for video representation which explicitly captures temporal variation. We do this by first creating frame-based features using standard Bag-of-Words techniques. To model the variation in time over these frame-based features, we introduce Hard and Soft Cluster Encoding, novel techniques to model variation inspired by the Fisher Kernel [1] and VLAD [2]. Results on the Rochester ADL [3] and Blip10k [4] datasets show that our method yields improvements of 6.6% and 7.4%, respectively, over our baselines. On Blip10k we outperform the state-of-the-art by 3.6% when using only visual features.
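The abstract names the two encodings but does not give their formulas, so the following is only a minimal sketch of what a VLAD-style encoding over frame-based features could look like, not the formulation used in the paper. It assumes each video is a matrix of per-frame Bag-of-Words vectors and that cluster centres over frame features have already been learned (for example with k-means). The function names `hard_cluster_encode` and `soft_cluster_encode`, the softmax weighting, and the parameter `beta` are hypothetical choices made for illustration.

```python
import numpy as np

def _sq_dists(frames, centers):
    # Squared Euclidean distance from every frame to every centre: shape (T, K).
    return ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def _l2_normalise(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def hard_cluster_encode(frames, centers):
    """VLAD-style sketch: each frame contributes its residual to its nearest centre.

    frames:  (T, D) array, one feature vector per frame (assumed input format).
    centers: (K, D) array of cluster centres over frame features.
    Returns a flattened, L2-normalised (K * D,) video descriptor.
    """
    nearest = _sq_dists(frames, centers).argmin(axis=1)
    enc = np.zeros_like(centers, dtype=float)
    for k in range(centers.shape[0]):
        members = frames[nearest == k]
        if len(members):
            enc[k] = (members - centers[k]).sum(axis=0)
    return _l2_normalise(enc.ravel())

def soft_cluster_encode(frames, centers, beta=1.0):
    """Soft variant (assumption): residuals weighted by a softmax over distances."""
    d2 = _sq_dists(frames, centers)
    # Subtract the row minimum before exponentiating for numerical stability.
    w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)
    # sum_t w[t, k] * (frames[t] - centers[k]) for every cluster k.
    enc = w.T @ frames - w.sum(axis=0)[:, None] * centers
    return _l2_normalise(enc.ravel())

# Toy usage: three 2-D "frame features" and two centres.
frames = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
hard = hard_cluster_encode(frames, centers)
soft = soft_cluster_encode(frames, centers, beta=0.01)
```

In this sketch both variants return one fixed-length vector per video regardless of the number of frames, which is what allows a standard classifier to be trained on top. With a large `beta` the soft weights approach a one-hot assignment, so the soft encoding approaches the hard one.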
Original language: English
Title of host publication: Image Processing (ICIP), 2015 IEEE International Conference on
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 3640-3644
Number of pages: 5
Publication status: Published - Sept 2015

