Seeing What You're Told: Sentence-Guided Activity Recognition in Video

N. Siddharth, Andrei Barbu, Jeffrey Mark Siskind

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different manners.
Original language: English
Title of host publication: 2014 IEEE Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers
Pages: 732-739
Number of pages: 8
ISBN (Electronic): 978-1-4799-5118-5
Publication status: Published - 25 Sept 2014
Event: 2014 IEEE Conference on Computer Vision and Pattern Recognition - Columbus, United States
Duration: 24 Jun 2014 to 27 Jun 2014
http://www.pamitc.org/cvpr14/

Publication series

Publisher: IEEE
ISSN (Print): 1063-6919
ISSN (Electronic): 1063-6919

Conference

Conference: 2014 IEEE Conference on Computer Vision and Pattern Recognition
Abbreviated title: CVPR 2014
Country/Territory: United States
City: Columbus
Period: 24/06/14 to 27/06/14