Abstract / Description of output
We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different ways.
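The record gives no implementation detail, but the general idea of using a whole-sentence description to guide attention over candidate object tracks can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: every name here (`Track`, `sentence_log_score`, `best_assignment`) and the exhaustive-search strategy are assumptions made purely for the example; the sentence is factored into per-word scores over the tracks assigned to its participants, and the assignment that best explains the sentence is selected.

```python
# Hypothetical sketch of sentence-guided focus of attention (not the authors'
# implementation): a sentence is factored into per-word scoring functions over
# the tracks of its participants, and we pick the participant-to-track
# assignment with the highest total score.
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Track = List[dict]                 # one detection (e.g. box, label score) per frame
WordScore = Callable[..., float]   # maps one or more tracks to a log-score
WordFactor = Tuple[WordScore, Sequence[str]]  # (scoring function, participant roles it reads)


def sentence_log_score(word_factors: Sequence[WordFactor],
                       assignment: Dict[str, Track]) -> float:
    """Sum the per-word log-scores under a given participant-to-track assignment."""
    total = 0.0
    for score_fn, roles in word_factors:
        total += score_fn(*(assignment[r] for r in roles))
    return total


def best_assignment(participants: Sequence[str],
                    candidate_tracks: Sequence[Track],
                    word_factors: Sequence[WordFactor]) -> Dict[str, Track]:
    """Exhaustively try assignments of participants to tracks (toy-sized problems only)."""
    best, best_score = {}, float("-inf")
    for perm in permutations(candidate_tracks, len(participants)):
        assignment = dict(zip(participants, perm))
        score = sentence_log_score(word_factors, assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best
```

For a description such as "the person approached the red chair", `word_factors` could pair a "person" detector score with that participant's track, a "red" colour score with the chair's track, and an "approached" score that looks jointly at the changing distance between both tracks; reusing the same machinery with different inputs is what the abstract means by performing description generation and query-based search with one framework.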
Original language | English |
---|---|
Title of host publication | 2014 IEEE Conference on Computer Vision and Pattern Recognition |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 732-739 |
Number of pages | 8 |
ISBN (Electronic) | 978-1-4799-5118-5 |
DOIs | |
Publication status | Published - 25 Sept 2014 |
Event | 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, United States. Duration: 24 Jun 2014 → 27 Jun 2014. http://www.pamitc.org/cvpr14/ |
Publication series
Name | |
---|---|
Publisher | IEEE |
ISSN (Print) | 1063-6919 |
ISSN (Electronic) | 1063-6919 |
Conference
Conference | 2014 IEEE Conference on Computer Vision and Pattern Recognition |
---|---|
Abbreviated title | CVPR 2014 |
Country/Territory | United States |
City | Columbus |
Period | 24/06/14 → 27/06/14 |
Internet address | http://www.pamitc.org/cvpr14/ |