This paper describes the Trento Universal Human Object Interaction dataset, TUHOI, which is dedicated to human-object interactions in images. Recognizing human actions is an important yet challenging task. Most available datasets in this field are limited in the number of actions and objects they cover. A large dataset with varied actions and human-object interactions is needed for training and evaluating complex, robust human action recognition systems, especially systems that combine knowledge learned from language and vision. We introduce an image collection with more than two thousand actions, annotated through crowdsourcing. We review publicly available datasets, describe the annotation process of our image collection, and report statistics of the dataset. Finally, we present experimental results on the dataset, including human action recognition based on objects and an analysis of the relation between human-object positions in images and prepositions in language.
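The abstract mentions relating human-object positions in images to prepositions in language. A minimal sketch of how such a mapping might look, using simple bounding-box heuristics; the function name, box format, and thresholds are illustrative assumptions, not the paper's actual method:

```python
# Illustrative sketch (not the paper's method): map the spatial relation
# between a person box and an object box to a preposition-like label.
# Boxes are (x_min, y_min, x_max, y_max) in image coordinates, with the
# y-axis pointing down, as is common for image data.

def spatial_preposition(person, obj):
    py = (person[1] + person[3]) / 2  # person center y
    oy = (obj[1] + obj[3]) / 2        # object center y

    # Horizontal overlap of the two boxes (positive if they overlap in x)
    overlap_x = min(person[2], obj[2]) - max(person[0], obj[0])

    if overlap_x > 0:
        # Vertically aligned: decide above vs. below by center height
        return "above" if oy < py else "below"
    return "next to"

# Example with hypothetical boxes: an object held over a person's head
print(spatial_preposition((100, 100, 200, 300), (120, 20, 180, 80)))  # → above
```

A real analysis would need finer-grained relations (e.g. "on", "in", "under") and would compare such geometric labels against the prepositions annotators actually use.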
Title of host publication: Proceedings of the Third Workshop on Vision and Language
Place of Publication: Dublin, Ireland
Publisher: Dublin City University and the Association for Computational Linguistics
Number of pages: 8
Publication status: Published - 1 Aug 2014