Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

Spandana Gella, Maria Lapata, Frank Keller

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce VerSe, a new data set that augments existing multimodal data sets (COCO and TUHOI) with sense labels. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, or multimodal embeddings. We find that textual embeddings perform well when gold standard textual annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. We also verify our findings by using the textual and multimodal embeddings as features in a supervised setting and analyse performance on the visual sense disambiguation task. VerSe is publicly available and can be downloaded at: https://github.com/spandanagella/verse.
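
The embedding-based, Lesk-style idea in the abstract can be illustrated with a minimal sketch. It assumes each candidate verb sense and each image are already represented as fixed-length vectors, and it picks the sense whose vector is most similar to the image vector. The use of cosine similarity, the function names (`cosine`, `disambiguate`), the sense labels, and the toy vectors are all illustrative assumptions. They are not taken from the paper or from the VerSe data set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(image_vec, sense_vecs):
    """Return the sense label whose vector is closest to the image vector.

    image_vec:  hypothetical embedding of the image (textual, visual, or multimodal).
    sense_vecs: dict mapping a sense label to a hypothetical embedding of that sense.
    """
    return max(sense_vecs, key=lambda sense: cosine(image_vec, sense_vecs[sense]))


# Toy, made-up vectors for two hypothetical senses of the verb "ride".
senses = {
    "ride_vehicle": [0.9, 0.1, 0.0],
    "ride_animal": [0.1, 0.9, 0.2],
}
image = [0.2, 0.8, 0.1]  # made-up embedding of an image

best = disambiguate(image, senses)
print(best)
```

Because no sense-labelled training images are involved in this selection step, the sketch is unsupervised in the same broad sense as the abstract describes; how the sense and image vectors are actually built is not specified here.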
Original language: English
Title of host publication: Proceedings of NAACL-HLT 2016
Publisher: Association for Computational Linguistics
Pages: 182-192
Number of pages: 11
ISBN (Print): 978-1-941643-91-4
Publication status: Published - Jun 2016
Event: 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - San Diego, United States
Duration: 12 Jun 2016 to 17 Jun 2016
http://naacl.org/naacl-hlt-2016/

Conference

Conference: 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Abbreviated title: NAACL HLT 2016
Country/Territory: United States
City: San Diego
Period: 12/06/16 to 17/06/16
