Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The amount of image data available on the web is growing rapidly: on Facebook alone, 350 million new images are uploaded every day. Making sense of this data requires new ways of efficiently indexing, annotating, and querying such enormous collections. Research in computer vision has tackled this problem by developing algorithms for localizing and labeling objects in images. Object classification algorithms have been recently scaled up to work on thousands of object classes based on the ImageNet database.

The next frontier in analyzing images is to go beyond classifying objects: to truly understand a visual scene, we need to identify how the objects in that scene relate to each other, which actions and events they are involved in, and ultimately recognize the intentions of the actors depicted in the scene. The key to achieving this goal is to develop methods for parsing images into structured representations. A number of approaches have recently been proposed in the literature, including Visual Dependency Representations, Scene Graphs, and Scene Description Graphs. All of these models represent an image as a structured collection of objects, attributes, and relations between them.

In this presentation, we will focus on Visual Dependency Representations (VDRs), the only approach to image structure that is explicitly multimodal. VDRs start from the observation that images typically do not exist in isolation, but co-occur with textual data such as comments, captions, or tags; well-established techniques exist for extracting structure from such textual data. The VDR model exploits this observation by positing an image structure that links objects through geometric relations. Text accompanying the image can be parsed into a syntactic dependency graph, and the two representations are aligned, yielding a multimodal graph (see Figure 1). Well-established synchronous parsing techniques from machine translation can be applied to this task, and resulting VDRs are useful for image description and retrieval.

Parsing images into multimodal graph structures is an important step towards image understanding. However, for full understanding, representing the semantics of the image is also crucial. For example, the images in Figure 2 can all be described using the verb "play" (and presumably are assigned similar VDRs). However, a different meaning (verb sense) of "play" is evoked by each image. This has led to the new task of visual verb sense disambiguation: given an image and a verb, assign the correct sense of the verb, i.e., the one that corresponds to the action depicted in the image. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, and multimodal embeddings. In this presentation, we will discuss how the two tasks of VDR parsing and visual verb disambiguation can be combined to yield more complete syntactico-semantic image representations, which can then underpin applications such as image retrieval, image description, and visual question answering.
Original languageEnglish
Title of host publicationProceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion
Place of PublicationNew York, NY, USA
PublisherACM
Pages35-36
Number of pages2
ISBN (Print)978-1-4503-4519-4
DOIs
Publication statusPublished - 16 Oct 2016
Event2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion - Amsterdam, Netherlands
Duration: 15 Oct 201619 Oct 2016
http://www.acmmm.org/2016/

Publication series

NameiV38;L-MM '16
PublisherACM

Conference

Conference2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion
Abbreviated titleACM MULTIMEDIA 2016
CountryNetherlands
CityAmsterdam
Period15/10/1619/10/16
Internet address

Fingerprint Dive into the research topics of 'Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings'. Together they form a unique fingerprint.

Cite this