Description

Scene graph parsing aims at understanding an image as a graph whose vertices are visual objects (potentially with attributes) and whose edges are visual relationships among those objects. The task is commonly seen as an extension of object detection: whereas object detection recognizes objects individually, scene graph parsing additionally requires recognizing the relationships between object pairs. Scene graphs are therefore usually regarded as a richer semantic representation of images for visual reasoning.

In this thesis we start with an inherent issue in scene graph parsing: the prohibitive quadratic complexity of relationship detection. We develop an efficient model that reduces this complexity from quadratic to quasi-linear and show clear improvements over intuitive and strong baselines. We then address two salient issues that naturally arise in scene graphs: ambiguity in the language dimension and ambiguity in the visual dimension. The first occurs when the vocabularies of objects and relationships are large; the second occurs when multiple vertices or edges of a scene graph belong to the same category, making it difficult for the model to recognize the correct relational pairings. We propose two models that tackle these problems separately: the first uses learnable embeddings to handle ambiguity in the language dimension, while the second adds three types of losses designed to teach the model to discriminate correct instances from confusing, hard negative instances.

Finally, given an accurately parsed scene graph, we discuss using scene graphs as a richer feature representation and deeper knowledge of the input visual signals for better visual-semantic cross-modal reasoning. We design and develop a model that follows this logic and apply it to the video story understanding task, where it achieves a clear advantage over strong baseline models. In summary, we claim that scene graphs can be obtained accurately and efficiently by our models, and that we can build a sophisticated system that employs scene graphs for more explicit and interpretable cross-modal understanding.
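To make the complexity argument concrete, the sketch below is not the model developed in the thesis; it is a minimal illustration, assuming a simple top-k pruning heuristic, of why naive relationship detection scales quadratically with the number of detected objects and how limiting the candidate partners per object reduces the number of pairs sent to the expensive relation classifier to k·n. The `score` function and the value of `k` are hypothetical placeholders, not the thesis's actual pruning criterion.

```python
from itertools import permutations


def naive_candidates(objects):
    """All ordered object pairs: n*(n-1) relationship candidates for n objects."""
    return list(permutations(range(len(objects)), 2))


def pruned_candidates(objects, score, k=5):
    """Keep only the k highest-scoring partners per subject object.

    Only k*n candidates survive to the (expensive) relation classifier.
    `score(i, j)` stands in for a cheap relatedness proxy (e.g. spatial
    proximity); a real implementation would also avoid scoring every pair
    exhaustively, which this toy version still does.
    """
    n = len(objects)
    candidates = []
    for i in range(n):
        partners = sorted((j for j in range(n) if j != i),
                          key=lambda j: score(i, j), reverse=True)[:k]
        candidates.extend((i, j) for j in partners)
    return candidates


# Toy example: 100 detected objects.
objects = list(range(100))
print(len(naive_candidates(objects)))                                  # 9900 pairs
print(len(pruned_candidates(objects, lambda i, j: -abs(i - j), k=5)))  # 500 pairs
```

With 100 detected objects, exhaustive pairing yields 9,900 candidate relationships, while keeping 5 candidates per object yields 500, which is the kind of reduction the quasi-linear model in the thesis is motivated by.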