Description

Scene graph parsing aims at understanding an image as a graph whose vertices are visual objects (potentially with attributes) and whose edges are visual relationships among those objects. The task is commonly seen as an extension of object detection: whereas object detection recognizes objects individually, scene graph parsing additionally requires recognizing the relationships between object pairs. Scene graphs are therefore usually regarded as a richer semantic representation of images for visual reasoning.

In this thesis we start with an inherent issue in scene graph parsing: the prohibitive quadratic complexity of relationship detection. We develop an efficient model that reduces this complexity from quadratic to quasi-linear and show clear improvements over intuitive and strong baselines. We then address two salient issues that naturally arise in scene graphs: ambiguity in the language dimension and ambiguity in the visual dimension. The first occurs when the vocabularies of objects and relationships are large; the second occurs when multiple vertices or edges of a scene graph belong to the same category, making it difficult for the model to recognize the correct relational pairings. We propose two models that tackle these problems separately: the first uses learnable embeddings to handle ambiguity in the language dimension, while the second adds three types of losses designed to teach the model to discriminate correct instances from confusing, hard negative instances.

Finally, given an accurately parsed scene graph, we discuss using scene graphs as a richer feature representation and deeper knowledge of the input visual signals for better visual-semantic cross-modal reasoning. We design and develop a model that follows this logic and apply it to the video story understanding task, where it achieves a clear advantage over strong baseline models. In summary, we claim that scene graphs can be obtained accurately and efficiently by our models, and that we can build a sophisticated system that employs scene graphs for more explicit and interpretable cross-modal understanding.
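To make the complexity argument concrete, the sketch below is not the model developed in the thesis; it is a minimal illustration, assuming a simple top-k pruning heuristic, of why naive relationship detection scales quadratically with the number of detected objects and how limiting the candidate partners per object reduces the number of pairs sent to the expensive relation classifier to k·n. The `score` function and the value of `k` are hypothetical placeholders, not the thesis's actual pruning criterion.

```python
from itertools import permutations


def naive_candidates(objects):
    """All ordered object pairs: n*(n-1) relationship candidates for n objects."""
    return list(permutations(range(len(objects)), 2))


def pruned_candidates(objects, score, k=5):
    """Keep only the k highest-scoring partners per subject object.

    Only k*n candidates survive to the (expensive) relation classifier.
    `score(i, j)` stands in for a cheap relatedness proxy (e.g. spatial
    proximity); a real implementation would also avoid scoring every pair
    exhaustively, which this toy version still does.
    """
    n = len(objects)
    candidates = []
    for i in range(n):
        partners = sorted((j for j in range(n) if j != i),
                          key=lambda j: score(i, j), reverse=True)[:k]
        candidates.extend((i, j) for j in partners)
    return candidates


# Toy example: 100 detected objects.
objects = list(range(100))
print(len(naive_candidates(objects)))                                  # 9900 pairs
print(len(pruned_candidates(objects, lambda i, j: -abs(i - j), k=5)))  # 500 pairs
```

With 100 detected objects, exhaustive pairing yields 9,900 candidate relationships, while keeping 5 candidates per object yields 500, which is the kind of reduction the quasi-linear model in the thesis is motivated by.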