Staff View
Complex scene understanding with minimal supervision

Descriptive

TitleInfo
Title
Complex scene understanding with minimal supervision
Name (type = personal)
NamePart (type = family)
Akiva
NamePart (type = given)
Peri
DisplayForm
Peri Akiva
Role
RoleTerm (authority = RULIB)
author
Name (type = personal)
NamePart (type = family)
Dana
NamePart (type = given)
Kristin
DisplayForm
Kristin J Dana
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
chair
Name (type = personal)
NamePart (type = family)
Zhang
NamePart (type = given)
Yuqian
DisplayForm
Yuqian Zhang
Affiliation
Advisory Committee
Role
RoleTerm (authority = local)
member
Name (type = personal)
NamePart (type = family)
Yuan
NamePart (type = given)
Bo
DisplayForm
Bo Yuan
Affiliation
Advisory Committee
Role
RoleTerm (authority = local)
member
Name (type = personal)
NamePart (type = family)
Roy
NamePart (type = given)
Aditi
DisplayForm
Aditi Roy
Affiliation
Advisory Committee
Role
RoleTerm (authority = local)
member
Name (type = personal)
NamePart (type = family)
Dana
NamePart (type = given)
Kristin J
DisplayForm
Kristin J Dana
Affiliation
Advisory Committee
Role
RoleTerm (authority = local)
member
Name (type = corporate)
NamePart
Rutgers University
Role
RoleTerm (authority = RULIB)
degree grantor
Name (type = corporate)
NamePart
School of Graduate Studies
Role
RoleTerm (authority = RULIB)
school
TypeOfResource
Text
Genre (authority = marcgt)
theses
OriginInfo
DateCreated (encoding = w3cdtf); (qualifier = exact); (keyDate = yes)
2023
DateOther (encoding = w3cdtf); (type = degree); (qualifier = exact)
2023-01
CopyrightDate (encoding = w3cdtf); (qualifier = exact)
2023
Language
LanguageTerm (authority = ISO 639-3:2007); (type = text)
English
Abstract (type = abstract)
Understanding complex scenes is a fundamental necessity for any artificial agent aiming to function properly under real-world conditions. Examples include autonomous vehicles that must navigate crowds or busy streets, and surveillance systems that must track and recognize individuals in crowds. In this work, we explore the challenges associated with applying deep learning methods to complex scenes while using minimal amounts of expensive, manually collected annotations. The goal of this research is to maximize the applicable range of scene complexities, namely for the tasks of semantic segmentation and counting, while minimizing annotation requirements, a priori constraints, and distributional assumptions. This thesis presents the gradual removal of these forms of supervision, moving from weak supervision, a priori constraints, and data distributional assumptions to no supervision, no a priori constraints, and no data distributional assumptions. Subsequent chapters define the types of supervision, annotation costs, scene complexities, and deep learning tasks we aim to address. In particular, we develop the following methods:

1. Triple-S Network: This work presents a deep learning method for simultaneous segmentation and counting of cranberries to aid in yield estimation and sun exposure prediction. Notably, supervision is done using low-cost center point annotations. The approach, named Triple-S Network, incorporates a three-part loss with shape priors to promote better fitting to objects of known shape typical in agricultural scenes. Our results improve overall segmentation performance by more than 6.74% and counting results by 22.91% when compared to state-of-the-art methods. To train and evaluate the network, we collected the CRanberry Aerial Imagery Dataset (CRAID), the largest dataset of aerial drone imagery from cranberry fields.

2. Pseudo-Masks from Points: This work presents an approach that generalizes to a wide range of dataset complexities and is trainable from scratch, without any dependence on pre-trained backbones, classification, or separate refinement tasks. We utilize point annotations to generate reliable, on-the-fly pseudo-masks through refined and spatially filtered features. While our method requires point annotations that are only slightly more expensive than image-level annotations, we are able to demonstrate state-of-the-art performance on benchmark datasets (PascalVOC 2012), as well as significantly outperform other state-of-the-art weakly supervised semantic segmentation methods on recent real-world datasets (CRAID, CityPersons, IAD, ADE20K, CityScapes), with up to 28.1% and 22.6% performance boosts compared to our single-stage and multi-stage baselines, respectively.
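[Editor's illustrative aside, not part of the thesis abstract.] The point-supervision idea described in items 1 and 2 can be sketched in a few lines of PyTorch: given a per-pixel feature map and a handful of annotated points, each pixel is assigned the class of its most similar annotated point above a cosine-similarity threshold. This is a minimal, generic sketch, not the thesis's actual Triple-S or pseudo-mask algorithm; the function name, the threshold `tau`, and the use of cosine similarity are assumptions.

import torch
import torch.nn.functional as F

def pseudo_masks_from_points(features, points, labels, tau=0.7):
    """
    features: (C, H, W) per-pixel embedding from any backbone
    points:   (N, 2) integer (row, col) coordinates of annotated points
    labels:   (N,) integer class id of each annotated point
    returns:  (H, W) pseudo-mask; 255 marks unassigned (ignored) pixels
    """
    C, H, W = features.shape
    feats = F.normalize(features, dim=0)              # compare in cosine-similarity space
    mask = torch.full((H, W), 255, dtype=torch.long)  # 255 = ignore index
    best_sim = torch.full((H, W), tau)                # only keep pixels above the threshold
    for (r, c), lbl in zip(points.tolist(), labels.tolist()):
        seed = feats[:, r, c]                         # (C,) feature at the annotated point
        sim = torch.einsum("chw,c->hw", feats, seed)  # similarity of every pixel to the seed
        update = sim > best_sim                       # closer than any previous assignment
        mask[update] = lbl
        best_sim[update] = sim[update]
    return mask

# Usage with random stand-ins for a backbone's features and two annotated points:
feats = torch.randn(64, 128, 128)
pts = torch.tensor([[10, 20], [100, 90]])
lbls = torch.tensor([1, 2])
pm = pseudo_masks_from_points(feats, pts, lbls)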
3. H2O-Net: This work presents a self-supervised deep learning method that segments floods from satellite and aerial imagery by bridging the domain gap between low- and high-latency satellite imagery and by coarse-to-fine label refinement. H2O-Net learns to synthesize signals highly correlated with water presence as a domain adaptation step for semantic segmentation in high-resolution satellite imagery. Our work also proposes a self-supervision mechanism, which does not require any hand annotation, used during training to generate high-quality ground truth data. We demonstrate that H2O-Net outperforms state-of-the-art semantic segmentation methods on satellite imagery by 10% and 12% in pixel accuracy and mIoU, respectively, for the task of flood segmentation. We emphasize the generalizability of our model by transferring model weights trained on satellite imagery to drone imagery, a highly different sensor and domain.

4. Material and Texture Representation Learning (MATTER): In this work, we present our material- and texture-based self-supervision method, MATTER (MATerial and TExture Representation Learning), which is inspired by classical material and texture methods. Material and texture can effectively describe any surface, including its tactile properties, color, and specularity. By extension, an effective representation of material and texture can describe other semantic classes strongly associated with that material and texture. MATTER leverages multi-temporal, spatially aligned remote sensing imagery over unchanged regions to learn invariance to illumination and viewing angle as a mechanism for achieving consistent material and texture representations. We show that our self-supervised pre-training method allows for up to 24.22% and 6.33% performance increases in unsupervised and fine-tuned setups, respectively, and up to 76% faster convergence on change detection, land cover classification, and semantic segmentation tasks.

5. Self-Supervised Object Detection from Egocentric Videos (DEVI): This work addresses the problem of self-supervised, class-agnostic object detection, which aims to locate all objects in a given view, regardless of category, without any annotations or pre-trained weights. Egocentric videos exhibit high scene complexity and irregular motion flows compared to typical video understanding tasks. Our method, self-supervised object detection from egocentric videos (DEVI), generalizes appearance-based methods to learn features that are category-specific and invariant to viewing angles and illumination conditions from highly ambiguous environments in an end-to-end manner. Our approach leverages typical human behavior and its egocentric perception to sample diverse views of the same objects for our multi-view and scale-regression loss functions. With our learned cluster residual module, we are able to effectively describe multi-category patches for better complex scene understanding. DEVI provides a boost in performance on recent egocentric datasets, with performance gains of up to 4.11% AP50, 0.11% AR1, 1.32% AR10, and 5.03% AR100, while significantly reducing model complexity. We also demonstrate competitive performance on out-of-domain datasets without additional training or fine-tuning.
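[Editor's illustrative aside, not part of the thesis abstract.] The view-invariance objective underlying the self-supervised methods in items 4 and 5, learning features that stay consistent across illumination and viewing-angle changes, can be approximated by a standard InfoNCE-style consistency loss between embeddings of the same regions observed in two different views. The sketch below is a generic stand-in, not the thesis's actual loss; the encoder, the batch pairing, and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def multi_view_consistency_loss(z_a, z_b, temperature=0.1):
    """
    z_a, z_b: (B, D) embeddings of the same B regions under two different views
              (e.g., different illumination or viewing angle).
    Returns a scalar InfoNCE loss encouraging view-invariant representations.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature      # (B, B) pairwise cosine similarities
    targets = torch.arange(z_a.size(0))       # the matching index is the positive pair
    return F.cross_entropy(logits, targets)

# Usage with random stand-ins for two views of the same 8 regions:
view_1 = torch.randn(8, 128)
view_2 = view_1 + 0.05 * torch.randn(8, 128)  # simulated illumination change
loss = multi_view_consistency_loss(view_1, view_2)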
Subject (authority = RUETD)
Topic
Artificial intelligence
Subject (authority = RUETD)
Topic
Computer engineering
Subject (authority = RUETD)
Topic
Computer science
Subject (authority = local)
Topic
Complex scene understanding
Subject (authority = local)
Topic
Computer vision
Subject (authority = local)
Topic
Semantic segmentation
Subject (authority = local)
Topic
Supervision minimization
Subject (authority = local)
Topic
Video understanding
Subject (authority = local)
Topic
Weak supervision
RelatedItem (type = host)
TitleInfo
Title
Rutgers University Electronic Theses and Dissertations
Identifier (type = RULIB)
ETD
Identifier
http://dissertations.umi.com/gsnb.rutgers:12276
PhysicalDescription
InternetMediaType
application/pdf
InternetMediaType
text/xml
Extent
225 pages : illustrations
Note (type = degree)
Ph.D.
Note (type = bibliography)
Includes bibliographical references
RelatedItem (type = host)
TitleInfo
Title
School of Graduate Studies Electronic Theses and Dissertations
Identifier (type = local)
rucore10001600001
Location
PhysicalLocation (authority = marcorg); (displayLabel = Rutgers, The State University of New Jersey)
NjNbRU
Identifier (type = doi)
doi:10.7282/t3-psah-3q22

Rights

RightsDeclaration (ID = rulibRdec0006)
The author owns the copyright to this work.
RightsHolder (type = personal)
Name
FamilyName
Akiva
GivenName
Peri
Role
Copyright holder
RightsEvent
Type
Permission or license
DateTime (encoding = w3cdtf); (qualifier = exact); (point = start)
2023-02-23T11:59:01
AssociatedEntity
Name
Peri Akiva
Role
Copyright holder
Affiliation
Rutgers University. School of Graduate Studies
AssociatedObject
Type
License
Name
Author Agreement License
Detail
I hereby grant to the Rutgers University Libraries and to my school the non-exclusive right to archive, reproduce and distribute my thesis or dissertation, in whole or in part, and/or my abstract, in whole or in part, in and from an electronic format, subject to the release date subsequently stipulated in this submittal form and approved by my school. I represent and stipulate that the thesis or dissertation and its abstract are my original work, that they do not infringe or violate any rights of others, and that I make these grants as the sole owner of the rights to my thesis or dissertation and its abstract. I represent that I have obtained written permissions, when necessary, from the owner(s) of each third party copyrighted matter to be included in my thesis or dissertation and will supply copies of such upon request by my school. I acknowledge that RU ETD and my school will not distribute my thesis or dissertation or its abstract if, in their reasonable judgment, they believe all such rights have not been secured. I acknowledge that I retain ownership rights to the copyright of my work. I also retain the right to use all or part of this thesis or dissertation in future works, such as articles or books.
Copyright
Status
Copyright protected
Availability
Status
Open
Reason
Permission or license

Technical

RULTechMD (ID = TECHNICAL1)
ContentModel
ETD
OperatingSystem (VERSION = 5.1)
windows xp
CreatingApplication
Version
1.5
DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)
2022-12-22T17:33:09
ApplicationName
pdfTeX-1.40.23