Enhancing localization and tracking via cross-modal vision and wireless association
Description
Title: Enhancing localization and tracking via cross-modal vision and wireless association
Date Created: 2023
Other Date: 2023-01 (degree)
Extent: 128 pages : illustrations
Description: Localization and tracking of pedestrians, with proper user consent, can result in safer environments and better communication with authorities. In deep-learning-based unimodal applications, two schemes for person localization are considered: the vision modality and the wireless modality. In vision-based localization and recognition, pedestrians are tracked via bounding boxes drawn over a set of video frames with an assigned ID value. However, this approach can be unreliable in maintaining a consistent ID value for a detection after it exits and re-enters the camera view, due to occlusion, lighting changes, or outfit changes. Wireless-based localization and ranging, on the other hand, can provide a steady ID value from the user's smart device when communicating with a WiFi access point, and does not suffer from the re-identification problem. Despite this, its accuracy may degrade depending on the number of access points deployed in the region, signal interference within the vicinity, and drift errors accumulated by the smart device over time. In short, vision-based approaches struggle to supply consistent identification, while wireless-based approaches lack spatial awareness without a complex setup. We investigate a multimodal approach that combines the two schemes in order to compensate for each modality's weaknesses.
A sufficient fusion of depth and camera measurements with wireless sensor data can enable stronger localization of individuals in a scene, improve re-identification when an individual reappears in the camera field of view, and provide the means for assigning an identification label to the individual with explicit user consent. The cross-modal association of the vision and wireless modalities results from mapping the two domains into a common latent-space representation via deep learning, then exploiting proximity within that representation to match samples across the two.
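The matching-by-proximity idea above can be illustrated with a minimal sketch. The embeddings here are hard-coded and all names (`track_0`, `phone_A`, etc.) are hypothetical; in the actual work they would be produced by learned encoders mapping depth crops and wireless traces into a shared latent space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical pre-computed latent embeddings (illustration only).
vision_embeddings = {"track_0": [0.9, 0.1, 0.2], "track_1": [0.1, 0.8, 0.3]}
wireless_embeddings = {"phone_A": [0.85, 0.15, 0.25], "phone_B": [0.05, 0.9, 0.2]}

def associate(vision, wireless):
    """Greedy one-to-one association by highest cosine similarity."""
    pairs, taken = {}, set()
    for v_id, v_emb in vision.items():
        best = max((w for w in wireless if w not in taken),
                   key=lambda w: cosine(v_emb, wireless[w]))
        pairs[v_id] = best
        taken.add(best)
    return pairs

print(associate(vision_embeddings, wireless_embeddings))
# {'track_0': 'phone_A', 'track_1': 'phone_B'}
```

A greedy pass suffices for this toy example; a trained system would optimize the pairing jointly rather than per-track.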
In this work, we outline three approaches to deployable vision-and-wireless multimodal localization in real, complex environments. Unlike prior methods, we require only one WiFi access point, are invariant to ambient-light and clothing-color changes, and minimize latency to operate in real time. We first study ViFiCon, a self-supervised approach to vision and wireless association built on a novel joint banded-image representation of depth and wireless ranging information, which requires no hand-labeled or ground-truth data. ViFiCon models the association task in two steps: a pretext, global synchronization that learns a scene-wide matching between a group of pedestrians' vision and wireless data, and, without any further training, a downstream one-to-one vision-to-wireless matching for an individual. ViFiCon achieves an 84.77% association accuracy using only depth and FTM features. We then consider two fully supervised online association approaches, Vi-Fi and ViTag. Vi-Fi models the association problem as learning an affinity matrix across the two modalities, while ViTag learns a translation from the vision modality to the wireless modality to perform the matching. With IMU, FTM, depth, and bounding-box data supplied, Vi-Fi achieves an average association accuracy of 84.97%, while ViTag achieves 87.85% with the same features.
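An affinity-matrix formulation like the one Vi-Fi uses reduces, at inference time, to a one-to-one assignment over predicted scores. The sketch below uses a hypothetical hard-coded matrix and an exhaustive search over permutations, which is tractable only for small n; it stands in for a proper assignment solver such as the Hungarian algorithm and is not the thesis's actual inference code.

```python
from itertools import permutations

# Hypothetical 3x3 affinity matrix: rows are vision detections,
# columns are wireless identities; entries are association scores.
affinity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.3, 0.7],
]

def best_assignment(aff):
    """Exhaustive optimal one-to-one assignment (fine for small n;
    the Hungarian algorithm would replace this at scale)."""
    n = len(aff)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(aff[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

assignment, score = best_assignment(affinity)
print(assignment)  # [0, 1, 2]
```

Here detection i is matched to wireless identity assignment[i], maximizing the total affinity.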
To facilitate the study of these approaches, we create a comprehensive joint-modality vision and wireless dataset across a number of outdoor locations and an indoor office facility. We collect video frames and depth data for visual information, along with inertial measurement unit (IMU) sensor data and wireless ranging information from WiFi fine time measurements (FTM), over a set of 98 three-minute sequences. We outline the dataset collection process, as well as the creation and results of the models.
Note: M.S.
Note: Includes bibliographical references
Genre: theses
Language: English
Collection: School of Graduate Studies Electronic Theses and Dissertations
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.