Visual and wireless sensing, two popular sensing modalities in multi-modal systems, have complementary characteristics. While vision sensors provide rich and accurate spatial measurements through RGB-D information, they have drawbacks such as a limited field of view, vulnerability to occlusion, and poor performance in low-illumination conditions. Wireless sensing, on the other hand, suffers less from appearance variation and can operate non-line-of-sight, but its ranging performance is degraded by multipath and shadowing in complex environments.
Sensor fusion and association in vision-wireless systems create a "reality-aware network" in which the strengths of each sensing modality complement the other. In this thesis we propose to fuse wireless communication with visual sensing to improve system sensing range, tracking, and localization. We design and implement sensor fusion and association mechanisms for systems including Vehicle-to-Vehicle (V2V) and Vehicle-to-Everything (V2X) communication, as well as localization and pedestrian tracking.
Advanced driver assistance systems benefit from a complete understanding of the traffic scene around a vehicle. Existing systems gather data through cameras and other on-board sensors, but scene understanding can be limited by sensing range or by occlusion from other objects. To gather information beyond the view of a single vehicle, we propose a connected vehicle system that allows multiple moving vehicles to share perception data over vehicle-to-vehicle communication and collaboratively fuse the data into a more complete traffic scene.
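As a minimal illustration of this kind of collaborative fusion, the sketch below (with hypothetical function names and message contents, not the exact V2V protocol developed in the thesis) transforms detections received from a remote vehicle into the ego vehicle's coordinate frame using the sender's shared pose, then merges non-duplicate detections into one scene list.

    import numpy as np

    def to_ego_frame(detections_xy, sender_pose):
        """Transform 2D detections from the sender's local frame into the ego frame.
        sender_pose = (x, y, heading) of the sender expressed in the ego frame (assumed shared over V2V)."""
        x, y, theta = sender_pose
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        return detections_xy @ R.T + np.array([x, y])

    def merge_scenes(ego_dets, remote_dets, radius=1.0):
        """Greedy merge: keep ego detections, add remote detections that do not
        duplicate an existing object within `radius` meters."""
        merged = list(ego_dets)
        for d in remote_dets:
            if all(np.linalg.norm(d - e) > radius for e in merged):
                merged.append(d)
        return np.array(merged)

    # Example: one remote detection duplicates an ego detection, one extends the scene.
    ego = np.array([[5.0, 1.0]])
    remote_local = np.array([[2.0, 0.0], [20.0, 3.0]])    # in the sender's frame
    remote_ego = to_ego_frame(remote_local, sender_pose=(3.0, 1.0, 0.0))
    scene = merge_scenes(ego, remote_ego)                 # -> 2 objects, not 3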
Beyond fusing vision data via wireless communication, associating vision data with wireless data is another fundamental need in multi-modal applications. Successful vision-wireless association enables use cases such as localization by fusing camera depth measurements with wireless ranging. It can also improve tracking and re-identification, since wireless transmitters provide a stable identifier. Existing approaches to visual-wireless data association rely on appearance-based fingerprinting, focus on controlled scenarios where participants are always visible and no passersby exist, or formulate optimization problems over long measurement sequences that require post-processing. To achieve robust association between vision and wireless data, we propose a multi-modal system that leverages users' depth measurements, smartphone WiFi Fine Timing Measurements (FTM), and inertial measurement unit (IMU) sensor data to associate users detected in camera footage with their corresponding smartphone identifiers.
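One simple way to frame such an association, shown in the hedged sketch below (it assumes both camera depth and FTM yield a range to each person over a short time window; names and the cost function are illustrative, not the thesis implementation), is to build a cost matrix between camera-derived and FTM-derived range sequences and solve a bipartite assignment:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(camera_ranges, ftm_ranges):
        """camera_ranges: (num_people, T) ranges from camera depth per detected person.
        ftm_ranges:      (num_phones, T) ranges from WiFi FTM per smartphone.
        Returns a list of (person_index, phone_index) pairs."""
        # Cost = mean absolute difference between the two range sequences.
        cost = np.abs(camera_ranges[:, None, :] - ftm_ranges[None, :, :]).mean(axis=2)
        rows, cols = linear_sum_assignment(cost)
        return list(zip(rows, cols))

    # Toy example: two people, two phones, four time steps (meters).
    cam = np.array([[3.0, 3.2, 3.5, 3.7],
                    [6.0, 5.8, 5.5, 5.2]])
    ftm = np.array([[6.1, 5.9, 5.6, 5.1],    # phone 0 moves with person 1
                    [3.1, 3.1, 3.6, 3.8]])   # phone 1 moves with person 0
    print(associate(cam, ftm))               # -> [(0, 1), (1, 0)]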
Furthermore, we propose a multi-modal localization approach that leverages pedestrians' visual and phone data to accurately estimate their positions. Existing localization approaches adopt filtering techniques to fuse multi-modal sensor data and produce location estimates. In our context, however, these algorithms become infeasible when a pedestrian's camera measurement is unavailable due to occlusion or the camera's limited field of view. To address this limitation, we propose a Generative Adversarial Network that leverages the available data correspondences from the vision and phone modalities to learn the underlying cross-modal linkage. With a pedestrian's phone measurements as input, the network generates coordinate estimates that are more accurate than the phone's original GPS readings. We further show that the proposed model supports self-learning: the generated coordinates can be associated with pedestrians' bounding-box coordinates to obtain additional camera-phone data correspondences during inference.
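A minimal sketch of this cross-modal idea is given below, assuming phone features (GPS plus a few IMU-derived values) as generator input and camera-derived coordinates as the real samples; the architecture, dimensions, and names are illustrative assumptions, not the exact network in the thesis.

    import torch
    import torch.nn as nn

    PHONE_DIM = 6   # assumed: GPS lat/lon plus a few IMU features
    COORD_DIM = 2   # refined (x, y) position

    class Generator(nn.Module):
        """Maps phone measurements (plus noise) to a coordinate estimate."""
        def __init__(self, noise_dim=8):
            super().__init__()
            self.noise_dim = noise_dim
            self.net = nn.Sequential(
                nn.Linear(PHONE_DIM + noise_dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, COORD_DIM),
            )

        def forward(self, phone):
            z = torch.randn(phone.size(0), self.noise_dim)
            return self.net(torch.cat([phone, z], dim=1))

    class Discriminator(nn.Module):
        """Scores whether a (phone, coordinate) pair looks like a real correspondence."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(PHONE_DIM + COORD_DIM, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Sigmoid(),
            )

        def forward(self, phone, coord):
            return self.net(torch.cat([phone, coord], dim=1))

    # During training, real pairs come from frames where the pedestrian is visible
    # (camera coordinates associated with phone data); at inference the generator
    # alone produces coordinates from phone measurements when the camera view is lost.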