Recently, machine learning, as a branch of artificial intelligence, has played an increasingly significant role in academia and industry, particularly in image classification, object detection, and video analytics. Recent reports of bias in multimedia algorithms (e.g., lower face-detection accuracy for women and persons of color) have underscored the urgent need to devise approaches that work equally well for different demographic groups. Hence, we posit that ensuring fairness in multimodal processing (e.g., equal performance irrespective of the gender of the user) is an important research challenge.
This dissertation makes three novel contributions to the literature on fairness in multimedia processing. We first focus on the problem of face matching (i.e., matching low-resolution and high-resolution images of a person). We describe how an adversarial deep learning approach allows the model to maintain face-matching accuracy while reducing demographic disparities compared to a non-adversarial deep learning baseline. The results motivate and pave the way for more accurate and fair face-matching algorithms.
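To make the adversarial idea concrete, the following is a minimal sketch, not the dissertation's implementation: a linear encoder is trained so that a task head can predict the match label while a gradient-reversal-style penalty prevents an adversary head from recovering a demographic attribute from the encoding. All names, dimensions, and hyperparameters here are illustrative assumptions on synthetic data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_adversarial(lam=1.0, lr=0.5, epochs=500, seed=0):
    """Toy adversarial debiasing: keep the task signal, suppress the attribute signal."""
    rng = np.random.default_rng(seed)
    n = 1000
    y = rng.integers(0, 2, n).astype(float)   # task label (e.g., match / non-match)
    a = rng.integers(0, 2, n).astype(float)   # protected attribute (independent of y)
    # Feature 0 carries the task signal, feature 1 carries the attribute signal.
    X = np.column_stack([2 * y - 1 + 0.3 * rng.standard_normal(n),
                         2 * a - 1 + 0.3 * rng.standard_normal(n),
                         rng.standard_normal((n, 2))])
    d, k = X.shape[1], 2
    W = 0.1 * rng.standard_normal((d, k))     # encoder
    u = 0.1 * rng.standard_normal(k)          # task head
    v = 0.1 * rng.standard_normal(k)          # adversary head
    for _ in range(epochs):
        Z = X @ W
        p_y, p_a = sigmoid(Z @ u), sigmoid(Z @ v)
        # Head gradients (binary cross-entropy).
        g_u = Z.T @ (p_y - y) / n
        g_v = Z.T @ (p_a - a) / n             # adversary minimises its own loss
        # Encoder: follow the task gradient, reverse the adversary's (gradient reversal).
        g_W = X.T @ ((p_y - y)[:, None] * u - lam * (p_a - a)[:, None] * v) / n
        u -= lr * g_u
        v -= lr * g_v
        W -= lr * g_W
    Z = X @ W
    task_acc = np.mean((sigmoid(Z @ u) > 0.5) == (y > 0.5))
    adv_acc = np.mean((sigmoid(Z @ v) > 0.5) == (a > 0.5))
    return task_acc, adv_acc
```

On this synthetic data, the encoder learns to retain the match-relevant direction while attenuating the attribute-bearing one, so task accuracy stays high while the adversary's accuracy drifts toward chance. The dissertation's models are deep networks rather than linear maps, but the minimax mechanics are the same in spirit.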
Second, we consider multimodal cyberbullying detection and propose a fairness-aware fusion framework that ensures both fairness and accuracy when combining data from multiple modalities. This Bayesian fusion framework is cognizant of the confidence level associated with each feature, the inter-dependencies between features, and, importantly, the fairness potential of each feature. Results of applying the framework to a multimodal (visual and textual) cyberbullying detection problem demonstrate its value in ensuring both accuracy and fairness.
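To illustrate the kind of weighting such a framework performs, the sketch below combines per-modality probabilities in log-odds space, weighting each modality by both a confidence score and a fairness score. This is a simplified assumption of ours for illustration; the actual framework is Bayesian and additionally models inter-feature dependencies, which this one-line weighting does not.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fairness_aware_fuse(probs, confidences, fairness_scores):
    """Fuse per-modality probabilities, down-weighting uncertain or unfair modalities.

    probs           -- per-modality probability of the positive class (e.g., bullying)
    confidences     -- reliability of each modality's score, in (0, 1]
    fairness_scores -- how evenly each modality performs across groups, in (0, 1]
    """
    weights = [c * f for c, f in zip(confidences, fairness_scores)]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise so the weights sum to 1
    fused_logit = sum(w * logit(p) for w, p in zip(weights, probs))
    return sigmoid(fused_logit)
```

With equal weights this reduces to averaging log-odds; a modality that is confident overall but performs unevenly across demographic groups (low fairness score) contributes little to the fused decision, which is the intuition behind fairness-aware fusion.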
Our third contribution revisits the problem of fairness in face matching and proposes a generative AI framework that can counter multiple kinds of bias (e.g., gender bias and age bias) simultaneously. The framework consists of two major components: a variational auto-encoder (VAE) that maps images to a more generic underlying representation, and a neural network that uses these representations for multi-label classification. A generative approach helps the system learn the underlying (latent) structure of the data, improving generalizability and reducing bias. The approach is tested on a public image dataset and found to be effective at reducing bias while maintaining high accuracy.
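The VAE component rests on two standard ingredients that can be sketched concisely: the reparameterization trick, which lets gradients flow through the latent sampling step, and the KL-divergence term that regularizes the latent space toward a standard normal prior. The code below is a generic sketch of these two pieces only; the dissertation's encoder/decoder architecture and downstream classifier are not reproduced here.

```python
import numpy as np

def reparameterize(mu, logvar, rng=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), differentiably in mu/logvar."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
```

A downstream classifier then consumes the latent codes rather than raw pixels, the intuition being that nuisance demographic attributes entangled in pixel space are less directly available once the data is compressed to its latent structure.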
In effect, the three contributions pave the way for fairer multimedia information processing, enabling security and personalization applications to provide equal opportunities to different demographic groups.