Description
Title: Speech-based affective computing using attention with multimodal fusion
Date Created: 2020
Other Date: 2020-05 (degree)
Extent: 1 online resource (xvi, 119 pages) : illustrations
Description: Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, has become a popular task with the recent rapid advances in social media technology. Sentiment analysis and emotion recognition, both of which require applying subjective human concepts for detection, can be treated as two affective computing subtasks at different levels. A variety of data sources, including voice, facial expression, gesture, and linguistic content, have been employed for sentiment analysis and emotion recognition. In this research, we focus on a multimodal structure that leverages the advantages of the speech source on sentence-level data. Specifically, given an utterance, we consider the linguistic content and acoustic characteristics together to recognize the opinion or emotion it expresses. Our work is important and useful because speech is the most basic and commonly used form of human expression.
We first present two hybrid multimodal frameworks that predict human emotions and sentiments from utterance-level spoken language. The hybrid deep multimodal system extracts high-level features from both text and audio, considering spatial information from the text, temporal information from the audio, and high-level associations derived from low-level handcrafted features. The system fuses all extracted features at the utterance level with a three-layer deep neural network that learns correlations across modalities, and it trains the feature extraction and fusion modules jointly, allowing optimal global fine-tuning of the entire structure. Not all parts of the text and vocal signal contribute equally to the predictions: a single word may change the entire sentiment of a text, and a different vocal delivery may convey the opposite emotion despite identical linguistic content. To learn such variation, we introduce the hybrid attention multimodal system, which combines feature attention and modality attention to help the model learn informative representations for both modality-specific feature extraction and modality fusion.
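To make the modality attention component concrete, the following is a minimal sketch, not the thesis implementation, of attention-weighted fusion of utterance-level text and audio vectors followed by a three-layer classification network; the module names, dimensions, and scoring function are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis code): modality attention over
# utterance-level text and audio feature vectors, assuming both have already
# been projected to a common dimension.
import torch
import torch.nn as nn


class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # scores each modality vector
        self.classifier = nn.Sequential(          # three-layer DNN head, as in the abstract
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, audio_feat: torch.Tensor):
        # stack modalities: (batch, 2, dim)
        modalities = torch.stack([text_feat, audio_feat], dim=1)
        # attention weights over the two modalities: (batch, 2, 1)
        weights = torch.softmax(self.score(modalities), dim=1)
        fused = (weights * modalities).sum(dim=1)  # weighted sum: (batch, dim)
        return self.classifier(fused), weights


# Usage with random tensors standing in for extracted text/audio embeddings.
model = ModalityAttentionFusion(dim=128, num_classes=4)
logits, modality_weights = model(torch.randn(8, 128), torch.randn(8, 128))
```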
Although modality attention fusion has been demonstrated, combining textual and acoustic representations remains challenging. Most previous work combined multimodal information at a holistic level or fused modality-specific features extracted from entire utterances. However, to determine human meaning, it is critical to consider both the linguistic content of a word and how it is uttered. A loud pitch on different words may convey opposite emotions; for example, emphasis on “hell” may signal anger, while emphasis on “great” may signal happiness. Synchronized word-level attention across text and audio should therefore help recognize sentiments and emotions. We introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and its synchronized attention over modalities offers visual interpretability.
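A minimal sketch of synchronized word-level fusion with attention follows, assuming the audio has been aligned so that each word has one acoustic feature vector; the class, feature dimensions, and single shared attention are assumptions for illustration rather than the published architecture.

```python
# Minimal sketch (illustrative only): word-level fusion with attention over word
# positions, given word-aligned text and acoustic features of equal length.
import torch
import torch.nn as nn


class WordLevelFusion(nn.Module):
    def __init__(self, text_dim: int, audio_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(text_dim + audio_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, text_seq, audio_seq):
        # text_seq: (batch, words, text_dim), audio_seq: (batch, words, audio_dim)
        fused = torch.tanh(self.proj(torch.cat([text_seq, audio_seq], dim=-1)))
        weights = torch.softmax(self.attn(fused), dim=1)  # one weight per word
        utterance = (weights * fused).sum(dim=1)          # attended utterance vector
        return self.out(utterance), weights               # weights can be visualized


model = WordLevelFusion(text_dim=300, audio_dim=40, hidden=128, num_classes=4)
logits, word_attn = model(torch.randn(2, 12, 300), torch.randn(2, 12, 40))
```

The per-word attention weights returned here are the kind of quantity that can be inspected for the visual interpretability mentioned above.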
We further propose an efficient dyadic fusion network that relies solely on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Compared to previous work, the proposed model has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations, we introduce multiple sub-view attention layers that compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across the modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets. Our model significantly outperforms previous state-of-the-art results by 3.8%-7.5% in accuracy while using a more efficient model.
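One plausible reading of the learnable mutual correlation factor is a bilinear matrix inside the cross-modal attention scores; the sketch below illustrates that reading only and is not the authors' implementation.

```python
# Minimal sketch (an assumed reading of the idea): cross-modal attention with a
# learnable correlation matrix C scoring text-audio associations per utterance pair.
import torch
import torch.nn as nn


class MutualCorrelationAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.corr = nn.Parameter(torch.eye(dim))  # learnable mutual correlation factor

    def forward(self, text_seq, audio_seq):
        # text_seq, audio_seq: (batch, utterances, dim)
        # bilinear scores text_i^T C audio_j for every utterance pair
        scores = text_seq @ self.corr @ audio_seq.transpose(1, 2)
        t2a = torch.softmax(scores, dim=-1) @ audio_seq                 # audio attended by text
        a2t = torch.softmax(scores.transpose(1, 2), dim=-1) @ text_seq  # text attended by audio
        return torch.cat([text_seq + t2a, audio_seq + a2t], dim=-1)


layer = MutualCorrelationAttention(dim=128)
fused = layer(torch.randn(4, 10, 128), torch.randn(4, 10, 128))  # (4, 10, 256)
```

The label-embedding step could analogously replace one-hot targets with a k-dimensional vector, for example the normalized distribution of annotator votes, trained with a regression loss such as mean squared error; the exact embedding used is defined in the thesis itself.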
We finally introduce a novel human conversation analysis system that uses a hierarchical encoder-decoder framework to better combine features extracted from the linguistic, acoustic, and visual modalities. The hierarchical structure first encodes the multimodal data into word-level features, using the proposed word-level fusion with modality attention. The conversation-level encoder then selects important information from the word-level features with temporal attention and represents all conversation-level features as a single vector. Because emotion and sentiment may change over a conversation and multiple traits may be present simultaneously, our hierarchical decoder first decodes features at each time instance, and the attribute decoder then decodes the feature vector at each time instance into the attributes present at that time. Our system achieves state-of-the-art performance on three published datasets and outperforms other approaches in generalization testing.
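A minimal sketch of the hierarchical encoder-decoder flow described above, with assumed shapes and plain GRUs standing in for the encoders and decoders; it shows only the word-level encoding, temporal attention over the conversation, and per-time-instance attribute decoding.

```python
# Minimal sketch (illustrative shapes and layers, not the thesis architecture).
import torch
import torch.nn as nn


class HierarchicalConversationModel(nn.Module):
    def __init__(self, word_dim: int, hidden: int, num_attrs: int):
        super().__init__()
        self.word_encoder = nn.GRU(word_dim, hidden, batch_first=True)
        self.conv_encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.temporal_attn = nn.Linear(hidden, 1)
        self.time_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.attr_decoder = nn.Linear(hidden, num_attrs)  # multi-label attributes per step

    def forward(self, words):
        # words: (batch, utterances, words_per_utterance, word_dim)
        b, u, w, d = words.shape
        _, h = self.word_encoder(words.view(b * u, w, d))
        utt = h[-1].view(b, u, -1)                       # one vector per utterance
        enc, _ = self.conv_encoder(utt)                  # conversation-level encoding
        attn = torch.softmax(self.temporal_attn(enc), dim=1)
        context = (attn * enc).sum(dim=1, keepdim=True)  # conversation vector
        dec, _ = self.time_decoder(enc + context)        # decode per time instance
        return torch.sigmoid(self.attr_decoder(dec))     # attributes at each time


model = HierarchicalConversationModel(word_dim=64, hidden=128, num_attrs=6)
preds = model(torch.randn(2, 5, 20, 64))                 # (2, 5, 6)
```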
Note: Ph.D.
Note: Includes bibliographical references
Genre: theses, ETD doctoral
Language: English
Collection: School of Graduate Studies Electronic Theses and Dissertations
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.