Description
Title: Speech-based affective computing using attention with multimodal fusion
Date Created: 2020
Other Date: 2020-05 (degree)
Extent: 1 online resource (xvi, 119 pages) : illustrations
Description: Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, has become a popular task with the recent rapid advances in social media technology. Sentiment analysis and emotion recognition, both of which require applying subjective human concepts for detection, can be treated as two affective computing subtasks at different levels. A variety of data sources, including voice, facial expression, gesture, and linguistic content, have been employed for sentiment analysis and emotion recognition. In this research, we focus on a multimodal structure that leverages the advantages of the speech source on sentence-level data. Specifically, given an utterance, we consider the linguistic content and acoustic characteristics together to recognize the opinion or emotion it expresses. Our work is important and useful because speech is the most basic and commonly used form of human expression.
We first present two hybrid multimodal frameworks that predict human emotions and sentiments from utterance-level spoken language. The hybrid deep multimodal system extracts high-level features from both text and audio, considering spatial information from the text, temporal information from the audio, and high-level associations derived from low-level handcrafted features. The system fuses all extracted features at the utterance level with a three-layer deep neural network that learns correlations across modalities, and it trains the feature extraction and fusion modules jointly, allowing optimal global fine-tuning of the entire structure. Not all parts of the text and vocal signal contribute equally to the predictions: a single word may change the entire sentiment of a text, and a different vocal delivery may convey the opposite emotion despite identical linguistic content. To learn such variation, we introduce the hybrid attention multimodal system, which combines feature attention and modality attention to help the model learn informative representations for both modality-specific feature extraction and modality fusion.
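To make the modality attention component concrete, the following is a minimal sketch, not the thesis implementation, of attention-weighted fusion of utterance-level text and audio vectors followed by a three-layer classification network; the module names, dimensions, and scoring function are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis code): modality attention over
# utterance-level text and audio feature vectors, assuming both have already
# been projected to a common dimension.
import torch
import torch.nn as nn


class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # scores each modality vector
        self.classifier = nn.Sequential(          # three-layer DNN head, as in the abstract
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, audio_feat: torch.Tensor):
        # stack modalities: (batch, 2, dim)
        modalities = torch.stack([text_feat, audio_feat], dim=1)
        # attention weights over the two modalities: (batch, 2, 1)
        weights = torch.softmax(self.score(modalities), dim=1)
        fused = (weights * modalities).sum(dim=1)  # weighted sum: (batch, dim)
        return self.classifier(fused), weights


# Usage with random tensors standing in for extracted text/audio embeddings.
model = ModalityAttentionFusion(dim=128, num_classes=4)
logits, modality_weights = model(torch.randn(8, 128), torch.randn(8, 128))
```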
Although modality attention fusion has been demonstrated, combining textual and acoustic representations remains challenging. Most previous work combined multimodal information at a holistic level or fused modality-specific features extracted from entire utterances. However, to determine human meaning, it is critical to consider both the linguistic content of a word and how it is uttered. A loud pitch on different words may convey opposite emotions; for example, emphasis on “hell” may signal anger, while emphasis on “great” may signal happiness. Synchronized word-level attention across text and audio should therefore help recognize sentiments and emotions. We introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and its synchronized attention over modalities offers visual interpretability.
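A minimal sketch of synchronized word-level fusion with attention follows, assuming the audio has been aligned so that each word has one acoustic feature vector; the class, feature dimensions, and single shared attention are assumptions for illustration rather than the published architecture.

```python
# Minimal sketch (illustrative only): word-level fusion with attention over word
# positions, given word-aligned text and acoustic features of equal length.
import torch
import torch.nn as nn


class WordLevelFusion(nn.Module):
    def __init__(self, text_dim: int, audio_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(text_dim + audio_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, text_seq, audio_seq):
        # text_seq: (batch, words, text_dim), audio_seq: (batch, words, audio_dim)
        fused = torch.tanh(self.proj(torch.cat([text_seq, audio_seq], dim=-1)))
        weights = torch.softmax(self.attn(fused), dim=1)  # one weight per word
        utterance = (weights * fused).sum(dim=1)          # attended utterance vector
        return self.out(utterance), weights               # weights can be visualized


model = WordLevelFusion(text_dim=300, audio_dim=40, hidden=128, num_classes=4)
logits, word_attn = model(torch.randn(2, 12, 300), torch.randn(2, 12, 40))
```

The per-word attention weights returned here are the kind of quantity that can be inspected for the visual interpretability mentioned above.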
We further propose an efficient dyadic fusion network that relies solely on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Compared to previous work, the proposed model has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations, we introduce multiple sub-view attention layers that compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across the modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets. Our model significantly outperforms previous state-of-the-art results by 3.8%-7.5% in accuracy while using a more efficient model.
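One plausible reading of the learnable mutual correlation factor is a bilinear matrix inside the cross-modal attention scores; the sketch below illustrates that reading only and is not the authors' implementation.

```python
# Minimal sketch (an assumed reading of the idea): cross-modal attention with a
# learnable correlation matrix C scoring text-audio associations per utterance pair.
import torch
import torch.nn as nn


class MutualCorrelationAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.corr = nn.Parameter(torch.eye(dim))  # learnable mutual correlation factor

    def forward(self, text_seq, audio_seq):
        # text_seq, audio_seq: (batch, utterances, dim)
        # bilinear scores text_i^T C audio_j for every utterance pair
        scores = text_seq @ self.corr @ audio_seq.transpose(1, 2)
        t2a = torch.softmax(scores, dim=-1) @ audio_seq                 # audio attended by text
        a2t = torch.softmax(scores.transpose(1, 2), dim=-1) @ text_seq  # text attended by audio
        return torch.cat([text_seq + t2a, audio_seq + a2t], dim=-1)


layer = MutualCorrelationAttention(dim=128)
fused = layer(torch.randn(4, 10, 128), torch.randn(4, 10, 128))  # (4, 10, 256)
```

The label-embedding step could analogously replace one-hot targets with a k-dimensional vector, for example the normalized distribution of annotator votes, trained with a regression loss such as mean squared error; the exact embedding used is defined in the thesis itself.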
We finally introduce a novel human conversation analysis system that uses a hierarchical encoder-decoder framework to better combine features extracted from the linguistic, acoustic, and visual modalities. The hierarchical structure first encodes the multimodal data into word-level features, using the proposed word-level fusion with modality attention. The conversation-level encoder then selects important information from the word-level features with temporal attention and represents all conversation-level features as a single vector. Because emotion and sentiment may change over a conversation and multiple traits may be present simultaneously, our hierarchical decoder first decodes features at each time instance, and the attribute decoder then decodes the feature vector at each time instance into the attributes present at that time. Our system achieves state-of-the-art performance on three published datasets and outperforms other approaches in generalization testing.
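A minimal sketch of the hierarchical encoder-decoder flow described above, with assumed shapes and plain GRUs standing in for the encoders and decoders; it shows only the word-level encoding, temporal attention over the conversation, and per-time-instance attribute decoding.

```python
# Minimal sketch (illustrative shapes and layers, not the thesis architecture).
import torch
import torch.nn as nn


class HierarchicalConversationModel(nn.Module):
    def __init__(self, word_dim: int, hidden: int, num_attrs: int):
        super().__init__()
        self.word_encoder = nn.GRU(word_dim, hidden, batch_first=True)
        self.conv_encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.temporal_attn = nn.Linear(hidden, 1)
        self.time_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.attr_decoder = nn.Linear(hidden, num_attrs)  # multi-label attributes per step

    def forward(self, words):
        # words: (batch, utterances, words_per_utterance, word_dim)
        b, u, w, d = words.shape
        _, h = self.word_encoder(words.view(b * u, w, d))
        utt = h[-1].view(b, u, -1)                       # one vector per utterance
        enc, _ = self.conv_encoder(utt)                  # conversation-level encoding
        attn = torch.softmax(self.temporal_attn(enc), dim=1)
        context = (attn * enc).sum(dim=1, keepdim=True)  # conversation vector
        dec, _ = self.time_decoder(enc + context)        # decode per time instance
        return torch.sigmoid(self.attr_decoder(dec))     # attributes at each time


model = HierarchicalConversationModel(word_dim=64, hidden=128, num_attrs=6)
preds = model(torch.randn(2, 5, 20, 64))                 # (2, 5, 6)
```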
Note: Ph.D.
Note: Includes bibliographical references
Genre: theses, ETD doctoral
Language: English
Collection: School of Graduate Studies Electronic Theses and Dissertations
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.