Speech-based affective computing using attention with multimodal fusion

Descriptive

TitleInfo
Title
Speech-based affective computing using attention with multimodal fusion
Name (type = personal)
NamePart (type = family)
Gu
NamePart (type = given)
Yue
DisplayForm
Yue Gu
Role
RoleTerm (authority = RULIB)
author
Name (type = personal)
NamePart (type = family)
Marsic
NamePart (type = given)
Ivan
DisplayForm
Ivan Marsic
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
chair
Name (type = personal)
NamePart (type = family)
Wei
NamePart (type = given)
Sheng
DisplayForm
Sheng Wei
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
internal member
Name (type = personal)
NamePart (type = family)
Sarwate
NamePart (type = given)
Anand D.
DisplayForm
Anand D. Sarwate
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
internal member
Name (type = personal)
NamePart (type = family)
Zhang
NamePart (type = given)
Yongfeng
DisplayForm
Yongfeng Zhang
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
outside member
Name (type = corporate)
NamePart
Rutgers University
Role
RoleTerm (authority = RULIB)
degree grantor
Name (type = corporate)
NamePart
School of Graduate Studies
Role
RoleTerm (authority = RULIB)
school
TypeOfResource
Text
Genre (authority = marcgt)
theses
OriginInfo
DateCreated (encoding = w3cdtf); (keyDate = yes); (qualifier = exact)
2020
DateOther (encoding = w3cdtf); (qualifier = exact); (type = degree)
2020-05
CopyrightDate (encoding = w3cdtf); (qualifier = exact)
2020
Language
LanguageTerm (authority = ISO 639-3:2007); (type = text)
English
Abstract (type = abstract)
Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, has become a popular task with the recent rapid advancement of social media technology. Sentiment analysis and emotion recognition, both of which require applying subjective human concepts for detection, can be treated as two affective computing subtasks at different levels. A variety of data sources, including voice, facial expression, gesture, and linguistic content, have been employed in sentiment analysis and emotion recognition. In this research, we focus on a multimodal structure that leverages the advantages of the speech source on sentence-level data. Specifically, given an utterance, we consider the linguistic content and acoustic characteristics together to recognize the opinion or emotion. This work is important and useful because speech is the most basic and commonly used form of human expression.

We first present two hybrid multimodal frameworks to predict human emotions and sentiments from utterance-level spoken language. The hybrid deep multimodal system extracts high-level features from both text and audio, capturing the spatial information in text, the temporal information in audio, and high-level associations derived from low-level handcrafted features. The system fuses all extracted features at the utterance level with a three-layer deep neural network to learn correlations across modalities, and trains the feature extraction and fusion modules together, allowing global fine-tuning of the entire structure. Not all parts of the text and vocal signals contribute equally to the predictions: a single word may change the sentiment of an entire sentence, and a different vocal delivery may convey the opposite emotion despite identical linguistic content. To capture such variation, we introduce the hybrid attention multimodal system, which combines feature attention and modality attention to help the model learn informative representations for both modality-specific feature extraction and multimodal fusion.
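
As a rough illustration of the modality attention idea described above, the following minimal PyTorch sketch weights utterance-level text and audio feature vectors before a small fusion network. It is not the thesis implementation; the class name, layer sizes, and the choice of PyTorch are assumptions made for the example.

import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Sketch: score each modality-specific feature vector, softmax the
    scores into modality weights, and fuse the weighted features with a
    small three-layer network (all sizes hypothetical)."""
    def __init__(self, text_dim=128, audio_dim=128, hidden_dim=64, num_classes=4):
        super().__init__()
        self.text_score = nn.Linear(text_dim, 1)    # scores the text feature vector
        self.audio_score = nn.Linear(audio_dim, 1)  # scores the audio feature vector
        self.fusion = nn.Sequential(                # three-layer fusion network
            nn.Linear(text_dim + audio_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat):
        # One scalar score per modality, normalized across the two modalities.
        scores = torch.cat([self.text_score(text_feat),
                            self.audio_score(audio_feat)], dim=-1)   # (batch, 2)
        weights = torch.softmax(scores, dim=-1)                      # modality attention
        fused = torch.cat([weights[:, :1] * text_feat,
                           weights[:, 1:] * audio_feat], dim=-1)
        return self.fusion(fused)

# Usage with random utterance-level features for a batch of 8 utterances.
model = ModalityAttentionFusion()
logits = model(torch.randn(8, 128), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4])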

Although modality attention improves fusion, combining the textual and acoustic representations remains challenging. Most previous work combined multimodal information at a holistic level or fused modality-specific features extracted from entire utterances. However, to determine human meaning, it is critical to consider both the linguistic content of a word and how it is uttered. A raised pitch on different words may convey opposite emotions: emphasis on "hell" may indicate anger, whereas emphasis on "great" may indicate happiness. Synchronized word-level attention across text and audio should therefore help recognize sentiments and emotions. We thus introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. The model outperforms state-of-the-art approaches on published datasets, and its synchronized attention over modalities offers visual interpretability.
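
A minimal sketch of word-level fusion with attention is given below, assuming the acoustic features have already been aligned to word boundaries; the names and dimensions are hypothetical, and the returned word weights are the kind of signal that would support the attention visualizations mentioned above.

import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    """Sketch: fuse word-aligned text and audio features per word, then
    attend over the fused word sequence to build an utterance vector."""
    def __init__(self, text_dim=64, audio_dim=64, hidden_dim=64, num_classes=4):
        super().__init__()
        self.word_fuse = nn.Linear(text_dim + audio_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)          # word-level attention score
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_seq, audio_seq):
        # text_seq, audio_seq: (batch, num_words, dim), aligned word by word.
        fused = torch.tanh(self.word_fuse(torch.cat([text_seq, audio_seq], dim=-1)))
        weights = torch.softmax(self.attn_score(fused), dim=1)   # (batch, words, 1)
        utterance = (weights * fused).sum(dim=1)                 # weighted sum over words
        return self.classifier(utterance), weights.squeeze(-1)   # weights can be visualized

model = WordLevelFusion()
logits, attn = model(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
print(logits.shape, attn.shape)  # torch.Size([2, 4]) torch.Size([2, 10])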

We further propose an efficient dyadic fusion network that relies only on an attention mechanism to select representative vectors, fuse modality-specific features, and learn sequence information. Compared to previous work, the proposed model has three distinct characteristics: (1) instead of using a recurrent neural network to extract temporal associations as in previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances, which significantly improves model efficiency; (2) to improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across modalities; and (3) to overcome the label-disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem, which provides more accurate annotation information and makes full use of the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets. Our model outperforms previous state-of-the-art results by 3.8%-7.5% accuracy while using a more efficient architecture.
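
One possible reading of the learnable mutual correlation factor is a trainable scalar that rescales cross-modal attention scores before the softmax, as in the hypothetical sketch below; the layer names and sizes are assumed for illustration rather than taken from the thesis.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of one attention layer with a learnable mutual correlation
    factor: scores between text queries and audio keys are scaled by a
    trainable scalar before the softmax (dimensions hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.correlation = nn.Parameter(torch.ones(1))  # learnable mutual correlation factor

    def forward(self, text_seq, audio_seq):
        # text_seq, audio_seq: (batch, seq_len, dim) sequences of utterance features.
        scores = self.q(text_seq) @ self.k(audio_seq).transpose(1, 2)     # (batch, T, T)
        scores = self.correlation * scores / text_seq.size(-1) ** 0.5     # scaled cross-modal scores
        return torch.softmax(scores, dim=-1) @ self.v(audio_seq)

layer = CrossModalAttention()
out = layer(torch.randn(2, 12, 64), torch.randn(2, 12, 64))
print(out.shape)  # torch.Size([2, 12, 64])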

Finally, we introduce a human conversation analysis system that uses a hierarchical encoder-decoder framework to better combine features extracted from the linguistic, acoustic, and visual modalities. The hierarchical structure first encodes the multimodal data into word-level features, again using word-level fusion with modality attention. A conversation-level encoder then selects important information from the word-level features with temporal attention and represents the conversation-level features as a single vector. Because emotion and sentiment may change over a conversation and multiple traits may be present simultaneously, the hierarchical decoder first decodes features at each time instance, and an attribute decoder then decodes the feature vector at each time instance into the attributes present at that time. Our system achieves state-of-the-art performance on three published datasets and outperforms others in generalization testing.
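
A rough sketch of the conversation-level encoding step with temporal attention follows; the recurrent encoder and all sizes are assumptions made for the example, and the word-level encoding and attribute decoding are omitted for brevity.

import torch
import torch.nn as nn

class ConversationEncoder(nn.Module):
    """Sketch: run a GRU over a sequence of word-level feature vectors,
    apply temporal attention over the hidden states, and return a single
    conversation-level vector (hypothetical sizes, not the thesis code)."""
    def __init__(self, word_dim=64, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True)
        self.attn_score = nn.Linear(hidden_dim, 1)  # temporal attention score

    def forward(self, word_feats):
        # word_feats: (batch, num_words, word_dim) fused word-level features.
        states, _ = self.rnn(word_feats)                          # (batch, words, hidden)
        weights = torch.softmax(self.attn_score(states), dim=1)   # select important steps
        return (weights * states).sum(dim=1)                      # conversation-level vector

encoder = ConversationEncoder()
conv_vec = encoder(torch.randn(4, 50, 64))
print(conv_vec.shape)  # torch.Size([4, 64])
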
Subject (authority = LCSH)
Topic
Human-computer interaction
Subject (authority = RUETD)
Topic
Electrical and Computer Engineering
RelatedItem (type = host)
TitleInfo
Title
Rutgers University Electronic Theses and Dissertations
Identifier (type = RULIB)
ETD
Identifier
ETD_10715
PhysicalDescription
Form (authority = gmd)
InternetMediaType
application/pdf
InternetMediaType
text/xml
Extent
1 online resource (xvi, 119 pages) : illustrations
Note (type = degree)
Ph.D.
Note (type = bibliography)
Includes bibliographical references
RelatedItem (type = host)
TitleInfo
Title
School of Graduate Studies Electronic Theses and Dissertations
Identifier (type = local)
rucore10001600001
Location
PhysicalLocation (authority = marcorg); (displayLabel = Rutgers, The State University of New Jersey)
NjNbRU
Identifier (type = doi)
doi:10.7282/t3-s7ze-0304
Genre (authority = ExL-Esploro)
ETD doctoral

Rights

RightsDeclaration (ID = rulibRdec0006)
The author owns the copyright to this work.
RightsHolder (type = personal)
Name
FamilyName
Gu
GivenName
Yue
Role
Copyright Holder
RightsEvent
Type
Permission or license
DateTime (encoding = w3cdtf); (qualifier = exact); (point = start)
2020-04-07 15:11:00
AssociatedEntity
Name
Yue Gu
Role
Copyright holder
Affiliation
Rutgers University. School of Graduate Studies
AssociatedObject
Type
License
Name
Author Agreement License
Detail
I hereby grant to the Rutgers University Libraries and to my school the non-exclusive right to archive, reproduce and distribute my thesis or dissertation, in whole or in part, and/or my abstract, in whole or in part, in and from an electronic format, subject to the release date subsequently stipulated in this submittal form and approved by my school. I represent and stipulate that the thesis or dissertation and its abstract are my original work, that they do not infringe or violate any rights of others, and that I make these grants as the sole owner of the rights to my thesis or dissertation and its abstract. I represent that I have obtained written permissions, when necessary, from the owner(s) of each third party copyrighted matter to be included in my thesis or dissertation and will supply copies of such upon request by my school. I acknowledge that RU ETD and my school will not distribute my thesis or dissertation or its abstract if, in their reasonable judgment, they believe all such rights have not been secured. I acknowledge that I retain ownership rights to the copyright of my work. I also retain the right to use all or part of this thesis or dissertation in future works, such as articles or books.
Copyright
Status
Copyright protected
Availability
Status
Open
Reason
Permission or license

Technical

RULTechMD (ID = TECHNICAL1)
ContentModel
ETD
OperatingSystem (VERSION = 5.1)
windows xp
CreatingApplication
Version
1.5
ApplicationName
pdfTeX-1.40.19
DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)
2020-03-27T11:49:19