DNN Multimodal Fusion Techniques for Predicting Video Sentiment

Jennifer Williams, Ramona Comanescu, Oana Radu, Leimin Tian

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract / Description of output

We present our work on sentiment prediction using the benchmark MOSI dataset from the CMU-MultimodalDataSDK. Previous work on multimodal sentiment analysis has focused on input-level feature fusion or decision-level fusion. Here, we propose an intermediate-level feature fusion, which merges weights from each modality (audio, video, and text) during training, with subsequent additional training. Moreover, we tested principal component analysis (PCA) for feature selection. We found that applying PCA increases unimodal performance, and that multimodal fusion outperforms unimodal models. Our experiments show that our proposed intermediate-level feature fusion outperforms other fusion techniques, achieving the best performance with an overall binary accuracy of 74.0% on the video+text modalities. Our work also improves feature selection for unimodal sentiment analysis, while proposing a novel and effective multimodal fusion architecture for this task.
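To make the fusion scheme concrete, below is a minimal PyTorch sketch of intermediate-level fusion: each modality gets its own encoder, the hidden representations are concatenated, and a small fusion head is trained on top with further passes. The layer sizes, feature dimensions, and the plain feed-forward encoders are illustrative assumptions for this sketch, not the exact architecture or training schedule reported in the paper.

import torch
import torch.nn as nn

class IntermediateFusionModel(nn.Module):
    def __init__(self, audio_dim=74, video_dim=47, text_dim=300, hidden_dim=32):
        super().__init__()
        # One encoder per modality; in the intermediate-fusion setting the
        # modality weights are merged after (pre-)training each branch.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion head trained with additional passes on top of the merged
        # intermediate representations.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # binary sentiment logit
        )

    def forward(self, audio, video, text):
        # Merge intermediate (hidden) representations rather than raw features
        # (input-level fusion) or separate predictions (decision-level fusion).
        h = torch.cat(
            [self.audio_enc(audio), self.video_enc(video), self.text_enc(text)],
            dim=-1,
        )
        return self.fusion(h)

# Usage on dummy batches of MOSI-style features (dimensions are placeholders).
model = IntermediateFusionModel()
logits = model(torch.randn(8, 74), torch.randn(8, 47), torch.randn(8, 300))

The PCA feature selection described in the abstract would be applied to the unimodal features before they reach the encoders (e.g. with scikit-learn's sklearn.decomposition.PCA); the number of retained components is likewise a tuning choice not fixed by this sketch.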
Original language: English
Title of host publication: Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Publisher: Association for Computational Linguistics
Pages: 64-72
Number of pages: 9
Publication status: Published - Jul 2018
Event: Grand Challenge and Workshop on Human Multimodal Language - Melbourne, Australia
Duration: 20 Jul 2018 - 20 Jul 2018
http://multicomp.cs.cmu.edu/acl2018multimodalchallenge/

Conference

Conference: Grand Challenge and Workshop on Human Multimodal Language
Abbreviated title: ACL 2018
Country/Territory: Australia
City: Melbourne
Period: 20/07/18 - 20/07/18
Internet address: http://multicomp.cs.cmu.edu/acl2018multimodalchallenge/
