AUTHOR=Xiong Haitao, Zhou Yuchen, Liu Jiaming, Cai Yuanyuan

TITLE=Class-dependent and cross-modal memory network considering sentimental features for video-based captioning

JOURNAL=Frontiers in Psychology

VOLUME=Volume 14 - 2023

YEAR=2023

URL=https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2023.1124369

DOI=10.3389/fpsyg.2023.1124369

ISSN=1664-1078

ABSTRACT=The video-based commonsense captioning task aims to add multiple commonsense descriptions (intention, effect, attribute) to video captions so that video content is better understood. Previous research mainly considered the interaction between previously predicted textual information and the next predicted textual information, but did not consider the importance of cross-modal mapping. In this paper, we propose a combined framework called Class-dependent and Cross-modal Memory Network considering SENtimental features (CCMN-SEN) for Video-based Captioning to enhance commonsense caption generation. First, we develop a class-dependent memory that records the alignment between video features and text; it allows cross-modal interaction and generation only on cross-modal matrices that share the same labels. Second, to capture the sentiments conveyed in the videos and generate accurate captions, we add sentiment features to facilitate commonsense caption generation. Experimental results demonstrate that the proposed CCMN-SEN significantly outperforms state-of-the-art methods.