ORIGINAL RESEARCH article
Front. Psychiatry
Sec. Computational Psychiatry
Volume 15 - 2024
doi: 10.3389/fpsyt.2024.1466507
This article is part of the Research Topic Machine Learning and Statistical Models: Unraveling Patterns and Enhancing Understanding of Mental Disorders.
DepITCM: an audio-visual method for detecting depression
Provisionally accepted
1 Qilu University of Technology, Jinan, China
2 Shandong Mental Health Center, Jinan, Shandong Province, China
Introduction: Depression is a prevalent mental disorder, and early screening and timely treatment are crucial. However, existing deep models based on audio-visual data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-visual data, and very few studies in depression detection attend to all three dimensions of information (time, channel, and space) simultaneously. In addition, leveraging auxiliary tasks to improve prediction accuracy remains challenging. Resolving these issues is crucial for building depression detection models.
Methods: In this paper, we propose DepITCM, a multi-task representation learning model for depression detection based on visual and audio data. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel principal component analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature extraction strategy that moves from global to local features; this captures global features while emphasizing a finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, we enhance the primary task of depression classification with an auxiliary regression task to improve overall performance.
Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks. Furthermore, we
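To make the abstract's two core ideas concrete, the sketch below illustrates, in PyTorch, one plausible reading of them: an encoder block that fuses temporal, channel, and spatial information, and a multi-task head in which depression classification is the primary task and severity regression is the auxiliary task. This is a minimal sketch, not the authors' released implementation; every module name, tensor shape, and the auxiliary loss weight are illustrative assumptions.

    # Minimal sketch (assumed, not the paper's code) of temporal/channel/spatial
    # fusion plus a classification-primary, regression-auxiliary multi-task head.
    import torch
    import torch.nn as nn

    class TemporalChannelSpatialBlock(nn.Module):
        """Toy global-to-local block: global pooling drives channel attention,
        a small conv drives spatial attention, and a GRU models time."""
        def __init__(self, channels: int):
            super().__init__()
            self.channel_gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, channels), nn.Sigmoid())
            self.spatial_gate = nn.Sequential(
                nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
            self.temporal = nn.GRU(channels, channels, batch_first=True)

        def forward(self, x):  # x: (batch, time, channels, H, W)
            b, t, c, h, w = x.shape
            frames = x.reshape(b * t, c, h, w)
            # Channel attention (global), then spatial attention (local).
            frames = frames * self.channel_gate(frames).reshape(b * t, c, 1, 1)
            frames = frames * self.spatial_gate(frames)
            pooled = frames.mean(dim=(2, 3)).reshape(b, t, c)
            out, _ = self.temporal(pooled)   # temporal modeling over frames
            return out.mean(dim=1)           # clip-level embedding

    class MultiTaskDepressionModel(nn.Module):
        def __init__(self, channels: int = 64):
            super().__init__()
            self.encoder = TemporalChannelSpatialBlock(channels)
            self.classifier = nn.Linear(channels, 2)  # depressed / not depressed
            self.regressor = nn.Linear(channels, 1)   # severity score (assumed)

        def forward(self, x):
            z = self.encoder(x)
            return self.classifier(z), self.regressor(z).squeeze(-1)

    def multi_task_loss(logits, score_pred, labels, scores, aux_weight=0.3):
        # Classification is the primary objective; regression is auxiliary.
        # aux_weight is an assumed hyperparameter, not taken from the paper.
        return (nn.functional.cross_entropy(logits, labels)
                + aux_weight * nn.functional.mse_loss(score_pred, scores))

In this reading, a single encoder embedding feeds both heads, so gradients from the auxiliary regression loss regularize the shared representation used by the primary classification task.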
Keywords: depression detection, multimodal, feature extraction, multi-task learning, DepITCM
Received: 18 Jul 2024; Accepted: 26 Dec 2024.
Copyright: © 2024 Zhang, Liu, Wan, Fan, Chen, Wang, Zhang and Zheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Qingxiang Wang, Shandong Mental Health Center, Jinan, 17035517, Shandong Province, China
Kaihong Zhang, Shandong Mental Health Center, Jinan, 17035517, Shandong Province, China
Yunshao Zheng, Shandong Mental Health Center, Jinan, 17035517, Shandong Province, China
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.