AUTHOR=Ding Huijun, Du Zhou, Wang Ziwei, Xue Junqi, Wei Zhaoguo, Yang Kongjun, Jin Shan, Zhang Zhiguo, Wang Jianhong TITLE=IntervoxNet: a novel dual-modal audio-text fusion network for automatic and efficient depression detection from interviews JOURNAL=Frontiers in Physics VOLUME=12 YEAR=2024 URL=https://www.frontiersin.org/journals/physics/articles/10.3389/fphy.2024.1430035 DOI=10.3389/fphy.2024.1430035 ISSN=2296-424X ABSTRACT=
Depression is a prevalent mental health problem across the globe, presenting significant social and economic challenges. Early detection and treatment are pivotal in reducing these impacts and improving patient outcomes. Traditional diagnostic methods rely largely on subjective assessments by psychiatrists, which underscores the importance of developing automated and objective diagnostic tools. This paper presents IntervoxNet, a novel computer-aided detection system designed specifically for analyzing interview audio. IntervoxNet takes a dual-modal approach, using the Audio Mel-Spectrogram Transformer (AMST) for audio processing and a hybrid model that combines Bidirectional Encoder Representations from Transformers with a Convolutional Neural Network (BERT-CNN) for text analysis. Evaluated on the DAIC-WOZ database, IntervoxNet achieves an F1 score, recall, precision, and accuracy of 0.90, 0.92, 0.88, and 0.86, respectively, surpassing existing state-of-the-art methods. These results demonstrate IntervoxNet’s potential as an effective and efficient tool for rapid depression screening in interview settings.
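
As a quick consistency check on the reported metrics, the F1 score can be recomputed from the reported precision and recall using the standard definition (their harmonic mean). The snippet below is a minimal sketch and is not taken from the paper; the function name `f1_score` and the variable names are illustrative assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract (hypothetical variable names).
reported_precision = 0.88
reported_recall = 0.92

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from precision and recall: {f1:.2f}")
```

Rounded to two decimals, the recomputed value agrees with the F1 score of 0.90 stated in the abstract.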