AUTHOR=Ding Huijun, Du Zhou, Wang Ziwei, Xue Junqi, Wei Zhaoguo, Yang Kongjun, Jin Shan, Zhang Zhiguo, Wang Jianhong

TITLE=IntervoxNet: a novel dual-modal audio-text fusion network for automatic and efficient depression detection from interviews

JOURNAL=Frontiers in Physics

VOLUME=Volume 12 - 2024

YEAR=2024

URL=https://www.frontiersin.org/journals/physics/articles/10.3389/fphy.2024.1430035

DOI=10.3389/fphy.2024.1430035

ISSN=2296-424X

ABSTRACT=Depression is a prevalent mental health problem across the globe, presenting significant social and economic challenges. Early detection and treatment are pivotal in reducing these impacts and improving patient outcomes. Traditional diagnostic methods largely rely on subjective assessments by psychiatrists, underscoring the importance of developing automated and objective diagnostic tools. This paper presents IntervoxNet, a novel computer-aided detection system designed specifically for analyzing interview audio. IntervoxNet takes a dual-modal approach, using the Audio Mel-Spectrogram Transformer (AMST) for audio processing and a hybrid model combining Bidirectional Encoder Representations from Transformers with a Convolutional Neural Network (BERT-CNN) for text analysis. Evaluated on the DAIC-WOZ database, IntervoxNet demonstrates excellent performance, achieving an F1 score, recall, precision, and accuracy of 0.90, 0.92, 0.88, and 0.86, respectively, thereby surpassing existing state-of-the-art methods. These results demonstrate IntervoxNet's potential as a highly effective and efficient tool for rapid depression screening in interview settings.