AUTHOR=Ilias Loukas, Askounis Dimitris
TITLE=Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts
JOURNAL=Frontiers in Aging Neuroscience
VOLUME=14
YEAR=2022
URL=https://www.frontiersin.org/journals/aging-neuroscience/articles/10.3389/fnagi.2022.830943
DOI=10.3389/fnagi.2022.830943
ISSN=1663-4365
ABSTRACT=
Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for the patients but also for their families, relatives, and society in general. Despite the significance of this phenomenon and the importance of early diagnosis, limitations remain. The main limitation pertains to the way the modalities of speech and transcripts are combined in a single neural network. Existing research works add or concatenate the image and text representations, employ majority-voting approaches, or average the predictions after training many textual and speech models separately. To address these limitations, in this article we present new methods to detect AD patients and predict Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner, consisting of a combination of BERT, Vision Transformer, Co-Attention, Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms, their delta, and delta-delta (acceleration) values. First, we pass each transcript and image through a BERT model and a Vision Transformer, respectively, adding a co-attention layer on top, which generates image and word attention simultaneously. Second, we propose an architecture which integrates multimodal information into a BERT model
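As an illustration of the pipeline described in the abstract, the sketch below converts a waveform into a stacked Log-Mel spectrogram with delta and delta-delta channels and cross-attends BERT and ViT token sequences with a simple co-attention block. This is a minimal sketch of the general idea, not the authors' published implementation; the sampling rate, number of Mel bins, pooling strategy, dimensions, and class/function names are assumptions introduced here for illustration.

```python
# Minimal sketch (not the paper's exact code): librosa builds the Log-Mel/delta/delta-delta
# "image", and a PyTorch cross-attention module stands in for the co-attention layer.
import numpy as np
import librosa
import torch
import torch.nn as nn

def audio_to_logmel_image(wav_path, sr=16000, n_mels=224):
    """Stack Log-Mel spectrogram, delta, and delta-delta into a 3-channel array."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    delta = librosa.feature.delta(log_mel)            # first-order differences
    delta2 = librosa.feature.delta(log_mel, order=2)  # acceleration values
    return np.stack([log_mel, delta, delta2], axis=0)  # shape: (3, n_mels, frames)

class CoAttentionFusion(nn.Module):
    """Cross-attends BERT token states and ViT patch states, then classifies AD vs. control."""
    def __init__(self, dim=768, heads=8, num_classes=2):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_states, image_states):
        # text_states: (B, T, dim) from BERT; image_states: (B, P, dim) from ViT
        att_text, _ = self.text_to_image(text_states, image_states, image_states)
        att_image, _ = self.image_to_text(image_states, text_states, text_states)
        pooled = torch.cat([att_text.mean(dim=1), att_image.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

For the MMSE-prediction task mentioned in the abstract, the same fused representation could instead feed a single-output regression head; that choice, like the mean pooling above, is an assumption for this sketch.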