AUTHOR=Zhang Hua, Gou Ruoyun, Shang Jili, Shen Fangyao, Wu Yifan, Dai Guojun

TITLE=Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

JOURNAL=Frontiers in Physiology

VOLUME=Volume 12 - 2021

YEAR=2021

URL=https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2021.643202

DOI=10.3389/fphys.2021.643202

ISSN=1664-042X

ABSTRACT=Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performance of SER relies heavily on the features extracted from speech signals. How to build effective feature extraction and classification models is still under intense research. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network and Bidirectional Long Short-Term Memory with Attention model (ADCNN-BLSTM). We first preprocess the speech samples by data enhancement and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. Then a DCNN model pre-trained on the ImageNet dataset is applied to generate segment-level features, and we stack the segment-level features of a sentence into utterance-level features. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed to a deep neural network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which is better than most popular methods and demonstrates the effectiveness of our proposed method for SER.
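
The abstract mentions a three-channel input (static, delta, delta-delta) and the UAR metric. The sketch below illustrates both ideas under stated assumptions and is not the authors' implementation: the function names `stack_three_channels` and `unweighted_average_recall` are hypothetical, the log Mel-spectrogram is assumed to be a precomputed array of shape (n_mels, n_frames), and the deltas are approximated with simple finite differences (`numpy.gradient`), whereas the paper may use a regression-based delta. UAR is taken here in its usual sense: the mean of the per-class recalls, so every emotion class counts equally regardless of its sample count.

```python
import numpy as np


def stack_three_channels(log_mel):
    """Stack static, delta and delta-delta maps into shape (3, n_mels, n_frames).

    Assumption: deltas are finite differences along the time axis (axis=1);
    this stands in for whatever delta computation the paper actually used.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    delta = np.gradient(log_mel, axis=1)    # first-order change over time
    delta2 = np.gradient(delta, axis=1)     # second-order change over time
    return np.stack([log_mel, delta, delta2], axis=0)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


if __name__ == "__main__":
    # Toy spectrogram: 3 Mel bands, 4 frames (illustrative values only).
    toy_log_mel = np.arange(12, dtype=float).reshape(3, 4)
    print(stack_three_channels(toy_log_mel).shape)

    # Toy labels: class 0 has three samples, class 1 has one.
    print(unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1]))
```

Because each class contributes one recall value to the mean, a classifier that ignores a rare emotion is penalised more heavily by UAR than by plain accuracy, which is why the metric suits imbalanced emotion corpora.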