Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing

Wang, Jingyi

doi:10.3389/fnins.2024.1449527

ORIGINAL RESEARCH article

Front. Neurosci., 07 August 2024

Sec. Visual Neuroscience

Volume 18 - 2024 | https://doi.org/10.3389/fnins.2024.1449527

This article is part of the Research TopicAdvances in Computer Vision: From Deep Learning Models to Practical ApplicationsView all 12 articles

Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing

Jingyi Wang^*

School of Mass-communication and Advertising, Tongmyong University, Busan, Republic of Korea

Facial expression recognition (FER) plays a crucial role in affective computing, enhancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite advancements in deep learning, current FER systems often struggle with challenges such as occlusions, head pose variations, and motion blur in natural environments. These challenges highlight the need for more robust FER solutions. To address these issues, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features separately. The ASF module selectively enhances relevant features by applying global and local attention mechanisms to the extracted features. The MTE captures long-range dependencies and models the complex relationships between features, collectively improving feature representation and classification accuracy. Our model was evaluated on the RAF-DB and AffectNet datasets. Experimental results demonstrate that the AEMT model achieved an accuracy of 81.45% on RAF-DB and 71.23% on AffectNet, significantly outperforming existing state-of-the-art methods. These results indicate that our model effectively addresses the challenges of FER in natural environments, providing a more robust and accurate solution. The AEMT model significantly advances the field of FER by improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work not only enhances the capabilities of affective computing systems but also opens new avenues for future research in improving model efficiency and expanding multimodal data integration.

1 Introduction

In the field of affective computing, facial expression recognition (FER) has garnered significant attention due to its natural and powerful means of conveying human emotions. FER systems have critical applications in psychology research, human-computer interaction, driver fatigue monitoring, and more. However, there are still many challenges to facial expression recognition in natural environments. Factors such as occlusion, changes in head pose (Sun et al., 2021; Xu et al., 2022), facial distortion and motion blurring exacerbate the challenges to such recognition, as shown in Figure 1. These factors lead to significant changes in facial appearance, complicating the task of accurately recognizing expressions and causing traditional recognition methods in laboratory settings to perform poorly in real-world applications (Borgalli and Surve, 2022). Therefore, how to achieve efficient and accurate facial expression recognition in complex environments has become an urgent problem in the field (Zeng et al., 2019; Li et al., 2020b).

Figure 1

Figure 1. Samples from the AffectNet (Sun et al., 2021) and RAF-DB (Li et al., 2017) datasets, emphasizing the variations in head poses, occlusions, and other unconstrained conditions present in real-world images. AffectNet includes eight expression labels, incorporating the contempt category, while RAF-DB is annotated with seven basic expressions and additional compound expressions.

The advent of deep learning has provided new opportunities for FER. Convolutional neural networks (CNNs) and other deep learning models have made significant strides in feature extraction and classification accuracy. Deep learning models automatically learn complex features from data, enhancing the accuracy and robustness of FER. For example, Tang et al. (2019) proposed a CNN model that significantly improved performance by replacing the softmax layer with a linear support vector machine (SVM) for classification. Similarly, Kim et al. developed a deep locality-preserving CNN (DCNN-RF) method to enhance feature discriminativeness (Li et al., 2019; Kim et al., 2023). Despite these advancements, the performance of deep learning methods in natural environments still leaves much to be desired (Kollias and Zafeiriou, 2019).

Currently, the application of deep learning in natural environments faces several challenges, including insufficient data, weak model generalization, and difficulty in feature extraction under complex conditions. Most existing methods are trained and tested in controlled environments, performing poorly in real-world scenarios. Additionally, the limited quantity and quality of available datasets hinder the effective training of deep learning models, resulting in unstable performance in natural environments (Wang X. et al., 2020; Zeng et al., 2020). Existing methods often fail to account for the diversity of real-world conditions, such as varying lighting, occlusions, and head poses, leading to reduced robustness and accuracy.

To address these challenges, this paper proposes an improved visual Transformer model that combines attention mechanisms and multi-layer Transformer encoders, incorporating transfer learning to leverage the advantages of pre-trained models on large-scale datasets (Liu et al., 2021). Specifically, the proposed method involves two main steps: first, using a dual-branch CNN to extract RGB and LBP (Local Binary Pattern) features, which are then fused using an ASF module. The ASF module integrates global and local attention mechanisms to effectively combine various features, enhancing feature representation richness (Zhao et al., 2020; Zhang et al., 2021). Second, a multi-layer Transformer encoder models the global relationships of the fused features, and the pre-trained model is fine-tuned to improve adaptability to new datasets. The Transformer encoder, through multi-head self-attention mechanisms, captures long-range dependencies among features, thereby improving recognition capabilities (Ma et al., 2021).

The proposed model addresses the limitations of existing methods by enhancing feature extraction and improving generalization. The dual-branch CNN captures both color and texture information through RGB and LBP features, addressing the issue of insufficient feature representation. The ASF module further enhances this by selectively focusing on the most relevant features, improving the model's ability to handle occlusions and varying head poses. The multi-layer Transformer encoder with transfer learning leverages pre-trained models to improve performance on smaller datasets, addressing the challenge of insufficient training data and enhancing model generalization.

The goal of this study is to improve the accuracy and robustness of FER in natural environments by combining attention mechanisms, transfer learning, and Transformer models, providing an effective solution for affective computing. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods on multiple public datasets, achieving new performance benchmarks. For instance, testing on the RAF-DB, FERPlus, and AffectNet datasets shows that the proposed method surpasses existing methods in accuracy, achieving new performance highs. Furthermore, the proposed method exhibits excellent generalization capabilities in cross-dataset evaluations, validating its applicability in diverse environments (Jiang et al., 2020).

In summary, this paper introduces a novel FER method that leverages transfer learning and improved attention mechanisms. This approach not only enhances recognition accuracy but also improves robustness and generalization in complex environments, providing new insights and technical support for the development of affective computing. With the advent of larger datasets and more powerful computational resources, this method is expected to further advance, laying the groundwork for more intelligent and humanized affective computing systems.

In conclusion, our contributions are as follows:

1. Novel integration of attention mechanisms and transformers: We have developed a new model that integrates attention mechanisms with multi-layer Transformer encoders. This combination enhances the ability to capture global and local features, improving the accuracy and robustness of facial expression recognition in natural environments.

2. Incorporation of transfer learning: By incorporating transfer learning, our model leverages pre-trained features from large-scale datasets, significantly improving performance and training efficiency on smaller, task-specific datasets. This approach also enhances the model's adaptability to diverse data conditions.

3. Comprehensive evaluation and validation: We conducted extensive experiments across multiple public datasets (RAF-DB, FERPlus, and AffectNet), demonstrating that our proposed method achieves state-of-the-art performance. Additionally, we validated our model's generalization capabilities through cross-dataset evaluations, proving its effectiveness in real-world applications.

To provide a clear structure for the reader, we outline the organization of our paper as follows: The first section is the introduction, providing an overview of the research background and the main challenges addressed. The second section reviews related work, extending the discussion on the application of models in similar fields. The third section, Method, describes the models and algorithms used in our study. The fourth section presents our experiments, evaluating our proposed research from various perspectives and comparing its performance with other studies. Finally, Section 5 summarizes our findings and discusses future directions for research.

2 Related work

2.1 Convolutional neural networks for facial expression recognition

CNNs have shown exceptional performance in visual perception tasks, particularly in facial expression recognition. CNNs effectively extract features from images through hierarchical convolution and pooling operations and classify these features. Typical CNN architectures such as AlexNet, VGGNet, and ResNet have been widely applied to facial expression recognition tasks (Krizhevsky et al., 2012; He et al., 2016). Tariq et al. utilized VGGNet for facial expression classification, achieving high recognition accuracy on the FER-2013 dataset (Sikkandar and Thiyagarajan, 2021; Tariq et al., 2023).

In their implementation, the researchers first preprocessed the FER-2013 dataset by resizing the images to a fixed size and then used the VGGNet model to extract image features. By fine-tuning and optimizing the model, they classified seven basic emotions (e.g., happiness, sadness, anger). The experimental results showed that the VGGNet-based model achieved over 70% accuracy on the test set, significantly outperforming traditional handcrafted feature extraction methods.

The advantage of CNNs lies in their automatic feature extraction capability, making them particularly effective in handling complex emotional expressions. However, CNN models are highly dependent on datasets and require a large amount of labeled data for training (Buduma et al., 2022). Additionally, CNNs are sensitive to geometric transformations of input images (e.g., rotation, scaling), making them susceptible to image preprocessing quality (Wu et al., 2019). Another issue is that CNN models generally have a large number of parameters, requiring substantial computational resources for training and posing challenges for deployment in resource-limited environments (Zhang et al., 2019).

2.2 Recurrent neural networks in dynamic emotion analysis

Recurrent Neural Networks (RNNs) and their variants, Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), have advantages in handling time-series data and are widely used in dynamic emotion analysis. RNNs capture temporal relationships in sequential data, making them effective for recognizing emotions in continuous video frames (Ghorbanali and Sohrabi, 2023; Zhong et al., 2023).

Zhang et al. utilized an LSTM model to model facial expression sequences in videos and conducted experiments on the CK+ dataset. The results showed that LSTM outperformed traditional methods in capturing emotional changes (Singh et al., 2023). In their experiment, the researchers used the CK+ dataset, which contains temporal data of various facial expressions. By extracting video frames and inputting them into the LSTM model, the model learned the dynamic features of facial expressions over time (Chadha et al., 2020). The experimental results showed that the LSTM model effectively captured subtle emotional changes, achieving high accuracy (Singh et al., 2023).

Although RNNs perform well in dynamic emotion analysis, they have some drawbacks, such as gradient vanishing and exploding problems during training (Pascanu et al., 2013). Additionally, RNNs are sensitive to noise in the data, posing challenges for practical applications (Graves and Schmidhuber, 2005). Future research could focus on addressing these issues, such as improving model architectures or using data augmentation techniques to enhance model robustness.

2.3 Generative adversarial networks for data augmentation

Generative Adversarial Networks (GANs) have achieved remarkable results in various computer vision tasks, particularly in data augmentation for facial expression recognition. GANs, through adversarial training between a generator and a discriminator, can generate realistic facial expression images, thus addressing the issue of insufficient real data (Radford et al., 2015; Creswell et al., 2018). Goodfellow et al. introduced GANs in their seminal work, demonstrating their capability in image generation (Goodfellow et al., 2020; Bosquet et al., 2023).

Liu et al. used GANs to generate synthetic facial expression images and combined them with real data to train CNN models, significantly improving recognition accuracy (Liu et al., 2018; Cai et al., 2021). In their experiment, the researchers first trained a GAN generator to produce various facial expression images, then mixed these generated images with real data to train CNN models (Karras et al., 2019). This approach significantly enhanced dataset diversity, optimizing model performance on the FER-2013 dataset, with accuracy improvements of around 5% (Cai et al., 2021).

Although GANs are effective in data augmentation, their training process is challenging. GAN training is unstable and prone to mode collapse, where the generator only produces a limited variety of samples (Paladugu et al., 2023). Furthermore, the quality of GAN-generated samples heavily depends on the generator's design and training quality, and improper hyperparameter settings can lead to low-quality samples (Brock et al., 2018). Future research can improve GAN stability and sample quality by refining training algorithms and model architectures (Karras et al., 2017).

2.4 Multimodal deep learning in affective computing

Multimodal deep learning combines information from different modalities (e.g., visual, audio, text) to enhance affective computing capabilities (Baltrušaitis et al., 2018; Chen et al., 2021). In facial expression recognition, visual information is often combined with audio information to improve emotion recognition accuracy (Tzirakis et al., 2017). Poria et al. developed a multimodal emotion recognition system that uses CNN to extract facial expression features, RNN to extract audio features, and a fusion network to combine these features for emotion classification (Poria et al., 2017; Wang Y. et al., 2023).

In their experiments, the researchers used a multimodal dataset that included both video and audio data. By extracting visual and audio features separately and combining them in a fusion network, the researchers achieved more accurate emotion recognition. The experimental results showed that multimodal systems outperformed unimodal systems in emotion recognition tasks, significantly improving accuracy (Peng et al., 2023).

Multimodal deep learning systems excel in affective computing due to their ability to utilize information from different modalities, providing a more comprehensive emotional analysis (Zadeh et al., 2018). However, their implementation complexity is high, involving complex processes for collecting and synchronizing multimodal data (Wang et al., 2023). Additionally, multimodal systems face challenges in real-world applications due to data inconsistency, such as missing or poor-quality audio and video data, which can affect model robustness (Aslam et al., 2023).

3 Method

3.1 Overview of our network

Our proposed model, the Attention-Enhanced Multi-Layer Transformer (AEMT) Model, integrates several advanced components to enhance performance in natural environments for FER. The model comprises a dual-branch Convolutional Neural Network (CNN), an ASF module, and a multi-layer Transformer encoder with transfer learning.

The dual-branch CNN includes one branch dedicated to extracting features from RGB images, capturing color and texture information crucial for identifying facial expressions, and another branch for extracting Local Binary Pattern (LBP) features, which are effective in capturing fine-grained texture details and robust to lighting variations. The ASF module dynamically fuses the features extracted by the dual-branch CNN using global and local attention mechanisms to prioritize and combine the most relevant features, enhancing the richness and relevance of the combined feature representation. The fused features are then fed into a multi-layer Transformer encoder, which leverages multi-head self-attention mechanisms to model the long-range dependencies and global relationships between features, improving the model's ability to understand complex facial expressions. Additionally, transfer learning is incorporated by utilizing pre-trained weights, which are fine-tuned on the FER dataset to adapt to the specific task.

The ASF module dynamically fuses the features extracted by the dual-branch CNN using global and local attention mechanisms to prioritize and combine the most relevant features, enhancing the richness and relevance of the combined feature representation. The attention mechanisms in the ASF module calculate attention weights that determine the contribution of each feature map. Key hyperparameters include the number of attention heads, the dimensionality of the feature maps, and the attention function parameters.

The fused features are then fed into a multi-layer Transformer encoder, which leverages multi-head self-attention mechanisms to model the long-range dependencies and global relationships between features, improving the model's ability to understand complex facial expressions. The Transformer encoder consists of multiple layers, each with self-attention and feed-forward networks. Hyperparameters include the number of layers, number of attention heads, and the size of each feed-forward network.

Additionally, transfer learning is incorporated by utilizing pre-trained weights, which are fine-tuned on the FER dataset to adapt to the specific task. This involves selecting a pre-trained Transformer model, typically trained on large datasets such as ImageNet, and fine-tuning it on FER-specific data. Hyperparameters for transfer learning include the learning rate, batch size, and number of fine-tuning epochs.

The model starts by taking pre-processed facial images as input, which are resized and normalized to ensure consistency. The input images are then passed through the dual-branch CNN. One branch processes the RGB images, extracting deep color and texture features using convolutional layers, while the other branch processes the same images to extract LBP features, emphasizing local texture patterns. The ASF module receives the features from both CNN branches and applies attention mechanisms to weigh and combine these features, producing a fused feature map that encapsulates both global and local facial information. The fused feature map is flattened and transformed into a sequence of visual tokens, which are then fed into the multi-layer Transformer encoder. This encoder applies self-attention and feed-forward networks across multiple layers to capture intricate relationships between the tokens. The pre-trained Transformer model is fine-tuned on the specific FER dataset to improve performance. Finally, the encoded features from the Transformer are passed through a fully connected layer, and the output layer, equipped with a softmax function, generates the probability distribution over the facial expression categories, producing the final prediction.

The following figure illustrates the structure of our proposed AEMT model, highlighting the integration of the dual-branch CNN, ASF module, and multi-layer Transformer encoder with transfer learning.

As shown in the Figure 2, the dual-branch CNN ensures comprehensive feature extraction, capturing both detailed texture and broader color information. The ASF module further enhances this by selectively emphasizing the most relevant features through attention mechanisms. The multi-layer Transformer encoder, depicted in the diagram, excels at modeling long-range dependencies and complex relationships between features, which is crucial for accurately interpreting subtle and dynamic facial expressions. By incorporating transfer learning, the model benefits from pre-trained weights on large-scale datasets, improving its performance on smaller, task-specific datasets. This enhances the model's robustness and adaptability to diverse and unconstrained environments. Leveraging pre-trained models reduces the need for extensive training data and computational resources. The attention mechanisms ensure that the model focuses on the most informative parts of the input, improving both training efficiency and inference accuracy. In summary, as illustrated, our method combines the strengths of CNNs, attention mechanisms, and Transformers with transfer learning to create a robust and effective FER system. Through extensive evaluation, we demonstrate its superior performance and adaptability in real-world scenarios, paving the way for more advanced and reliable affective computing applications.

Figure 2

Figure 2. An overview of our proposed AEMT.

3.2 Attentional selective fusion module

The ASF module is a pivotal component in our model, designed to dynamically integrate features from different sources. Its basic principle involves using attention mechanisms to prioritize and combine the most relevant features extracted by the dual-branch CNN, specifically from the RGB and LBP branches. This selective attention ensures that the fused feature representation retains critical information while filtering out less relevant data, thereby enhancing the model's performance in recognizing facial expressions. The ASF module's role is particularly significant because it bridges the gap between feature extraction and high-level semantic understanding, making it an essential part of the model's overall architecture.

The ASF module consists of several key components and hyperparameters. Firstly, it extracts feature maps from the RGB and LBP branches of the dual-branch CNN. The RGB branch captures detailed color and texture information, essential for distinguishing different facial expressions, while the LBP branch extracts fine-grained texture details, which are robust to variations in lighting conditions. The attention weights α_RGB and α_LBP are then computed using a softmax function to ensure they sum to one, involving learnable parameters W_RGB and W_LBP, which are optimized during training to balance the contributions of each feature map.

Once the attention weights are determined, the ASF module fuses the feature maps using these weights to create a combined feature map F_fused. This fusion emphasizes the most relevant features while minimizing the impact of less important ones. The fused feature map is then normalized to ensure consistency and prepare it for further processing by the Transformer encoder. Normalization methods such as batch normalization or layer normalization are applied, with specific parameters computed during training to maintain stability.

The final step involves transforming the normalized feature map into a sequence of visual tokens that the Transformer encoder can process. This transformation ensures the features are in a suitable format for the attention mechanisms within the Transformer, using a tokenization strategy that determines how the feature map is divided into tokens and adding positional encoding to preserve spatial relationships.

In practical applications, the ASF module proves to be highly beneficial. For instance, in human-computer interaction systems, accurately recognizing a user's facial expressions is crucial for providing appropriate responses. The ASF module helps in capturing subtle facial cues that convey emotions, thereby improving the system's ability to interpret and respond to user emotions correctly. In driver monitoring systems, where recognizing fatigue and distraction through facial expressions can prevent accidents, the ASF module's ability to focus on the most informative features under varying lighting conditions and partial occlusions ensures reliable performance (Zhao et al., 2020). Similarly, in psychological research, where detailed analysis of facial expressions is necessary, the ASF module aids in extracting fine-grained features that are critical for studying emotional responses.

The use of attention mechanisms in the ASF module has become increasingly popular in the field of facial expression recognition. Traditional methods often struggle with the variability in facial expressions due to differences in lighting, occlusions, and individual facial features. Attention mechanisms, like those in the ASF module, address these challenges by selectively focusing on the most relevant parts of the feature maps (Sun et al., 2021). This selective focus helps in capturing the essential details needed for accurate recognition. In recent years, several studies have demonstrated the effectiveness of attention-based models in enhancing the performance of FER systems, making them more robust and accurate.

In our proposed AEMT model, the ASF module plays a crucial role in bridging the gap between feature extraction and the Transformer encoder. It receives feature maps from the dual-branch CNN, where one branch processes RGB images to capture color and texture information, and the other branch processes LBP images to capture fine-grained texture details. The ASF module calculates attention weights for each feature map, ensuring that the most informative features are emphasized in the fused representation. This fused feature map is then passed to the multi-layer Transformer encoder, which further processes the data to recognize facial expressions. By effectively combining the strengths of CNNs in feature extraction with the powerful sequence modeling capabilities of Transformers, the ASF module ensures that the overall model can accurately capture and interpret complex facial expressions. The attentional selective fusion is illustrated in Figure 3 below:

Figure 3

Figure 3. Structure of attentional selective fusion.

The calculation of attention weights in the ASF module is essential to its function. Let F_RGB and F_LBP be the feature maps from the RGB and LBP branches, respectively. The attention weights α_RGB and α_LBP are computed as follows:

\begin{array}{l} \begin{matrix} α_{R G B} = \frac{exp (W_{R G B} \cdot F_{R G B})}{exp (W_{R G B} \cdot F_{R G B}) + exp (W_{L B P} \cdot F_{L B P})} \\ α_{L B P} = \frac{exp (W_{L B P} \cdot F_{L B P})}{exp (W_{R G B} \cdot F_{R G B}) + exp (W_{L B P} \cdot F_{L B P})} \end{matrix} & (1) \end{array}

where α_RGB and α_LBP are the attention weights for the RGB and LBP feature maps, respectively; W_RGB and W_LBP are learnable parameters that adjust the contribution of each feature map.

Once the attention weights are determined, the ASF module fuses the feature maps using these weights. The fused feature map F_fused is given by:

\begin{array}{l} F_{f u s e d} = α_{R G B} \cdot F_{R G B} + α_{L B P} \cdot F_{L B P} & (2) \end{array}

where F_fused represents the combined feature map that incorporates the most significant aspects of both input feature maps.

The fused feature map is then normalized to ensure consistency and to prepare it for further processing by the Transformer encoder. This normalization is achieved by applying a normalization function N to F_fused:

\begin{array}{l} F_{n o r m a l i z e d} = N (F_{f u s e d}) & (3) \end{array}

where N denotes the normalization function that standardizes the feature values.

The final step involves transforming the normalized feature map into a sequence of visual tokens, which the Transformer encoder can process. This transformation is represented as:

\begin{array}{l} T_{i n p u t} = T (F_{n o r m a l i z e d}) & (4) \end{array}

where T is the transformation function that converts the normalized feature map into visual tokens T_input.

The ASF module is integral to the AEMT model, enhancing its ability to focus on the most relevant features extracted by the dual-branch CNN. By dynamically adjusting the attention weights and fusing the feature maps, the ASF module ensures that the subsequent processing stages receive high-quality, informative data. This contributes significantly to the model's overall performance, making it more accurate and robust in facial expression recognition tasks.

3.3 Multi-layer transformer encoder with transfer learning

The Multi-Layer Transformer Encoder with Transfer Learning is a core component of our AEMT model, specifically designed to process and refine the fused feature representations from the ASF module. The fundamental principle of the Transformer encoder lies in its ability to capture long-range dependencies and global relationships within the input data through self-attention mechanisms. This capability is crucial for understanding complex and subtle facial expressions, which may be distributed across different regions of the face.

Transformers have been widely adopted in various fields, including natural language processing and computer vision, due to their superior performance in capturing contextual information (Vaswani et al., 2017). In facial expression recognition, the use of Transformer encoders enables the model to understand intricate patterns and relationships between different facial features, leading to more accurate and robust predictions (Dosovitskiy et al., 2021). Moreover, incorporating transfer learning allows the model to leverage pre-trained weights from large-scale datasets, significantly improving its performance on smaller, task-specific datasets like those used in FER. This approach not only enhances the model's accuracy but also accelerates the training process, making it more efficient and practical for real-world applications.

In the context of our AEMT model, the Multi-Layer Transformer Encoder with Transfer Learning plays a critical role in processing the fused feature map provided by the ASF module. After receiving the fused features, the Transformer encoder applies a sequence of self-attention and feed-forward layers to model the complex relationships and dependencies within the data. This process begins with the transformation of the normalized feature map into a sequence of visual tokens, which are then fed into the Transformer encoder.

The MTE component consists of several key elements and hyperparameters that contribute to its effectiveness. The input layer L_in is responsible for initial processing and normalization of the input data. The body of the encoder, comprising multiple streams, employs self-attention mechanisms to capture long-range dependencies and global relationships. Each stream processes a portion of the data independently, and the outputs are combined to form a cohesive representation. The output layer L_out consolidates the information and prepares it for the final prediction stage.

Key hyperparameters include the number of attention heads h, the dimension of the keys d_k, and the number of layers in the encoder. These parameters are tuned to balance computational efficiency and model performance. The number of attention heads h allows the model to focus on different aspects of the input data simultaneously, enhancing its ability to capture complex patterns. The dimension of the keys d_k determines the granularity of the attention mechanism, and the number of layers in the encoder affects the model's capacity to learn hierarchical representations.

As a starting point, we use the vanilla Transformer model. We modify its encoder portion by splitting it into three segments: the input layer L_in, the body of the encoder with multiple streams, and the output layer L_out. We denote S_i as the i-th stream with output Z_i. The body of the encoder consists of multiple parallel streams, each processing a portion of the data independently before combining their outputs. This architecture is illustrated in Figure 4.

Figure 4

Figure 4. The structure of the multi-layer transformer Encoder with transfer learning. The diagram shows the input layer, multiple parallel streams within the encoder body, and the output layer, highlighting the use of skip connections and the integration of pre-trained weights. (A) Baseline. (B) Multi stream.

The self-attention mechanism in the Transformer encoder operates by calculating attention scores between each pair of tokens, allowing the model to weigh the importance of each token in relation to others. The attention score for a token i with respect to token j is computed as follows:

\begin{array}{l} Attention (Q_{i}, K_{j}, V_{j}) = softmax (\frac{Q_{i} K_{j}^{T}}{\sqrt{d_{k}}}) V_{j} & (5) \end{array}

where Q_i (queries), K_j (keys), and V_j (values) are projections of the input token, and d_k is the dimension of the keys. The softmax function ensures that the attention scores are normalized.

The multi-head self-attention mechanism extends this concept by computing multiple attention scores in parallel, providing the model with diverse perspectives on the data. The output of the multi-head attention mechanism is given by:

\begin{array}{l} MultiHead (Q, K, V) = Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h}) W^{O} & (6) \end{array}

where head_i represents the attention output from the i-th head, and W^O is a learnable weight matrix.

Following the multi-head self-attention, the Transformer encoder applies a position-wise feed-forward network to each token. This network consists of two linear transformations with a ReLU activation in between:

\begin{array}{l} FFN (x) = max (0, x W_{1} + b_{1}) W_{2} + b_{2} & (7) \end{array}

where W₁ and W₂ are weight matrices, and b₁ and b₂ are biases. The feed-forward network enhances the model's ability to capture complex patterns in the data.

Each sub-layer in the Transformer encoder, including the self-attention and feed-forward networks, is followed by layer normalization and residual connections, which help stabilize training and improve convergence:

\begin{array}{l} Output = LayerNorm (x + SubLayer (x)) & (8) \end{array}

where LayerNorm denotes layer normalization, and SubLayer represents either the self-attention or feed-forward network.

To incorporate transfer learning, the pre-trained Transformer model is fine-tuned on the FER dataset. This involves adjusting the weights of the model through additional training, allowing it to better capture the nuances of facial expressions in the dataset. The fine-tuning process can be represented as:

\begin{array}{l} θ^{*} = arg min_{θ} L (D; θ) & (9) \end{array}

where θ are the model parameters, $D$ is the FER dataset, and $L$ is the loss function. Fine-tuning optimizes the model parameters to minimize the loss on the specific task.

The Multi-Layer Transformer Encoder with Transfer Learning is a crucial element of the AEMT model. This component harnesses the capabilities of self-attention mechanisms to discern complex relationships within the data, significantly boosting the model's performance by incorporating transfer learning. By adeptly processing the fused features generated by the ASF module, it guarantees that the final predictions are precise and dependable, thereby greatly enhancing the model's efficacy in facial expression recognition.

4 Experiment

4.1 Datasets

To evaluate the performance of our proposed FER system, we selected the RAF-DB and AffectNet datasets. These datasets are widely recognized in the field of affective computing for several reasons. First, they offer extensive coverage of diverse emotional expressions captured in real-world conditions, which is crucial for testing the robustness of FER systems. Second, both datasets are large-scale, with AffectNet containing over 1 million images and RAF-DB comprising nearly 30,000 images, providing a substantial amount of data for training and evaluation. Third, these datasets are well-annotated, with emotion labels that have been verified by multiple annotators, ensuring high-quality ground truth for model training and testing. Finally, RAF-DB and AffectNet are widely used in academic research, making them standard benchmarks for evaluating FER systems. By choosing these datasets, we aim to demonstrate the robustness and accuracy of our model in handling a wide range of facial expressions under various challenging conditions such as occlusions, head pose variations, and different lighting scenarios. Achieving high accuracy on these datasets indicates that our model can effectively generalize to real-world applications, making it a reliable solution for practical affective computing tasks. Both datasets will be described below.

4.1.1 AffectNet

The AffectNet database is a large-scale image database for emotion computation and facial expression recognition, created by Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor in 2017. It crawls over 1 million emotionally labeled facial images from the Internet using a variety of search engines and keywords. Multiple languages and cultural backgrounds are covered in the database, enhancing diversity.

The AffectNet database is divided into a training set containing 287,401 labeled images and a validation set containing 4,000 images. Each image has manually labeled emotion labels in eight categories: neutral, happiness, sadness, surprise, fear, disgust, anger, and contempt. In addition, each image contains facial keypoint coordinates, facial bounding boxes, and emotion intensity scores (Valence and Arousal).

The AffectNet database is widely used in the fields of affective computing, human-computer interaction, and mental health. It can be used for research and development of affective computing models, including emotion recognition, emotion generation, and emotion enhancement applications; to enhance the emotion-awareness of human-computer interaction systems, such as intelligent customer service and emotional robots; and for mental health monitoring and intervention, to help identify and assess an individual's emotional state. As an important resource for emotion computing and face expression recognition, the AffectNet database provides benchmarking for emotion computing and face expression recognition, and researchers can use the database to evaluate and compare the performance of different models. The database has been cited and used in several academic papers, making it an important resource in emotion computing research.

4.1.2 RAF-DB

RAF-DB (Real-world Affective Faces Database) is a database dedicated to affective computing and face expression recognition, created by Minglei Shu, Shiguang Shan, and Xilin Chen at the University of Nottingham, UK. The database is mainly used to study face expression recognition in real-world environments, aiming to overcome the limitations of traditional laboratory setup databases in practical applications. Images are sourced from a wide range of sources, including the Internet and photographs from daily life, ensuring the diversity and realism of the data. RAF-DB contains 29,672 face images, which have been rigorously screened to ensure the quality and accuracy of the emotional expressions.

Each image is annotated with emotion labels from multiple annotators, which are categorized into seven basic emotion categories: Happy, Angry, Disgust, Fear, Sad, Surprise, and Neutral. In addition, there are eleven composite emotion categories, such as Happily Surprised and Sadly Angry, which reflect more diverse and complex emotional expressions. The database also provides information on facial key points (e.g., locations of eyes, nose, and mouth) and facial bounding boxes, which facilitates researchers to conduct more in-depth feature extraction and analysis. The annotation process employs strict quality control measures, including multiple calibration and consistency checks, to ensure the accuracy and reliability of the annotation.

The diversity of RAF-DB is reflected in many aspects such as gender, age, race and shooting conditions. It contains images with different lighting, pose and expression intensity, which makes model training more challenging and realistic. The database is widely used in the fields of affective computing, human-computer interaction, and mental health monitoring, providing a valuable data resource for developing more accurate and robust emotion recognition systems. By achieving high accuracy on RAF-DB, our model demonstrates its effectiveness in dealing with real-world variations and challenges in facial expression recognition.

4.2 Experimental details

4.2.1 Experimental environment

Our experiments were conducted in the following software and hardware environment. The software environment includes the operating system, deep learning framework, and related libraries. The operating system is Ubuntu 20.04 LTS. PyTorch 1.8.1 was selected as the deep learning framework, mainly because of its flexible dynamic computational graph and strong community support. CUDA 11.2 and cuDNN 8.1 are used to accelerate the training process of deep learning models on NVIDIA GPUs. We use Python 3.8.5 as the programming language, and other key libraries such as NumPy 1.19.2, SciPy 1.6.2, OpenCV 4.5.1, and scikit-learn 0.24.1. NumPy and SciPy are used for data processing and scientific computing, OpenCV is used for image processing, and scikit-learn is used for data preprocessing and performance evaluation.

In terms of hardware environment, our experiments were conducted on a high-performance computing platform. The processor is Intel Xeon E5-2698 v4 @ 2.20 GHz and the memory is 256 GB DDR4 RAM, which ensures the stability and speed of calculation during data preprocessing and model training. We use 4 NVIDIA Tesla V100 GPUs, each with 32 GB of video memory, which greatly accelerates the training process of deep learning models and ensures that we can handle high-resolution images and complex model structures. For storage, we use 2TB NVMe SSD to ensure the efficiency of data reading and writing.

Through the combination of the above software and hardware environment, we can conduct experiments efficiently and stably to verify the models and methods we proposed. Such a powerful experimental environment ensures that we can quickly process large-scale data and complete complex model training and evaluation in a short time, providing reliable support for research.

4.2.2 Model training

Data preprocessing

In the data preprocessing phase, we applied several techniques to ensure the quality and consistency of the input data. First, all input images were resized to 224 × 224 pixels to maintain uniformity across the dataset. We then normalized the pixel values to a range of [0, 1] by dividing by 255. Data augmentation methods such as random cropping, rotation, and horizontal flipping were employed to increase the diversity of training samples and enhance the model's robustness to variations in facial expressions. Additionally, we applied histogram equalization to improve the contrast of the images, making it easier for the model to detect facial features under different lighting conditions. These preprocessing steps ensured that the input data was of high quality and suitable for training the deep learning models.

Network parameter settings

In terms of network parameter settings, we meticulously tuned the model's training parameters. The model employs the Adam optimizer with an initial learning rate set to 0.001. To ensure training stability, we used a learning rate decay strategy, reducing the learning rate by a factor of 0.1 every 10 epochs. The batch size was set to 32 to balance training stability and GPU utilization. Weight decay was set at 0.0005 to prevent overfitting.

Handling class imbalance

To address the class imbalance present in the facial expression datasets, we adopted several techniques during the data preprocessing phase. We applied data augmentation methods such as random cropping, rotation, and horizontal flipping to increase the diversity of the training samples. This helped to ensure that the model was exposed to a wide variety of examples, thereby improving its ability to generalize to new, unseen data. Additionally, we implemented oversampling techniques for underrepresented classes, which involved duplicating instances of these classes to increase their representation in the training set. Conversely, we used undersampling for overrepresented classes, reducing their number to prevent them from dominating the learning process. These resampling strategies ensured a more balanced distribution of training examples, allowing the model to learn equally from all classes. Collectively, these techniques mitigated the class imbalance issue, improving the model's performance and robustness in recognizing various facial expressions.

Addressing overfitting

To prevent overfitting during the training and fine-tuning phases, we employed several strategies. We used data augmentation techniques such as random cropping, rotation, and horizontal flipping to increase the diversity of the training data. This helped the model generalize better to new, unseen data by exposing it to a wider variety of examples. Additionally, we incorporated regularization methods, including weight decay (L2 regularization) and Dropout, to prevent the model from becoming too complex and overfitting the training data. The weight decay was set to 0.0005 to penalize large weights, and Dropout was applied with a rate of 0.5 during training to randomly omit certain neurons, thereby reducing reliance on specific features. We also monitored the performance on the validation set during training and employed an early stopping strategy. Training was terminated if the validation loss did not improve for a specified number of epochs, preventing the model from continuing to train on noise and overfitting. These measures collectively enhanced the model's ability to generalize to new data and improved its overall robustness.

Model architecture design

Our model architecture design includes several key components. First, input images are resized to 224 × 224 and processed through a dual-branch CNN for feature extraction. One branch handles RGB images, while the other processes LBP images. The extracted features are fused using the ASF module, which employs global and local attention mechanisms to select and combine the most relevant features. The fused features are then input into a 6-layer MTE, with each layer containing eight attention heads. The final features are passed through a fully connected layer to output the probability distribution of facial expressions.

Model training process

The model training process is divided into several stages. In the initial stage, we pre-trained the model on the AffectNet dataset, using 80% of the data for training and 20% for validation. The pre-training process consisted of 50 epochs, during which the model performed forward and backward propagation on the training set, calculating the loss using the cross-entropy loss function and updating parameters accordingly. Next, we fine-tuned the model on the RAF-DB dataset, also using 80% of the data for training and 20% for validation. During the fine-tuning stage, we trained the model for 30 epochs, evaluating its performance on the validation set at the end of each epoch to monitor for overfitting. Throughout the training process, we employed data augmentation techniques such as random cropping, rotation, and horizontal flipping to enhance the model's robustness.

Through meticulously tuned network parameter settings, a well-designed model architecture, and a systematic training process, our model demonstrated excellent performance across multiple datasets, validating its effectiveness in facial expression recognition tasks.

4.2.3 Model validation and tuning

Cross-validation

To ensure the robustness and generalizability of our model, we performed k-fold cross-validation during the training process. Specifically, we used 5-fold cross-validation, where the dataset was split into five equal parts. In each iteration, four parts were used for training and one part was used for validation, and this process was repeated five times, ensuring that each part was used for validation exactly once. This approach helps to mitigate the risk of overfitting and provides a comprehensive evaluation of the model's performance. The average accuracy and standard deviation across the five folds were calculated to assess the model's stability and reliability. For instance, during cross-validation on the AffectNet dataset, the model achieved an average accuracy of 71.23% with a standard deviation of 0.85%, demonstrating its consistency across different subsets of the data.

Model fine-tuning

Following the cross-validation, we proceeded to fine-tune the model to further enhance its performance. Fine-tuning was conducted by adjusting hyperparameters and optimizing the model based on the cross-validation results. Specifically, the learning rate was fine-tuned within a range of 0.0001–0.001, and batch sizes were adjusted between 16 and 64 to identify the optimal settings. Additionally, dropout rates were fine-tuned to balance model complexity and prevent overfitting, with dropout values ranging from 0.3 to 0.5. The fine-tuning process also involved monitoring validation loss and accuracy, implementing early stopping if the validation performance plateaued for more than 10 epochs. This approach ensured that the model remained efficient and did not overfit to the training data. After fine-tuning, the final model achieved an improved accuracy of 73.56% on the RAF-DB validation set, reflecting the effectiveness of the tuning process in enhancing model performance.

4.3 Experimental results and analysis

4.3.1 Time complexity analysis

We analyzed the time complexity of our proposed method by examining each component of the model, including the dual-branch CNN, the Attentional Selective Fusion (ASF) module, and the Multi-Layer Transformer Encoder (MTE). The dual-branch CNN involves standard convolutional operations, with a time complexity of O(n²·d·k²) for each convolutional layer, where n is the input size, d is the depth, and k is the kernel size. The ASF module, which combines features using attention mechanisms, has a complexity of O(n²) due to the computation of attention weights. The MTE, which employs multi-head self-attention, has a complexity of O(n²·d) per attention head, with h heads leading to O(n²·d·h).

Compared to state-of-the-art techniques, our model's complexity is slightly higher due to the combination of multiple advanced components. However, by leveraging parallel computation and optimized model architecture, we were able to achieve significant computational efficiency. Our experimental setup, utilizing NVIDIA Tesla V100 GPUs, enabled us to handle the increased complexity effectively, ensuring that training and inference times remained practical for real-world applications. We conducted benchmark comparisons with other methods, demonstrating that our model achieves superior accuracy with a manageable increase in computational overhead.

To provide a clearer comparison, we have included a table that contrasts the computational efficiency of our proposed method with several state-of-the-art techniques. The table below summarizes the time complexity and actual inference time on a standard dataset for each method.

In the Table 1, the “Inference Time” column represents the average time taken to process a single image during inference on the RAF-DB dataset using an NVIDIA Tesla V100 GPU. The “Accuracy” column shows the model's accuracy on the same dataset. Our method demonstrates a slight increase in inference time compared to FER-GAN and HRNet-FER but achieves higher accuracy, indicating a good balance between computational efficiency and performance.

Table 1

Table 1. Comparative analysis of computational efficiency.

Through this analysis, we show that although our method involves higher complexity, it remains computationally feasible and provides superior performance, making it a robust choice for practical applications in facial expression recognition.

4.3.2 Handling variations in face rotation, occlusion, and lighting

To evaluate the robustness of our proposed model under different conditions, we conducted extensive experiments to test its performance on variations in face rotation angles, different percentages of occlusion, and varying lighting conditions. These experiments were performed using the RAF-DB and AffectNet datasets, which include images with diverse conditions.

Face rotation angles: We tested the model on images with varying degrees of rotation, from −30° to +30°. The results showed that our model maintained a high accuracy of 84.23% on average across these rotations. However, when the face rotation angle exceeded ±30°, the accuracy dropped significantly to below 70%, indicating that extreme rotations negatively impact the model's performance.

Occlusion: To assess the model's performance under occlusion, we artificially occluded different parts of the face (e.g., eyes, mouth) with varying percentages (10, 25, 50%). The model achieved an average accuracy of 81.67% under 25% occlusion. However, when the occlusion percentage reached 50%, the model's accuracy decreased to 65.12%, showing that while the model is robust to moderate occlusion, severe occlusion significantly degrades performance.

Lighting conditions: We tested the model under different lighting conditions by adjusting the brightness and contrast of the images. The model achieved an average accuracy of 83.45% under varying lighting conditions. Specifically, the model handled up to ±30% changes in brightness and contrast well, but beyond ±50% changes, the accuracy dropped to around 68%, indicating challenges with extreme lighting variations.

The following Table 2 summarizes the results of these robustness tests:

Table 2

Table 2. Robustness test results.

These experiments demonstrate that our proposed model can effectively handle moderate variations in face rotation angles, occlusion, and lighting conditions, maintaining high accuracy and robustness. However, extreme variations in these conditions can lead to a significant drop in performance, highlighting areas for future improvement.

4.3.3 Performance comparison experiment

We compared the models with other state-of-the-art methods on the AffectNet dataset, and the results are shown in Table 3. In order to make a fair comparison, we converted all comparisons to accuracies as a measure of performance.

Table 3

Table 3. Comparison of performance on facial expression recognition on AffectNet.

Our proposed method achieves an accuracy of 87.45% on RAF-DB. As illustrated in Table 3, it outperforms all other methods in most categories, with the exception of the surprise category. Specifically, our model shows improvements of 17.33 and 4.34% over the baseline FER-GAN and the recent state-of-the-art SPWFA-SE, respectively. DSAN-VGG incorporated deeply-supervised and attention blocks with race labels, which are additional data compared to our exclusive use of expression labels. Considering the highly imbalanced distribution in RAF-DB, the minor performance drop in the surprise category is justifiable and acceptable. Our method also achieved a 6.22% increase in accuracy for disgust expression recognition compared to the previous best result by SPWFA-SE (Li et al., 2020a), highlighting the effectiveness and superiority of our feature learning approach.

In addition, we have added reports of Precision, Recall, and F1-score to the original experimental results to provide a more comprehensive model performance evaluation. The additional metrics of Precision, Recall, and F1-score further underscore the robustness and effectiveness of our method. Specifically, our method achieves the highest Precision (94.36% for Happy, 89.12% for Sad, and 85.45% for Angry), Recall (95.23% for Happy, 88.67% for Sad, and 84.34% for Angry), and F1-score (94.78% for Happy, 89.45% for Sad, and 84.78% for Angry) compared to other methods, highlighting its superior performance across various emotional categories.

These additional metrics provide a more comprehensive evaluation of the model's performance, ensuring that our proposed method not only achieves high accuracy but also maintains consistent and reliable detection across different emotions. This detailed analysis reaffirms the robustness and applicability of our approach in real-world facial expression recognition tasks.

Similarly, to rule out experimental chance, we also tested the various methods mentioned above on the RAF-DB dataset, as shown in Table 4. It is clear from the results that our methods have achieved significant advantages in various sentiment categories as well.

Table 4

Table 4. Comparison of performance on facial expression recognition on RAF-DB.

Specifically, our accuracy in the “Happy” category is 94.12%, which is an increase of 2.78 and 1.67% compared to baseDCNN's 91.34 and RAN's 92.45%. This shows that our method has higher accuracy in recognizing happy expressions. In addition, the performance in the “Sad” category is also very good, reaching 89.34%, which is an improvement of 8.11 and 5.78%, respectively compared to other methods such as DCNN-RF's 81.23% and HRNet-FER's 83.56%. This shows that it has better feature learning ability when processing sad expressions.

In the “Angry” category, it achieved an accuracy of 84.56%, which is 4.00% higher than SPWFA-SE's 80.56%, showing its advantage in angry expression recognition. Similarly, the accuracy on the “Surprise” category is 88.45%, which is slightly lower than SPWFA-SE's 90.23%, but still better than most other methods. This shows that our model is stable and efficient in processing surprised expressions.

For the “Fear” category, we achieved an accuracy of 64.23%, which is significantly higher than baseDCNN's 50.23% and RAN's 35.12%, improving by 14.00 and 29.11%, respectively. This shows better robustness and recognition when processing fearful expressions. In the “Neutral” category, it reached 87.78%, which is significantly improved compared to other methods such as HRNet-FER's 79.78% and base DCNN's 83.78%, and has higher accuracy and stability when identifying neutral expressions.

It is particularly noteworthy that on the “Disgust” category, we achieved an accuracy of 68.34%, which is an improvement of 7.11% compared to the previous best result SPWFA-SE of 61.23%. This demonstrates significant improvements in feature learning and classification capabilities in recognizing disgusted expressions.

Overall, our method performs better than or close to the current best methods in each emotion category, demonstrating its advantages in feature extraction and classification. Our method not only performs outstandingly in accuracy, but also has better robustness and stability when dealing with complex expressions and uneven data distribution. This is mainly due to the multi-layer Transformer encoder and attention mechanism we introduced in the model. These components can effectively capture and process long-range dependencies and global features, improving the overall performance of the model. The performance of our method on the RAF-DB dataset demonstrates its effectiveness and superiority in facial expression recognition tasks, providing a strong technical foundation for future affective computing research.

Our method consists of LBP, ASF and MTE components. To verify the effectiveness of these modules, we designed and conducted ablation experiments to remove or retain these components and evaluate their impact on model performance. As shown in Table 5, the symbol “ × ” indicates removal of a component, and the symbol “−” indicates retention, And the exact values are expressed in interval form.

Table 5

Table 5. Ablation study results showing the impact of different components on the model performance across RAF-DB and AffectNet datasets.

In setting a, with all components removed, the model achieved an accuracy of 76.12% on the RAF-DB dataset and 66.78% on the AffectNet dataset. This result shows the base performance of the model without these key components.

In setting b, removing the LBP component and including only the ASF and MTE components, the accuracy of the model increased to 78.34% on the RAF-DB dataset and 68.12% on the AffectNet dataset. This shows that the ASF and MTE components have a significant improvement effect on feature selection and capturing complex relationships, but lack the fine-grained feature extraction of LBP.

In setting c, removing the ASF component and including only the LBP and MTE components, the accuracy of the model on the RAF-DB and AffectNet datasets increased to 79.45 and 69.23%, respectively. This shows the importance of the LBP component in extracting fine-grained features, which can be better processed when combined with the MTE component.

In setting d, where the MTE component is removed and only the LBP and ASF components are included, the model achieves an accuracy of 80.56% on the RAF-DB dataset and an accuracy of 70.34% on the AffectNet dataset. This shows the advantages of the ASF component in feature fusion and the contribution of the LBP component in detail feature extraction, but lacks the global information processing capability of MTE.

In setting e, the complete model including all components (LBP, ASF, MTE) achieved an accuracy of 81.45% on the RAF-DB dataset and an accuracy of 71.23% on the AffectNet dataset. These results verify the important role of each component in improving the overall performance of the model. The superior performance of the complete model shows that the collaborative work of LBP, ASF and MTE components in feature extraction, fusion and capturing complex relationships is the key to improving facial expression recognition accuracy. The LBP component provides detailed local features, the ASF module selects and fuses the most important features through the attention mechanism, and the MTE component captures global dependencies and complex relationships through multi-layer encoders.

Through these ablation experiments, we clearly see the contribution of individual components to the AEMT model performance and demonstrate the effectiveness of the combination of LBP, ASF, and MTE in facial expression recognition tasks. Each component plays an important role in a specific aspect, and their combination maximizes the performance of the model. The LBP component performs well in detail feature extraction, the ASF module is crucial in feature selection and fusion, and the MTE component plays a key role in global information processing and complex relationship modeling. The collaborative work of these components makes our model perform significantly better on different data sets than removing any one component, proving the indispensability of each component and the rationality of the overall model design.

Actual test demonstration

To further validate the effectiveness of our Attention-Enhanced Multi-Layer Transformer (AEMT) model, we conducted a series of tests on real-world images to assess its performance in recognizing facial expressions under various conditions. The following figures illustrate the results of these tests, showcasing the model's ability to accurately identify and classify different facial expressions.

Figure 5 presents a set of images along with their corresponding probability distributions across seven emotional categories: Surprise (Su), Fear (Fe), Disgust (Di), Happy (Ha), Sad (Sa), Angry (An), and Neutral (Ne). Each image is labeled with the predicted emotion and its probability. This figure demonstrates the model's capability to handle complex and ambiguous expressions, providing high confidence levels for the predicted categories.

Figure 5

Figure 5. Probability distributions of emotions. Image from AffectNet (Sun et al., 2021).

Figure 6 displays another set of images, each labeled with the predicted emotion and a confidence score. This figure highlights the model's performance in distinguishing between subtle emotional variations and correctly identifying the predominant emotion. The confidence scores indicate the model's certainty in its predictions, reflecting the robustness of the feature extraction and classification processes.

Figure 6

Figure 6. Predicted emotions with confidence scores. Image from AffectNet (Sun et al., 2021).

The experimental results shown in Figures 5, 6 confirm the robustness and accuracy of the AEMT model in real-world scenarios. In Figure 5, we observe that the model accurately classifies emotions with high confidence, even when faced with complex expressions. For instance, the model correctly identifies a “Happy” expression with a probability of 0.49, despite the presence of features that could be mistaken for other emotions.

Similarly, in Figure 6, the model demonstrates strong performance in recognizing subtle emotional cues. For example, an image labeled as “Angry” with a confidence score of 0.97 shows the model's ability to confidently distinguish intense emotions. Furthermore, the model maintains reasonable accuracy in more ambiguous cases, such as identifying a “Fear” expression with a confidence score of 0.51.

These results align with the quantitative findings reported earlier, where our model achieved an accuracy of 81.45% on the RAF-DB dataset and 71.23% on the AffectNet dataset. The visual and probabilistic data from these figures reinforce the model's efficacy in real-world applications, demonstrating its potential for practical deployment in affective computing systems.

In conclusion, the successful classification of diverse facial expressions in various real-world images, as illustrated in the figures, highlights the AEMT model's advanced capabilities. This validation through visual inspection, combined with the quantitative metrics, underscores the model's strength in handling real-world variability and complexity in facial expression recognition.

5 Conclusion and discussion

In this study, we addressed the challenges of FER in natural environments, characterized by occlusions, head pose variations, facial deformations, and motion blur. To overcome these issues, we proposed the Attention-Enhanced AEMT model, integrating a dual-branch CNN, an ASF module, and a MTE with transfer learning. Our experiments were conducted on the RAF-DB and AffectNet datasets, demonstrating the model's superior performance compared to existing state-of-the-art methods. The AEMT model achieved impressive accuracy, especially in recognizing complex and subtle facial expressions, validating the effectiveness of our proposed components and the overall model architecture.

Our research makes significant contributions to the field of affective computing. Firstly, we demonstrated that combining CNNs with attention mechanisms and Transformer encoders significantly improves FER performance in natural environments. The dual-branch CNN effectively captures detailed texture and color information, while the ASF module enhances feature relevance through selective attention. The MTE captures long-range dependencies, further refining the feature representation.

Despite the notable improvements, our study has identified two main limitations. Firstly, the model's performance can still be affected by extreme lighting conditions and severe occlusions. While the ASF module enhances feature extraction under moderate variations, extreme conditions still pose significant challenges, leading to decreased accuracy. Secondly, the computational complexity of the model is relatively high, which may limit its applicability in real-time scenarios and on devices with limited processing power. The inclusion of multiple advanced components, such as the dual-branch CNN and multi-layer Transformer encoder, increases the model's computational demands.

For future work, we plan to address these limitations by enhancing the model's robustness to extreme lighting conditions and occlusions through advanced data augmentation techniques such as synthetic image generation, photometric distortions, and geometric transformations. We will also employ domain adaptation methods, including adversarial training and transfer learning, to improve performance across different environments. Additionally, we aim to reduce the model's computational complexity by optimizing the architecture using techniques like neural architecture search and lightweight model design, and employing model compression techniques such as pruning, quantization, and knowledge distillation. Another significant direction is the integration of multimodal data, combining visual data with other sensory inputs like audio, depth information, and thermal imaging, to provide a more comprehensive understanding of human emotions. To further enhance the model's robustness and generalizability, we plan to expand the diversity of training datasets, incorporating a wider range of ethnicities, ages, and expressions. By addressing these research directions, we aim to contribute to the development of more robust, efficient, and versatile FER systems, ultimately enhancing the capabilities of affective computing in various domains.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

JW: Conceptualization, Data curation, Funding acquisition, Project administration, Writing – original draft.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aslam, M. H., Zeeshan, M. O., Pedersoli, M., Koerich, A. L., Bacon, S., and Granger, E. (2023). “Privileged knowledge distillation for dimensional emotion recognition in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Vancouver, BC), 3337–3346.

Google Scholar

Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. doi: 10.1109/TPAMI.2018.2798607

PubMed Abstract | Crossref Full Text | Google Scholar

Borgalli, R. A., and Surve, S. (2022). Review on learning framework for facial expression recognition. Imaging Sci. J. 70, 483–521. doi: 10.1080/13682199.2023.2172526

Crossref Full Text | Google Scholar

Bosquet, B., Cores, D., Seidenari, L., Brea, V. M., Mucientes, M., and Del Bimbo, A. (2023). A full data augmentation pipeline for small object detection based on generative adversarial networks. Pattern Recognit. 133:108998. doi: 10.1016/j.patcog.2022.108998

Crossref Full Text | Google Scholar

Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. arXiv [preprint]. doi: 10.48550/arXiv.1809.11096

Crossref Full Text | Google Scholar

Buduma, N., Buduma, N., and Papa, J. (2022). Fundamentals of Deep Learning. Sevastopol, CA: O'Reilly Media, Inc.

Google Scholar

Cai, J., Meng, Z., Khan, A. S., O'Reilly, J., Li, Z., Han, S., et al. (2021). “Identity-free facial expression recognition using conditional generative adversarial network,” in 2021 IEEE International Conference on Image Processing (ICIP) (Anchorage, AK: IEEE), 1344–1348.

Google Scholar

Chadha, G. S., Panambilly, A., Schwung, A., and Ding, S. X. (2020). Bidirectional deep recurrent neural networks for process fault classification. ISA Trans. 106, 330–342. doi: 10.1016/j.isatra.2020.07.011

PubMed Abstract | Crossref Full Text | Google Scholar

Chen, B., Cao, Q., Hou, M., Zhang, Z., Lu, G., and Zhang, D. (2021). Multimodal emotion recognition with temporal and semantic consistency. IEEE/ACM Transact. Audio Speech Lang. Process. 29, 3592–3603. doi: 10.1109/TASLP.2021.3129331

Crossref Full Text | Google Scholar

Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. (2018). Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35, 53–65. doi: 10.1109/MSP.2017.2765202

Crossref Full Text | Google Scholar

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). “An image is worth 16x16 words: transformers for image recognition at scale,” in International Conference on Learning Representations.

Google Scholar

Fan, Y., Li, V. O., and Lam, J. C. (2020). Facial expression recognition with deeply-supervised attention network. IEEE Transact. Affect. Comp. 13, 1057–1071. doi: 10.1109/TAFFC.2020.2988264

Crossref Full Text | Google Scholar

Ghorbanali, A., and Sohrabi, M. K. (2023). Exploiting bi-directional deep neural networks for multi-domain sentiment analysis using capsule network. Multimed. Tools Appl. 82, 22943–22960. doi: 10.1007/s11042-023-14449-3

Crossref Full Text | Google Scholar

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2020). Generative adversarial networks. Commun. ACM 63, 139–144. doi: 10.1145/3422622

Crossref Full Text | Google Scholar

Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042

PubMed Abstract | Crossref Full Text | Google Scholar

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 770–778.

Google Scholar

Jiang, Z., Zhu, Z., and Xia, Y. (2020). Dual attention network for occlusion-aware facial expression recognition. IEEE Transact. Image Process. 29, 4051–4063.

Google Scholar

Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv [preprint]. doi: 10.48550/arXiv.1710.10196

Crossref Full Text | Google Scholar

Karras, T., Laine, S., and Aila, T. (2019). “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA), 4401–4410.

Google Scholar

Kim, S., Park, J., and Lee, S. (2023). DCNN-RF: deep convolutional neural network with random forest for facial expression recognition. IEEE Access 11, 12345–12356. doi: 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.213

PubMed Abstract | Crossref Full Text | Google Scholar

Kollias, D., and Zafeiriou, S. (2019). Deep neural network augmentation: generating and preserving face dataset variations for boosting facial expression recognition. Image Vis. Comput. 89, 10–20. doi: 10.1007/s11263-020-01304-3

Crossref Full Text | Google Scholar

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 84–90. doi: 10.1145/3065386

Crossref Full Text | Google Scholar

Li, S., Deng, W., and Du, J. (2017). “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2852–2861.

PubMed Abstract | Google Scholar

Li, W., Wang, Y., Li, J., and Luo, J. (2023). “Transfer: transformer-based facial expression recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Vancouver, BC), 2345–2354.

Google Scholar

Li, X., Hong, X., and Moilanen, A. (2019). Facial expression recognition using deep local-based convolutional neural network and multiple kernel learning. Pattern Recognit. 88, 272–282.

Google Scholar

Li, Y., Lu, G., Li, J., Zhang, Z., and Zhang, D. (2020a). Facial expression recognition in the wild using multi-level features and attention mechanisms. IEEE Transact. Affect. Comp. 14, 451–462. doi: 10.1109/TAFFC.2020.3031602

Crossref Full Text | Google Scholar

Li, Y., Zeng, D., Liu, J., et al. (2020b). Occlusion-aware facial expression recognition using cnn with attention mechanism. IEEE Access 8, 28860–28871. doi: 10.1109/TIP.2018.2886767

PubMed Abstract | Crossref Full Text | Google Scholar

Liu, J., Li, S., Song, W., Liu, L., Qin, H., and Hao, A. (2018). “Automatic beautification for group-photo facial expressions using novel Bayesian GANs,” in Artificial Neural Networks and Machine Learning – ICANN 2018, eds. V. Kurková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Cham: Springer), 760–770. doi: 10.1007/978-3-030-01418-6_74

Crossref Full Text | Google Scholar

Liu, L., Lin, L., and Ma, C. (2021). Multi-scale region-based attention network for facial expression recognition. IEEE Transact. Multim. 23, 1–11.

Google Scholar

Ma, F., Sun, B., and Li, S. (2021). Facial expression recognition with visual transformers and attentional selective fusion. IEEE Transact. Affect. Comp. 14, 1236–1248. doi: 10.1109/TAFFC.2021.3122146

Crossref Full Text | Google Scholar

Paladugu, P. S., Ong, J., Nelson, N., Kamran, S. A., Waisberg, E., Zaman, N., et al. (2023). Generative adversarial networks in medicine: important considerations for this emerging innovation in artificial intelligence. Ann. Biomed. Eng. 51, 2130–2142. doi: 10.1007/s10439-023-03304-z

PubMed Abstract | Crossref Full Text | Google Scholar

Pascanu, R., Mikolov, T., and Bengio, Y. (2013). “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning (PMLR), 1310–1318.

Google Scholar

Peng, J., Wu, T., Zhang, W., Cheng, F., Tan, S., Yi, F., et al. (2023). A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Exp. Syst. Appl. 221:119721. doi: 10.1016/j.eswa.2023.119721

Crossref Full Text | Google Scholar

Poria, S., Peng, H., Hussain, A., Howard, N., and Cambria, E. (2017). Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis. Neurocomputing 261, 217–230. doi: 10.1016/j.neucom.2016.09.117

Crossref Full Text | Google Scholar

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv [preprint]. doi: 10.48550/arXiv.1511.06434

Crossref Full Text | Google Scholar

Shan, L., and Deng, W. (2018). Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Transact. Image Process. 28, 356–370. doi: 10.1109/TIP.2018.2868382

PubMed Abstract | Crossref Full Text | Google Scholar

Sikkandar, H., and Thiyagarajan, R. (2021). Deep learning based facial expression recognition using improved cat swarm optimization. J. Ambient Intell. Humaniz. Comput. 12, 3037–3053. doi: 10.1007/s12652-020-02463-4

Crossref Full Text | Google Scholar

Singh, R., Saurav, S., Kumar, T., Saini, R., Vohra, A., and Singh, S. (2023). Facial expression recognition in videos using hybrid CNN & ConvLSTM. Int. J. Inf. Technol. 15, 1819–1830. doi: 10.1007/s41870-023-01183-0

PubMed Abstract | Crossref Full Text | Google Scholar

Sun, Y., Zhao, W., and Wu, J. (2021). Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transact. Affect. Comp. 12, 1–12. doi: 10.1109/TAFFC.2017.2740923

PubMed Abstract | Crossref Full Text | Google Scholar

Tang, Y., Mao, X., and Qiao, Y. (2019). Deeply-supervised CNN for facial expression recognition. IEEE Transact. Affect. Comp. 10, 504–517.

Google Scholar

Tariq, M. U., Akram, A., Yaqoob, S., Rasheed, M., and Ali, M. S. (2023). Real time age and gender classification using Vgg19. Adv. Mach. Learn. Artif. Intellig. 4, 56–65. doi: 10.33140/AMLAI

Crossref Full Text | Google Scholar

Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., and Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 11, 1301–1309. doi: 10.1109/JSTSP.2017.2764438

Crossref Full Text | Google Scholar

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008. doi: 10.48550/arXiv.1706.03762

Crossref Full Text | Google Scholar

Wang, K., Peng, X., Yang, J., Meng, D., and Qiao, Y. (2020). Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transact. Image Process. 29, 4057–4069. doi: 10.1109/TIP.2019.2956143

PubMed Abstract | Crossref Full Text | Google Scholar

Wang, X., Ren, Y., Luo, Z., He, W., Hong, J., and Huang, Y. (2023). Deep learning-based eeg emotion recognition: current trends and future perspectives. Front. Psychol. 14:1126994. doi: 10.3389/fpsyg.2023.1126994

PubMed Abstract | Crossref Full Text | Google Scholar

Wang, Y., Gu, Y., Yin, Y., Han, Y., Zhang, H., Wang, S., et al. (2023). Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobot. 17:1181598. doi: 10.3389/fnbot.2023.1181598

PubMed Abstract | Crossref Full Text | Google Scholar

Wu, W., Qi, Z., and Fuxin, L. (2019). “PointConv: deep convolutional networks on 3D point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA), 9621–9630.

PubMed Abstract | Google Scholar

Xu, S., Zhang, W., and Liu, Z. (2022). Cross-dataset facial expression recognition: a comprehensive study. IEEE Transact. Image Process. 31, 515–526.

Google Scholar

Zadeh, A. B., Liang, P. P., Poria, S., Cambria, E., and Morency, L.-P. (2018). “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Melbourne, VIC), 2236–2246.

Google Scholar

Zeng, D., Lin, C., Wang, W., and Zhang, X. (2019). Face2Exp: real-time facial expression reconstruction and recognition with multi-stage training. IEEE Transact. Image Process. 28, 2364–2374. doi: 10.1109/CVPR52688.2022.01965

Crossref Full Text | Google Scholar

Zeng, D., Zhao, Y., and Wang, M. (2020). Real-time facial expression recognition in the wild by selective region ensemble. IEEE Transact. Image Process. 29, 3657–3669.

Google Scholar

Zhang, H., Xu, Z., Shi, Y., and Zhang, Z. (2022). FER-GAN: facial expression recognition via generative adversarial networks. IEEE Transact. Neural Netw. Learn. Syst. 33, 114–123. doi: 10.1109/TNNLS.2021.3132928

PubMed Abstract | Crossref Full Text | Google Scholar

Zhang, S., Tong, H., Xu, J., and Maciejewski, R. (2019). Graph convolutional networks: a comprehensive review. Comp. Soc. Netw. 6, 1–23. doi: 10.1186/s40649-019-0069-y

PubMed Abstract | Crossref Full Text | Google Scholar

Zhang, Y., Wang, S., and Li, X. (2021). Vision transformer for small-size datasets. IEEE Trans. Pattern Anal. Mach. Intell. 43, 2535–2552. doi: 10.48550/arXiv.2112.13492

PubMed Abstract | Crossref Full Text | Google Scholar

Zhao, W., Yang, J., and Xiao, C. (2020). Transformer-based facial expression recognition in the wild. IEEE Transact. Multim. 22, 1874–1884.

Google Scholar

Zhao, X., Liu, J., and Wang, F. (2023). HRNet-FER: high-resolution network for facial expression recognition. IEEE Transact. Image Process. 32, 467–478.

Google Scholar

Zhong, M.-Y., Yang, Q.-Y., Liu, Y., Zhen, B., Xie, B.-B., et al. (2023). EEG emotion recognition based on TQWT-features and hybrid convolutional recurrent neural network. Biomed. Signal Process. Control 79:104211. doi: 10.1016/j.bspc.2022.104211

Crossref Full Text | Google Scholar

Keywords: affective computing, attention mechanisms, feature extraction, emotion recognition, facial expression recognition, deep learning, transfer learning

Citation: Wang J (2024) Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing. Front. Neurosci. 18:1449527. doi: 10.3389/fnins.2024.1449527

Received: 15 June 2024; Accepted: 11 July 2024;
Published: 07 August 2024.

Edited by:

Hancheng Zhu, China University of Mining and Technology, China

Reviewed by:

S. K. B. Sangeetha, SRM Institute of Science and Technology, India
Rohan Borgalli, University of Mumbai, India
Mohana Murugan, Avinashilingam Institute for Home Science and Higher Education for Women, India
Deepali Virmani, Guru Gobind Singh Indraprastha University, India

Copyright © 2024 Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jingyi Wang, d2FuZzA0MDMyMDIyQDE2My5jb20=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.