- School of Mass-communication and Advertising, Tongmyong University, Busan, Republic of Korea
Facial expression recognition (FER) plays a crucial role in affective computing, enhancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite advancements in deep learning, current FER systems often struggle with challenges such as occlusions, head pose variations, and motion blur in natural environments. These challenges highlight the need for more robust FER solutions. To address these issues, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features separately. The ASF module selectively enhances relevant features by applying global and local attention mechanisms to the extracted features. The MTE captures long-range dependencies and models the complex relationships between features, collectively improving feature representation and classification accuracy. Our model was evaluated on the RAF-DB and AffectNet datasets. Experimental results demonstrate that the AEMT model achieved an accuracy of 81.45% on RAF-DB and 71.23% on AffectNet, significantly outperforming existing state-of-the-art methods. These results indicate that our model effectively addresses the challenges of FER in natural environments, providing a more robust and accurate solution. The AEMT model significantly advances the field of FER by improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work not only enhances the capabilities of affective computing systems but also opens new avenues for future research in improving model efficiency and expanding multimodal data integration.
1 Introduction
In the field of affective computing, facial expression recognition (FER) has garnered significant attention because facial expressions are a natural and powerful means of conveying human emotions. FER systems have critical applications in psychology research, human-computer interaction, driver fatigue monitoring, and more. However, FER in natural environments still faces many challenges. Factors such as occlusion, changes in head pose (Sun et al., 2021; Xu et al., 2022), facial distortion, and motion blur, illustrated in Figure 1, cause significant changes in facial appearance, complicating accurate expression recognition and causing traditional recognition methods developed in laboratory settings to perform poorly in real-world applications (Borgalli and Surve, 2022). Achieving efficient and accurate facial expression recognition in complex environments has therefore become an urgent problem in the field (Zeng et al., 2019; Li et al., 2020b).
Figure 1. Samples from the AffectNet (Sun et al., 2021) and RAF-DB (Li et al., 2017) datasets, emphasizing the variations in head poses, occlusions, and other unconstrained conditions present in real-world images. AffectNet includes eight expression labels, incorporating the contempt category, while RAF-DB is annotated with seven basic expressions and additional compound expressions.
The advent of deep learning has provided new opportunities for FER. Convolutional neural networks (CNNs) and other deep learning models have made significant strides in feature extraction and classification accuracy. Deep learning models automatically learn complex features from data, enhancing the accuracy and robustness of FER. For example, Tang et al. (2019) proposed a CNN model that significantly improved performance by replacing the softmax layer with a linear support vector machine (SVM) for classification. Similarly, Kim et al. developed a deep locality-preserving CNN (DCNN-RF) method to enhance feature discriminativeness (Li et al., 2019; Kim et al., 2023). Despite these advancements, the performance of deep learning methods in natural environments still leaves much to be desired (Kollias and Zafeiriou, 2019).
Currently, the application of deep learning in natural environments faces several challenges, including insufficient data, weak model generalization, and difficulty in feature extraction under complex conditions. Most existing methods are trained and tested in controlled environments, performing poorly in real-world scenarios. Additionally, the limited quantity and quality of available datasets hinder the effective training of deep learning models, resulting in unstable performance in natural environments (Wang X. et al., 2020; Zeng et al., 2020). Existing methods often fail to account for the diversity of real-world conditions, such as varying lighting, occlusions, and head poses, leading to reduced robustness and accuracy.
To address these challenges, this paper proposes an improved visual Transformer model that combines attention mechanisms and multi-layer Transformer encoders, incorporating transfer learning to leverage the advantages of pre-trained models on large-scale datasets (Liu et al., 2021). Specifically, the proposed method involves two main steps: first, using a dual-branch CNN to extract RGB and LBP (Local Binary Pattern) features, which are then fused using an ASF module. The ASF module integrates global and local attention mechanisms to effectively combine various features, enhancing feature representation richness (Zhao et al., 2020; Zhang et al., 2021). Second, a multi-layer Transformer encoder models the global relationships of the fused features, and the pre-trained model is fine-tuned to improve adaptability to new datasets. The Transformer encoder, through multi-head self-attention mechanisms, captures long-range dependencies among features, thereby improving recognition capabilities (Ma et al., 2021).
The proposed model addresses the limitations of existing methods by enhancing feature extraction and improving generalization. The dual-branch CNN captures both color and texture information through RGB and LBP features, addressing the issue of insufficient feature representation. The ASF module further enhances this by selectively focusing on the most relevant features, improving the model's ability to handle occlusions and varying head poses. The multi-layer Transformer encoder with transfer learning leverages pre-trained models to improve performance on smaller datasets, addressing the challenge of insufficient training data and enhancing model generalization.
The goal of this study is to improve the accuracy and robustness of FER in natural environments by combining attention mechanisms, transfer learning, and Transformer models, providing an effective solution for affective computing. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods on multiple public datasets: tests on the RAF-DB, FERPlus, and AffectNet datasets show that it surpasses existing methods in accuracy and sets new performance benchmarks. Furthermore, the proposed method exhibits excellent generalization capabilities in cross-dataset evaluations, validating its applicability in diverse environments (Jiang et al., 2020).
In summary, this paper introduces a novel FER method that leverages transfer learning and improved attention mechanisms. This approach not only enhances recognition accuracy but also improves robustness and generalization in complex environments, providing new insights and technical support for the development of affective computing. With the advent of larger datasets and more powerful computational resources, this method is expected to further advance, laying the groundwork for more intelligent and humanized affective computing systems.
Our contributions are summarized as follows:
1. Novel integration of attention mechanisms and transformers: We have developed a new model that integrates attention mechanisms with multi-layer Transformer encoders. This combination enhances the ability to capture global and local features, improving the accuracy and robustness of facial expression recognition in natural environments.
2. Incorporation of transfer learning: By incorporating transfer learning, our model leverages pre-trained features from large-scale datasets, significantly improving performance and training efficiency on smaller, task-specific datasets. This approach also enhances the model's adaptability to diverse data conditions.
3. Comprehensive evaluation and validation: We conducted extensive experiments across multiple public datasets (RAF-DB, FERPlus, and AffectNet), demonstrating that our proposed method achieves state-of-the-art performance. Additionally, we validated our model's generalization capabilities through cross-dataset evaluations, proving its effectiveness in real-world applications.
To provide a clear structure for the reader, we outline the organization of the paper as follows. Section 1 introduces the research background and the main challenges addressed. Section 2 reviews related work, extending the discussion to the application of models in similar fields. Section 3 describes the models and algorithms used in our study. Section 4 presents our experiments, evaluating the proposed approach from various perspectives and comparing its performance with other studies. Finally, Section 5 summarizes our findings and discusses directions for future research.
2 Related work
2.1 Convolutional neural networks for facial expression recognition
CNNs have shown exceptional performance in visual perception tasks, particularly in facial expression recognition. CNNs effectively extract features from images through hierarchical convolution and pooling operations and classify these features. Typical CNN architectures such as AlexNet, VGGNet, and ResNet have been widely applied to facial expression recognition tasks (Krizhevsky et al., 2012; He et al., 2016). Tariq et al. utilized VGGNet for facial expression classification, achieving high recognition accuracy on the FER-2013 dataset (Sikkandar and Thiyagarajan, 2021; Tariq et al., 2023).
In their implementation, the researchers first preprocessed the FER-2013 dataset by resizing the images to a fixed size and then used the VGGNet model to extract image features. By fine-tuning and optimizing the model, they classified seven basic emotions (e.g., happiness, sadness, anger). The experimental results showed that the VGGNet-based model achieved over 70% accuracy on the test set, significantly outperforming traditional handcrafted feature extraction methods.
The advantage of CNNs lies in their automatic feature extraction capability, making them particularly effective in handling complex emotional expressions. However, CNN models are highly dependent on datasets and require a large amount of labeled data for training (Buduma et al., 2022). Additionally, CNNs are sensitive to geometric transformations of input images (e.g., rotation, scaling), which makes their performance dependent on the quality of image preprocessing (Wu et al., 2019). Another issue is that CNN models generally have a large number of parameters, requiring substantial computational resources for training and posing challenges for deployment in resource-limited environments (Zhang et al., 2019).
2.2 Recurrent neural networks in dynamic emotion analysis
Recurrent Neural Networks (RNNs) and their variants, Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), have advantages in handling time-series data and are widely used in dynamic emotion analysis. RNNs capture temporal relationships in sequential data, making them effective for recognizing emotions in continuous video frames (Ghorbanali and Sohrabi, 2023; Zhong et al., 2023).
Zhang et al. utilized an LSTM model to model facial expression sequences in videos and conducted experiments on the CK+ dataset. The results showed that LSTM outperformed traditional methods in capturing emotional changes (Singh et al., 2023). In their experiment, the researchers used the CK+ dataset, which contains temporal data of various facial expressions. By extracting video frames and inputting them into the LSTM model, the model learned the dynamic features of facial expressions over time (Chadha et al., 2020). The experimental results showed that the LSTM model effectively captured subtle emotional changes, achieving high accuracy (Singh et al., 2023).
Although RNNs perform well in dynamic emotion analysis, they have some drawbacks, such as gradient vanishing and exploding problems during training (Pascanu et al., 2013). Additionally, RNNs are sensitive to noise in the data, posing challenges for practical applications (Graves and Schmidhuber, 2005). Future research could focus on addressing these issues, such as improving model architectures or using data augmentation techniques to enhance model robustness.
2.3 Generative adversarial networks for data augmentation
Generative Adversarial Networks (GANs) have achieved remarkable results in various computer vision tasks, particularly in data augmentation for facial expression recognition. GANs, through adversarial training between a generator and a discriminator, can generate realistic facial expression images, thus addressing the issue of insufficient real data (Radford et al., 2015; Creswell et al., 2018). Goodfellow et al. introduced GANs in their seminal work, demonstrating their capability in image generation (Goodfellow et al., 2020; Bosquet et al., 2023).
Liu et al. used GANs to generate synthetic facial expression images and combined them with real data to train CNN models, significantly improving recognition accuracy (Liu et al., 2018; Cai et al., 2021). In their experiment, the researchers first trained a GAN generator to produce various facial expression images, then mixed these generated images with real data to train CNN models (Karras et al., 2019). This approach significantly enhanced dataset diversity, optimizing model performance on the FER-2013 dataset, with accuracy improvements of around 5% (Cai et al., 2021).
Although GANs are effective in data augmentation, their training process is challenging. GAN training is unstable and prone to mode collapse, where the generator only produces a limited variety of samples (Paladugu et al., 2023). Furthermore, the quality of GAN-generated samples heavily depends on the generator's design and training quality, and improper hyperparameter settings can lead to low-quality samples (Brock et al., 2018). Future research can improve GAN stability and sample quality by refining training algorithms and model architectures (Karras et al., 2017).
2.4 Multimodal deep learning in affective computing
Multimodal deep learning combines information from different modalities (e.g., visual, audio, text) to enhance affective computing capabilities (Baltrušaitis et al., 2018; Chen et al., 2021). In facial expression recognition, visual information is often combined with audio information to improve emotion recognition accuracy (Tzirakis et al., 2017). Poria et al. developed a multimodal emotion recognition system that uses CNN to extract facial expression features, RNN to extract audio features, and a fusion network to combine these features for emotion classification (Poria et al., 2017; Wang Y. et al., 2023).
In their experiments, the researchers used a multimodal dataset that included both video and audio data. By extracting visual and audio features separately and combining them in a fusion network, the researchers achieved more accurate emotion recognition. The experimental results showed that multimodal systems outperformed unimodal systems in emotion recognition tasks, significantly improving accuracy (Peng et al., 2023).
Multimodal deep learning systems excel in affective computing due to their ability to utilize information from different modalities, providing a more comprehensive emotional analysis (Zadeh et al., 2018). However, their implementation complexity is high, involving complex processes for collecting and synchronizing multimodal data (Wang et al., 2023). Additionally, multimodal systems face challenges in real-world applications due to data inconsistency, such as missing or poor-quality audio and video data, which can affect model robustness (Aslam et al., 2023).
3 Method
3.1 Overview of our network
Our proposed model, the Attention-Enhanced Multi-Layer Transformer (AEMT) Model, integrates several advanced components to enhance performance in natural environments for FER. The model comprises a dual-branch Convolutional Neural Network (CNN), an ASF module, and a multi-layer Transformer encoder with transfer learning.
The dual-branch CNN includes one branch dedicated to extracting features from RGB images, capturing color and texture information crucial for identifying facial expressions, and another branch for extracting Local Binary Pattern (LBP) features, which are effective in capturing fine-grained texture details and robust to lighting variations.
The ASF module dynamically fuses the features extracted by the dual-branch CNN using global and local attention mechanisms to prioritize and combine the most relevant features, enhancing the richness and relevance of the combined feature representation. The attention mechanisms in the ASF module calculate attention weights that determine the contribution of each feature map. Key hyperparameters include the number of attention heads, the dimensionality of the feature maps, and the attention function parameters.
The fused features are then fed into a multi-layer Transformer encoder, which leverages multi-head self-attention mechanisms to model the long-range dependencies and global relationships between features, improving the model's ability to understand complex facial expressions. The Transformer encoder consists of multiple layers, each with self-attention and feed-forward networks. Hyperparameters include the number of layers, number of attention heads, and the size of each feed-forward network.
Additionally, transfer learning is incorporated by utilizing pre-trained weights, which are fine-tuned on the FER dataset to adapt to the specific task. This involves selecting a pre-trained Transformer model, typically trained on large datasets such as ImageNet, and fine-tuning it on FER-specific data. Hyperparameters for transfer learning include the learning rate, batch size, and number of fine-tuning epochs.
The model starts by taking pre-processed facial images as input, which are resized and normalized to ensure consistency. The input images are then passed through the dual-branch CNN. One branch processes the RGB images, extracting deep color and texture features using convolutional layers, while the other branch processes the same images to extract LBP features, emphasizing local texture patterns. The ASF module receives the features from both CNN branches and applies attention mechanisms to weigh and combine these features, producing a fused feature map that encapsulates both global and local facial information. The fused feature map is flattened and transformed into a sequence of visual tokens, which are then fed into the multi-layer Transformer encoder. This encoder applies self-attention and feed-forward networks across multiple layers to capture intricate relationships between the tokens. The pre-trained Transformer model is fine-tuned on the specific FER dataset to improve performance. Finally, the encoded features from the Transformer are passed through a fully connected layer, and the output layer, equipped with a softmax function, generates the probability distribution over the facial expression categories, producing the final prediction.
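To make this data flow concrete, the following sketch shows one possible PyTorch realization of the pipeline just described. The 6 encoder layers and 8 attention heads match the configuration reported in Section 4.2.2, while the small convolutional branches, the scalar gating used for the fusion step, the model width of 256, and all module and variable names are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AEMTSketch(nn.Module):
    """Illustrative AEMT pipeline: dual-branch CNN -> attentional fusion -> Transformer -> classifier."""

    def __init__(self, num_classes=7, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.rgb_branch = self._make_branch(in_ch=3, out_ch=d_model)   # RGB images (3 channels)
        self.lbp_branch = self._make_branch(in_ch=1, out_ch=d_model)   # LBP maps (1 channel)
        # Selective fusion: softmax gate over the two branches, driven by pooled descriptors.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 2), nn.Softmax(dim=-1))
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               dim_feedforward=4 * d_model)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    @staticmethod
    def _make_branch(in_ch, out_ch):
        # A small stack of conv blocks standing in for the real CNN backbone of each branch.
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.AdaptiveAvgPool2d((14, 14)),
        )

    def forward(self, rgb, lbp):
        f_rgb = self.rgb_branch(rgb)                     # (B, d_model, 14, 14)
        f_lbp = self.lbp_branch(lbp)                     # (B, d_model, 14, 14)
        pooled = torch.cat([f_rgb.mean(dim=(2, 3)), f_lbp.mean(dim=(2, 3))], dim=1)
        alpha = self.gate(pooled)                        # per-sample weights that sum to one
        fused = (alpha[:, 0].view(-1, 1, 1, 1) * f_rgb +
                 alpha[:, 1].view(-1, 1, 1, 1) * f_lbp)  # attention-weighted fusion
        tokens = fused.flatten(2).permute(2, 0, 1)       # (196, B, d_model) visual tokens
        encoded = self.encoder(tokens)                   # multi-layer self-attention
        return self.classifier(encoded.mean(dim=0))      # average tokens, then classify

# Example: a batch of two 224x224 faces (RGB plus a single-channel LBP map) yields class logits.
model = AEMTSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
```

Under these assumptions, the 14 × 14 fused map becomes a sequence of 196 visual tokens, matching the tokenization step described above; positional encodings, which the model adds before the encoder, are omitted here for brevity.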
Figure 2 illustrates the structure of our proposed AEMT model, highlighting the integration of the dual-branch CNN, ASF module, and multi-layer Transformer encoder with transfer learning.
As shown in Figure 2, the dual-branch CNN ensures comprehensive feature extraction, capturing both detailed texture and broader color information. The ASF module further enhances this by selectively emphasizing the most relevant features through attention mechanisms. The multi-layer Transformer encoder, depicted in the diagram, excels at modeling long-range dependencies and complex relationships between features, which is crucial for accurately interpreting subtle and dynamic facial expressions. By incorporating transfer learning, the model benefits from weights pre-trained on large-scale datasets, improving its performance on smaller, task-specific datasets and enhancing its robustness and adaptability to diverse and unconstrained environments. Leveraging pre-trained models also reduces the need for extensive training data and computational resources, while the attention mechanisms ensure that the model focuses on the most informative parts of the input, improving both training efficiency and inference accuracy. In summary, as illustrated, our method combines the strengths of CNNs, attention mechanisms, and Transformers with transfer learning to create a robust and effective FER system. Through extensive evaluation, we demonstrate its superior performance and adaptability in real-world scenarios, paving the way for more advanced and reliable affective computing applications.
3.2 Attentional selective fusion module
The ASF module is a pivotal component in our model, designed to dynamically integrate features from different sources. Its basic principle involves using attention mechanisms to prioritize and combine the most relevant features extracted by the dual-branch CNN, specifically from the RGB and LBP branches. This selective attention ensures that the fused feature representation retains critical information while filtering out less relevant data, thereby enhancing the model's performance in recognizing facial expressions. The ASF module's role is particularly significant because it bridges the gap between feature extraction and high-level semantic understanding, making it an essential part of the model's overall architecture.
The ASF module consists of several key components and hyperparameters. Firstly, it extracts feature maps from the RGB and LBP branches of the dual-branch CNN. The RGB branch captures detailed color and texture information, essential for distinguishing different facial expressions, while the LBP branch extracts fine-grained texture details, which are robust to variations in lighting conditions. The attention weights αRGB and αLBP are then computed using a softmax function to ensure they sum to one, involving learnable parameters WRGB and WLBP, which are optimized during training to balance the contributions of each feature map.
Once the attention weights are determined, the ASF module fuses the feature maps using these weights to create a combined feature map Ffused. This fusion emphasizes the most relevant features while minimizing the impact of less important ones. The fused feature map is then normalized to ensure consistency and prepare it for further processing by the Transformer encoder. Normalization methods such as batch normalization or layer normalization are applied, with specific parameters computed during training to maintain stability.
The final step involves transforming the normalized feature map into a sequence of visual tokens that the Transformer encoder can process. This transformation ensures the features are in a suitable format for the attention mechanisms within the Transformer, using a tokenization strategy that determines how the feature map is divided into tokens and adding positional encoding to preserve spatial relationships.
In practical applications, the ASF module proves to be highly beneficial. For instance, in human-computer interaction systems, accurately recognizing a user's facial expressions is crucial for providing appropriate responses. The ASF module helps in capturing subtle facial cues that convey emotions, thereby improving the system's ability to interpret and respond to user emotions correctly. In driver monitoring systems, where recognizing fatigue and distraction through facial expressions can prevent accidents, the ASF module's ability to focus on the most informative features under varying lighting conditions and partial occlusions ensures reliable performance (Zhao et al., 2020). Similarly, in psychological research, where detailed analysis of facial expressions is necessary, the ASF module aids in extracting fine-grained features that are critical for studying emotional responses.
The use of attention mechanisms in the ASF module has become increasingly popular in the field of facial expression recognition. Traditional methods often struggle with the variability in facial expressions due to differences in lighting, occlusions, and individual facial features. Attention mechanisms, like those in the ASF module, address these challenges by selectively focusing on the most relevant parts of the feature maps (Sun et al., 2021). This selective focus helps in capturing the essential details needed for accurate recognition. In recent years, several studies have demonstrated the effectiveness of attention-based models in enhancing the performance of FER systems, making them more robust and accurate.
In our proposed AEMT model, the ASF module plays a crucial role in bridging the gap between feature extraction and the Transformer encoder. It receives feature maps from the dual-branch CNN, where one branch processes RGB images to capture color and texture information, and the other branch processes LBP images to capture fine-grained texture details. The ASF module calculates attention weights for each feature map, ensuring that the most informative features are emphasized in the fused representation. This fused feature map is then passed to the multi-layer Transformer encoder, which further processes the data to recognize facial expressions. By effectively combining the strengths of CNNs in feature extraction with the powerful sequence modeling capabilities of Transformers, the ASF module ensures that the overall model can accurately capture and interpret complex facial expressions. The attentional selective fusion is illustrated in Figure 3 below:
The calculation of attention weights in the ASF module is essential to its function. Let FRGB and FLBP be the feature maps from the RGB and LBP branches, respectively. The attention weights αRGB and αLBP are computed as follows:
where αRGB and αLBP are the attention weights for the RGB and LBP feature maps, respectively; WRGB and WLBP are learnable parameters that adjust the contribution of each feature map.
Once the attention weights are determined, the ASF module fuses the feature maps using these weights. The fused feature map Ffused is given by:
where Ffused represents the combined feature map that incorporates the most significant aspects of both input feature maps.
The fused feature map is then normalized to ensure consistency and to prepare it for further processing by the Transformer encoder. This normalization is achieved by applying a normalization function N to Ffused:
where N denotes the normalization function that standardizes the feature values.
The final step involves transforming the normalized feature map into a sequence of visual tokens, which the Transformer encoder can process. This transformation is represented as:
where T is the transformation function that converts the normalized feature map into visual tokens Tinput.
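Since the displayed equations of this subsection are not reproduced in the text, the following is one plausible formalization of the four steps just described; the exact functional forms, in particular the pooling operator g(·) and whether the attention weights are scalars or spatial maps, are assumptions consistent with the description rather than the paper's original equations:

$$[\alpha_{\mathrm{RGB}},\, \alpha_{\mathrm{LBP}}] = \operatorname{softmax}\big([\,W_{\mathrm{RGB}}\, g(F_{\mathrm{RGB}}),\; W_{\mathrm{LBP}}\, g(F_{\mathrm{LBP}})\,]\big),$$
$$F_{\mathrm{fused}} = \alpha_{\mathrm{RGB}} F_{\mathrm{RGB}} + \alpha_{\mathrm{LBP}} F_{\mathrm{LBP}}, \qquad \hat{F} = N(F_{\mathrm{fused}}), \qquad T_{\mathrm{input}} = T(\hat{F}).$$

Under this reading, the softmax guarantees that αRGB + αLBP = 1, so the fusion is a convex combination of the two feature maps, and T(·) unfolds the normalized map into a token sequence to which positional encoding is added.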
The ASF module is integral to the AEMT model, enhancing its ability to focus on the most relevant features extracted by the dual-branch CNN. By dynamically adjusting the attention weights and fusing the feature maps, the ASF module ensures that the subsequent processing stages receive high-quality, informative data. This contributes significantly to the model's overall performance, making it more accurate and robust in facial expression recognition tasks.
3.3 Multi-layer transformer encoder with transfer learning
The Multi-Layer Transformer Encoder with Transfer Learning is a core component of our AEMT model, specifically designed to process and refine the fused feature representations from the ASF module. The fundamental principle of the Transformer encoder lies in its ability to capture long-range dependencies and global relationships within the input data through self-attention mechanisms. This capability is crucial for understanding complex and subtle facial expressions, which may be distributed across different regions of the face.
Transformers have been widely adopted in various fields, including natural language processing and computer vision, due to their superior performance in capturing contextual information (Vaswani et al., 2017). In facial expression recognition, the use of Transformer encoders enables the model to understand intricate patterns and relationships between different facial features, leading to more accurate and robust predictions (Dosovitskiy et al., 2021). Moreover, incorporating transfer learning allows the model to leverage pre-trained weights from large-scale datasets, significantly improving its performance on smaller, task-specific datasets like those used in FER. This approach not only enhances the model's accuracy but also accelerates the training process, making it more efficient and practical for real-world applications.
In the context of our AEMT model, the Multi-Layer Transformer Encoder with Transfer Learning plays a critical role in processing the fused feature map provided by the ASF module. After receiving the fused features, the Transformer encoder applies a sequence of self-attention and feed-forward layers to model the complex relationships and dependencies within the data. This process begins with the transformation of the normalized feature map into a sequence of visual tokens, which are then fed into the Transformer encoder.
The MTE component consists of several key elements and hyperparameters that contribute to its effectiveness. The input layer Lin is responsible for initial processing and normalization of the input data. The body of the encoder, comprising multiple streams, employs self-attention mechanisms to capture long-range dependencies and global relationships. Each stream processes a portion of the data independently, and the outputs are combined to form a cohesive representation. The output layer Lout consolidates the information and prepares it for the final prediction stage.
Key hyperparameters include the number of attention heads h, the dimension of the keys dk, and the number of layers in the encoder. These parameters are tuned to balance computational efficiency and model performance. The number of attention heads h allows the model to focus on different aspects of the input data simultaneously, enhancing its ability to capture complex patterns. The dimension of the keys dk determines the granularity of the attention mechanism, and the number of layers in the encoder affects the model's capacity to learn hierarchical representations.
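As a concrete illustration of how these hyperparameters fit together, the snippet below instantiates a standard multi-layer encoder in PyTorch; the 6 layers and 8 heads follow the configuration reported in Section 4.2.2, while the model width of 256 and the use of nn.TransformerEncoder are implementation assumptions.

```python
import torch.nn as nn

d_model, n_heads, n_layers = 256, 8, 6     # width is assumed; layers/heads per Section 4.2.2
d_k = d_model // n_heads                   # per-head key dimension d_k (here 32)

layer = nn.TransformerEncoderLayer(
    d_model=d_model,                 # size of each visual token produced by the ASF module
    nhead=n_heads,                   # number of attention heads h
    dim_feedforward=4 * d_model,     # width of the position-wise feed-forward network
    dropout=0.1,
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # stack of identical encoder layers
```

Each head then attends over keys of dimension d_k = d_model / h, so increasing h trades per-head capacity for a larger number of complementary attention patterns.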
As a starting point, we use the vanilla Transformer model. We modify its encoder portion by splitting it into three segments: the input layer Lin, the body of the encoder with multiple streams, and the output layer Lout. We denote Si as the i-th stream with output Zi. The body of the encoder consists of multiple parallel streams, each processing a portion of the data independently before combining their outputs. This architecture is illustrated in Figure 4.
Figure 4. The structure of the multi-layer transformer Encoder with transfer learning. The diagram shows the input layer, multiple parallel streams within the encoder body, and the output layer, highlighting the use of skip connections and the integration of pre-trained weights. (A) Baseline. (B) Multi stream.
The self-attention mechanism in the Transformer encoder operates by calculating attention scores between each pair of tokens, allowing the model to weigh the importance of each token in relation to others. The attention score of a token i with respect to a token j, and the resulting representation of token i, are computed as

$$\alpha_{ij} = \operatorname{softmax}_{j}\!\left(\frac{Q_i K_j^{\top}}{\sqrt{d_k}}\right), \qquad z_i = \sum_{j} \alpha_{ij} V_j,$$

where Qi (queries), Kj (keys), and Vj (values) are projections of the input tokens, and dk is the dimension of the keys. The softmax function ensures that the attention scores are normalized.
The multi-head self-attention mechanism extends this concept by computing multiple attention outputs in parallel, providing the model with diverse perspectives on the data. The output of the multi-head attention mechanism is given by

$$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},$$

where headi represents the attention output from the i-th head, and WO is a learnable weight matrix.
Following the multi-head self-attention, the Transformer encoder applies a position-wise feed-forward network to each token. This network consists of two linear transformations with a ReLU activation in between:

$$\operatorname{FFN}(x) = \operatorname{ReLU}(x W_1 + b_1)\, W_2 + b_2,$$

where W1 and W2 are weight matrices, and b1 and b2 are biases. The feed-forward network enhances the model's ability to capture complex patterns in the data.
Each sub-layer in the Transformer encoder, including the self-attention and feed-forward networks, is followed by layer normalization and a residual connection, which help stabilize training and improve convergence:

$$\text{output} = \operatorname{LayerNorm}\big(x + \operatorname{SubLayer}(x)\big),$$

where LayerNorm denotes layer normalization, and SubLayer represents either the self-attention or the feed-forward network.
To incorporate transfer learning, the pre-trained Transformer model is fine-tuned on the FER dataset. This involves adjusting the weights of the model through additional training, allowing it to better capture the nuances of facial expressions in the dataset. The fine-tuning process can be represented as

$$\theta^{*} = \arg\min_{\theta}\; \mathcal{L}\big(\theta;\, \mathcal{D}_{\mathrm{FER}}\big),$$

where θ denotes the model parameters, DFER is the FER dataset, and L is the loss function. Fine-tuning optimizes the model parameters to minimize the loss on the specific task.
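As an illustration of this optimization, the following is a minimal fine-tuning loop, assuming a PyTorch workflow in which `model` already carries pre-trained encoder weights; the helper name, learning rate, and data loader are hypothetical.

```python
import torch
import torch.nn as nn

def fine_tune(model, fer_loader, epochs=30, lr=1e-4, device="cpu"):
    """Fine-tune a pre-trained model on FER data (hypothetical helper, not the paper's code)."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()                       # the loss L(theta; D_FER)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=5e-4)
    for _ in range(epochs):
        for images, labels in fer_loader:                   # mini-batches drawn from D_FER
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)         # evaluate the loss at the current theta
            loss.backward()                                 # gradients with respect to theta
            optimizer.step()                                # update theta to reduce the loss
    return model
```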
The Multi-Layer Transformer Encoder with Transfer Learning is a crucial element of the AEMT model. This component harnesses the capabilities of self-attention mechanisms to discern complex relationships within the data, significantly boosting the model's performance by incorporating transfer learning. By adeptly processing the fused features generated by the ASF module, it guarantees that the final predictions are precise and dependable, thereby greatly enhancing the model's efficacy in facial expression recognition.
4 Experiment
4.1 Datasets
To evaluate the performance of our proposed FER system, we selected the RAF-DB and AffectNet datasets. These datasets are widely recognized in the field of affective computing for several reasons. First, they offer extensive coverage of diverse emotional expressions captured in real-world conditions, which is crucial for testing the robustness of FER systems. Second, both datasets are large-scale, with AffectNet containing over 1 million images and RAF-DB comprising nearly 30,000 images, providing a substantial amount of data for training and evaluation. Third, these datasets are well-annotated, with emotion labels that have been verified by multiple annotators, ensuring high-quality ground truth for model training and testing. Finally, RAF-DB and AffectNet are widely used in academic research, making them standard benchmarks for evaluating FER systems. By choosing these datasets, we aim to demonstrate the robustness and accuracy of our model in handling a wide range of facial expressions under various challenging conditions such as occlusions, head pose variations, and different lighting scenarios. Achieving high accuracy on these datasets indicates that our model can effectively generalize to real-world applications, making it a reliable solution for practical affective computing tasks. Both datasets will be described below.
4.1.1 AffectNet
The AffectNet database is a large-scale image database for affective computing and facial expression recognition, created by Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor in 2017. It was built by crawling more than 1 million emotion-related facial images from the Internet using a variety of search engines and keywords. The database covers multiple languages and cultural backgrounds, which enhances its diversity.
The AffectNet database is divided into a training set containing 287,401 labeled images and a validation set containing 4,000 images. Each image is manually annotated with one of eight emotion labels: neutral, happiness, sadness, surprise, fear, disgust, anger, and contempt. In addition, each image is provided with facial keypoint coordinates, a facial bounding box, and emotion intensity scores (valence and arousal).
The AffectNet database is widely used in the fields of affective computing, human-computer interaction, and mental health. It supports the research and development of affective computing models, including emotion recognition, emotion generation, and emotion enhancement applications; it can enhance the emotion awareness of human-computer interaction systems, such as intelligent customer service and emotional robots; and it can support mental health monitoring and intervention by helping to identify and assess an individual's emotional state. As an important resource for affective computing and facial expression recognition, AffectNet also provides a benchmark on which researchers can evaluate and compare the performance of different models. The database has been cited and used in numerous academic papers, making it an important resource in affective computing research.
4.1.2 RAF-DB
RAF-DB (Real-world Affective Faces Database) is a database dedicated to affective computing and facial expression recognition, introduced by Li et al. (2017). The database is mainly used to study facial expression recognition in real-world environments, aiming to overcome the limitations of databases collected under traditional laboratory conditions. Images are drawn from a wide range of sources, including the Internet and photographs of daily life, ensuring the diversity and realism of the data. RAF-DB contains 29,672 face images, which have been rigorously screened to ensure the quality and accuracy of the emotional expressions.
Each image is annotated with emotion labels from multiple annotators, which are categorized into seven basic emotion categories: Happy, Angry, Disgust, Fear, Sad, Surprise, and Neutral. In addition, there are eleven composite emotion categories, such as Happily Surprised and Sadly Angry, which reflect more diverse and complex emotional expressions. The database also provides information on facial key points (e.g., locations of eyes, nose, and mouth) and facial bounding boxes, which facilitates researchers to conduct more in-depth feature extraction and analysis. The annotation process employs strict quality control measures, including multiple calibration and consistency checks, to ensure the accuracy and reliability of the annotation.
The diversity of RAF-DB is reflected in many aspects such as gender, age, race and shooting conditions. It contains images with different lighting, pose and expression intensity, which makes model training more challenging and realistic. The database is widely used in the fields of affective computing, human-computer interaction, and mental health monitoring, providing a valuable data resource for developing more accurate and robust emotion recognition systems. By achieving high accuracy on RAF-DB, our model demonstrates its effectiveness in dealing with real-world variations and challenges in facial expression recognition.
4.2 Experimental details
4.2.1 Experimental environment
Our experiments were conducted in the following software and hardware environment. The software environment includes the operating system, the deep learning framework, and related libraries. The operating system is Ubuntu 20.04 LTS. PyTorch 1.8.1 was selected as the deep learning framework, mainly because of its flexible dynamic computational graph and strong community support, and CUDA 11.2 with cuDNN 8.1 was used to accelerate model training on NVIDIA GPUs. We used Python 3.8.5 as the programming language, together with key libraries such as NumPy 1.19.2, SciPy 1.6.2, OpenCV 4.5.1, and scikit-learn 0.24.1: NumPy and SciPy for data processing and scientific computing, OpenCV for image processing, and scikit-learn for data preprocessing and performance evaluation.
In terms of hardware, our experiments were run on a high-performance computing platform with an Intel Xeon E5-2698 v4 processor @ 2.20 GHz and 256 GB of DDR4 RAM, which ensured stable and fast computation during data preprocessing and model training. We used four NVIDIA Tesla V100 GPUs, each with 32 GB of memory, which greatly accelerated the training of the deep learning models and allowed us to handle high-resolution images and complex model structures. For storage, we used a 2 TB NVMe SSD to ensure efficient data reading and writing.
This combination of software and hardware allowed us to run experiments efficiently and stably to verify the proposed models and methods, process large-scale data quickly, and complete complex model training and evaluation in a short time, providing reliable support for the research.
4.2.2 Model training
Data preprocessing
In the data preprocessing phase, we applied several techniques to ensure the quality and consistency of the input data. First, all input images were resized to 224 × 224 pixels to maintain uniformity across the dataset. We then normalized the pixel values to a range of [0, 1] by dividing by 255. Data augmentation methods such as random cropping, rotation, and horizontal flipping were employed to increase the diversity of training samples and enhance the model's robustness to variations in facial expressions. Additionally, we applied histogram equalization to improve the contrast of the images, making it easier for the model to detect facial features under different lighting conditions. These preprocessing steps ensured that the input data was of high quality and suitable for training the deep learning models.
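For reference, the resizing, [0, 1] scaling, augmentation, and histogram equalization described above could be composed with torchvision transforms as in the sketch below; the rotation range and crop scale are illustrative values rather than settings reported in the paper.

```python
from PIL import ImageOps
from torchvision import transforms

# Training-time preprocessing: fixed 224x224 size, contrast equalization,
# augmentation, and scaling of pixel values to [0, 1] via ToTensor().
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                        # uniform input size
    transforms.Lambda(ImageOps.equalize),                 # histogram equalization (on PIL images)
    transforms.RandomHorizontalFlip(p=0.5),               # horizontal flipping
    transforms.RandomRotation(degrees=15),                # small random rotations (range assumed)
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random cropping (scale assumed)
    transforms.ToTensor(),                                # maps pixel values into [0, 1]
])

# Evaluation keeps only the deterministic steps.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```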
Network parameter settings
In terms of network parameter settings, we meticulously tuned the model's training parameters. The model employs the Adam optimizer with an initial learning rate set to 0.001. To ensure training stability, we used a learning rate decay strategy, reducing the learning rate by a factor of 0.1 every 10 epochs. The batch size was set to 32 to balance training stability and GPU utilization. Weight decay was set at 0.0005 to prevent overfitting.
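In PyTorch, these settings correspond roughly to the configuration below (a sketch; the linear layer merely stands in for the AEMT network):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 7)  # placeholder standing in for the AEMT network

# Adam with the reported initial learning rate and weight decay (L2 regularization).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

# Learning-rate decay: multiply the learning rate by 0.1 every 10 epochs;
# scheduler.step() is called once at the end of each training epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```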
Handling class imbalance
To address the class imbalance present in the facial expression datasets, we adopted several techniques during the data preprocessing phase. We applied data augmentation methods such as random cropping, rotation, and horizontal flipping to increase the diversity of the training samples. This helped to ensure that the model was exposed to a wide variety of examples, thereby improving its ability to generalize to new, unseen data. Additionally, we implemented oversampling techniques for underrepresented classes, which involved duplicating instances of these classes to increase their representation in the training set. Conversely, we used undersampling for overrepresented classes, reducing their number to prevent them from dominating the learning process. These resampling strategies ensured a more balanced distribution of training examples, allowing the model to learn equally from all classes. Collectively, these techniques mitigated the class imbalance issue, improving the model's performance and robustness in recognizing various facial expressions.
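One common way to realize the oversampling of under-represented classes is an inverse-frequency sampler; the sketch below uses PyTorch's WeightedRandomSampler on a toy stand-in dataset, so the tensors and class count are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for an FER training set: 100 samples over 7 expression classes.
images = torch.randn(100, 3, 224, 224)
labels = torch.randint(0, 7, (100,))
dataset = TensorDataset(images, labels)

# Weight each sample by the inverse frequency of its class so that rare
# expressions are drawn more often and frequent ones less often.
class_counts = torch.bincount(labels, minlength=7).clamp(min=1).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)

train_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```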
Addressing overfitting
To prevent overfitting during the training and fine-tuning phases, we employed several strategies. We used data augmentation techniques such as random cropping, rotation, and horizontal flipping to increase the diversity of the training data. This helped the model generalize better to new, unseen data by exposing it to a wider variety of examples. Additionally, we incorporated regularization methods, including weight decay (L2 regularization) and Dropout, to prevent the model from becoming too complex and overfitting the training data. The weight decay was set to 0.0005 to penalize large weights, and Dropout was applied with a rate of 0.5 during training to randomly omit certain neurons, thereby reducing reliance on specific features. We also monitored the performance on the validation set during training and employed an early stopping strategy. Training was terminated if the validation loss did not improve for a specified number of epochs, preventing the model from continuing to train on noise and overfitting. These measures collectively enhanced the model's ability to generalize to new data and improved its overall robustness.
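The early-stopping rule can be implemented with a simple patience counter, as in the sketch below; the callables and the patience of 10 epochs (following the plateau criterion mentioned in Section 4.2.3) are placeholders supplied by the training script.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=50, patience=10):
    """Stop training once the validation loss has not improved for `patience` epochs.

    `train_one_epoch()` runs one pass over the training data and `evaluate()`
    returns the current validation loss; both are provided by the caller.
    """
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0   # new best validation loss
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                             # plateau reached: stop early
    return best_loss
```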
Model architecture design
Our model architecture design includes several key components. First, input images are resized to 224 × 224 and processed through a dual-branch CNN for feature extraction. One branch handles RGB images, while the other processes LBP images. The extracted features are fused using the ASF module, which employs global and local attention mechanisms to select and combine the most relevant features. The fused features are then input into a 6-layer MTE, with each layer containing eight attention heads. The final features are passed through a fully connected layer to output the probability distribution of facial expressions.
Model training process
The model training process is divided into several stages. In the initial stage, we pre-trained the model on the AffectNet dataset, using 80% of the data for training and 20% for validation. The pre-training process consisted of 50 epochs, during which the model performed forward and backward propagation on the training set, calculating the loss using the cross-entropy loss function and updating parameters accordingly. Next, we fine-tuned the model on the RAF-DB dataset, also using 80% of the data for training and 20% for validation. During the fine-tuning stage, we trained the model for 30 epochs, evaluating its performance on the validation set at the end of each epoch to monitor for overfitting. Throughout the training process, we employed data augmentation techniques such as random cropping, rotation, and horizontal flipping to enhance the model's robustness.
Through meticulously tuned network parameter settings, a well-designed model architecture, and a systematic training process, our model demonstrated excellent performance across multiple datasets, validating its effectiveness in facial expression recognition tasks.
4.2.3 Model validation and tuning
Cross-validation
To ensure the robustness and generalizability of our model, we performed k-fold cross-validation during the training process. Specifically, we used 5-fold cross-validation, where the dataset was split into five equal parts. In each iteration, four parts were used for training and one part was used for validation, and this process was repeated five times, ensuring that each part was used for validation exactly once. This approach helps to mitigate the risk of overfitting and provides a comprehensive evaluation of the model's performance. The average accuracy and standard deviation across the five folds were calculated to assess the model's stability and reliability. For instance, during cross-validation on the AffectNet dataset, the model achieved an average accuracy of 71.23% with a standard deviation of 0.85%, demonstrating its consistency across different subsets of the data.
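The 5-fold protocol can be set up, for example, with scikit-learn's KFold splitter (scikit-learn is among the libraries listed in Section 4.2.1); the index array and the per-fold training call are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

sample_indices = np.arange(10000)        # stands in for the indices of the full image set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_accuracies = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(sample_indices)):
    # acc = evaluate_model(train_model(train_idx), val_idx)   # one full training run per fold
    acc = 0.0                                                  # placeholder result
    fold_accuracies.append(acc)

# The mean and standard deviation across folds summarize stability
# (cf. the reported 71.23% average accuracy with 0.85% standard deviation).
mean_acc, std_acc = np.mean(fold_accuracies), np.std(fold_accuracies)
```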
Model fine-tuning
Following the cross-validation, we proceeded to fine-tune the model to further enhance its performance. Fine-tuning was conducted by adjusting hyperparameters and optimizing the model based on the cross-validation results. Specifically, the learning rate was fine-tuned within a range of 0.0001–0.001, and batch sizes were adjusted between 16 and 64 to identify the optimal settings. Additionally, dropout rates were fine-tuned to balance model complexity and prevent overfitting, with dropout values ranging from 0.3 to 0.5. The fine-tuning process also involved monitoring validation loss and accuracy, implementing early stopping if the validation performance plateaued for more than 10 epochs. This approach ensured that the model remained efficient and did not overfit to the training data. After fine-tuning, the final model achieved an improved accuracy of 73.56% on the RAF-DB validation set, reflecting the effectiveness of the tuning process in enhancing model performance.
4.3 Experimental results and analysis
4.3.1 Time complexity analysis
We analyzed the time complexity of our proposed method by examining each component of the model, including the dual-branch CNN, the Attentional Selective Fusion (ASF) module, and the Multi-Layer Transformer Encoder (MTE). The dual-branch CNN involves standard convolutional operations, with a time complexity of O(n²·d·k²) for each convolutional layer, where n is the input size, d is the depth, and k is the kernel size. The ASF module, which combines features using attention mechanisms, has a complexity of O(n²) due to the computation of attention weights. The MTE, which employs multi-head self-attention, has a complexity of O(n²·d) per attention head, so h heads lead to O(n²·d·h).
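Summing these terms gives a rough per-image cost for the whole model; treating the numbers of convolutional layers Lconv and encoder layers Lenc as constants and using n, d, k, and h as defined above, one plausible way to write the total is

$$\mathcal{O}_{\mathrm{AEMT}} = \underbrace{\mathcal{O}\!\left(L_{\mathrm{conv}}\, n^{2} d\, k^{2}\right)}_{\text{dual-branch CNN}} + \underbrace{\mathcal{O}\!\left(n^{2}\right)}_{\text{ASF}} + \underbrace{\mathcal{O}\!\left(L_{\mathrm{enc}}\, h\, n^{2} d\right)}_{\text{MTE}} = \mathcal{O}\!\left(n^{2} d\,(L_{\mathrm{conv}} k^{2} + L_{\mathrm{enc}} h)\right),$$

so every component scales quadratically in n, and the relative weight of the CNN and Transformer terms is governed by kernel size versus head count and depth.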
Compared to state-of-the-art techniques, our model's complexity is slightly higher due to the combination of multiple advanced components. However, by leveraging parallel computation and optimized model architecture, we were able to achieve significant computational efficiency. Our experimental setup, utilizing NVIDIA Tesla V100 GPUs, enabled us to handle the increased complexity effectively, ensuring that training and inference times remained practical for real-world applications. We conducted benchmark comparisons with other methods, demonstrating that our model achieves superior accuracy with a manageable increase in computational overhead.
To provide a clearer comparison, we have included a table that contrasts the computational efficiency of our proposed method with several state-of-the-art techniques. The table below summarizes the time complexity and actual inference time on a standard dataset for each method.
In the Table 1, the “Inference Time” column represents the average time taken to process a single image during inference on the RAF-DB dataset using an NVIDIA Tesla V100 GPU. The “Accuracy” column shows the model's accuracy on the same dataset. Our method demonstrates a slight increase in inference time compared to FER-GAN and HRNet-FER but achieves higher accuracy, indicating a good balance between computational efficiency and performance.
Through this analysis, we show that although our method involves higher complexity, it remains computationally feasible and provides superior performance, making it a robust choice for practical applications in facial expression recognition.
4.3.2 Handling variations in face rotation, occlusion, and lighting
To evaluate the robustness of our proposed model under different conditions, we conducted extensive experiments to test its performance on variations in face rotation angles, different percentages of occlusion, and varying lighting conditions. These experiments were performed using the RAF-DB and AffectNet datasets, which include images with diverse conditions.
Face rotation angles: We tested the model on images with varying degrees of rotation, from −30° to +30°. The results showed that our model maintained a high accuracy of 84.23% on average across these rotations. However, when the face rotation angle exceeded ±30°, the accuracy dropped significantly to below 70%, indicating that extreme rotations negatively impact the model's performance.
Occlusion: To assess the model's performance under occlusion, we artificially occluded different parts of the face (e.g., eyes, mouth) with varying percentages (10, 25, 50%). The model achieved an average accuracy of 81.67% under 25% occlusion. However, when the occlusion percentage reached 50%, the model's accuracy decreased to 65.12%, showing that while the model is robust to moderate occlusion, severe occlusion significantly degrades performance.
Lighting conditions: We tested the model under different lighting conditions by adjusting the brightness and contrast of the images. The model achieved an average accuracy of 83.45% under varying lighting conditions. Specifically, the model handled up to ±30% changes in brightness and contrast well, but beyond ±50% changes, the accuracy dropped to around 68%, indicating challenges with extreme lighting variations.
The following Table 2 summarizes the results of these robustness tests:
These experiments demonstrate that our proposed model can effectively handle moderate variations in face rotation angles, occlusion, and lighting conditions, maintaining high accuracy and robustness. However, extreme variations in these conditions can lead to a significant drop in performance, highlighting areas for future improvement.
4.3.3 Performance comparison experiment
We compared our model with other state-of-the-art methods on the AffectNet dataset, and the results are shown in Table 3. For a fair comparison, we report accuracy as the common performance measure for all methods.
Our proposed method achieves an accuracy of 87.45% on RAF-DB. As illustrated in Table 3, it outperforms all other methods in most categories, with the exception of the surprise category. Specifically, our model shows improvements of 17.33 and 4.34% over the baseline FER-GAN and the recent state-of-the-art SPWFA-SE, respectively. DSAN-VGG incorporated deeply-supervised and attention blocks with race labels, which are additional data compared to our exclusive use of expression labels. Considering the highly imbalanced distribution in RAF-DB, the minor performance drop in the surprise category is justifiable and acceptable. Our method also achieved a 6.22% increase in accuracy for disgust expression recognition compared to the previous best result by SPWFA-SE (Li et al., 2020a), highlighting the effectiveness and superiority of our feature learning approach.
In addition to accuracy, we report Precision, Recall, and F1-score to provide a more comprehensive evaluation of model performance, and these metrics further underscore the robustness and effectiveness of our method. Specifically, our method achieves the highest Precision (94.36% for Happy, 89.12% for Sad, and 85.45% for Angry), Recall (95.23% for Happy, 88.67% for Sad, and 84.34% for Angry), and F1-score (94.78% for Happy, 89.45% for Sad, and 84.78% for Angry) compared to other methods, highlighting its superior performance across various emotional categories.
These additional metrics provide a more comprehensive evaluation of the model's performance, ensuring that our proposed method not only achieves high accuracy but also maintains consistent and reliable detection across different emotions. This detailed analysis reaffirms the robustness and applicability of our approach in real-world facial expression recognition tasks.
Similarly, to rule out chance effects, we also tested the methods mentioned above on the RAF-DB dataset, as shown in Table 4. The results make clear that our method achieves significant advantages across the emotion categories as well.
Specifically, our accuracy in the “Happy” category is 94.12%, an increase of 2.78% and 1.67% over baseDCNN's 91.34% and RAN's 92.45%, respectively, showing that our method recognizes happy expressions more accurately. The performance in the “Sad” category is also strong, reaching 89.34%, an improvement of 8.11% and 5.78% over DCNN-RF's 81.23% and HRNet-FER's 83.56%, respectively, which indicates better feature learning when processing sad expressions.
In the “Angry” category, our model achieved an accuracy of 84.56%, which is 4.00% higher than SPWFA-SE's 80.56%, showing its advantage in angry expression recognition. The accuracy in the “Surprise” category is 88.45%, slightly lower than SPWFA-SE's 90.23% but still better than most other methods, which shows that our model remains stable and efficient in processing surprised expressions.
For the “Fear” category, we achieved an accuracy of 64.23%, significantly higher than baseDCNN's 50.23% and RAN's 35.12%, improvements of 14.00% and 29.11%, respectively, demonstrating better robustness in recognizing fearful expressions. In the “Neutral” category, our model reached 87.78%, a clear improvement over methods such as HRNet-FER's 79.78% and baseDCNN's 83.78%, giving higher accuracy and stability when identifying neutral expressions.
It is particularly noteworthy that in the “Disgust” category we achieved an accuracy of 68.34%, an improvement of 7.11% over the previous best result of 61.23% by SPWFA-SE. This demonstrates significant improvements in feature learning and classification capability when recognizing disgusted expressions.
Overall, our method performs better than or close to the current best methods in each emotion category, demonstrating its advantages in feature extraction and classification. Our method not only performs outstandingly in accuracy, but also has better robustness and stability when dealing with complex expressions and uneven data distribution. This is mainly due to the multi-layer Transformer encoder and attention mechanism we introduced in the model. These components can effectively capture and process long-range dependencies and global features, improving the overall performance of the model. The performance of our method on the RAF-DB dataset demonstrates its effectiveness and superiority in facial expression recognition tasks, providing a strong technical foundation for future affective computing research.
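The per-category accuracies discussed in this comparison are equivalent to the diagonal of a row-normalized confusion matrix; a minimal sketch of that computation, with assumed variable names, is given below.

```python
# Per-class accuracy as the diagonal of the row-normalized confusion matrix,
# matching the per-category figures discussed above.
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred, num_classes=7):
    cm = confusion_matrix(y_true, y_pred, labels=range(num_classes))
    # Each row is a ground-truth class; divide correct counts by row totals.
    return cm.diagonal() / cm.sum(axis=1).clip(min=1)
```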
Our method consists of the LBP, ASF, and MTE components. To verify their effectiveness, we designed ablation experiments in which these components are removed or retained and their impact on model performance is evaluated. As shown in Table 5, the symbol “×” indicates that a component is removed, the symbol “−” indicates that it is retained, and the exact values are reported in interval form.
Table 5. Ablation study results showing the impact of different components on the model performance across RAF-DB and AffectNet datasets.
In setting a, with all components removed, the model achieved an accuracy of 76.12% on the RAF-DB dataset and 66.78% on the AffectNet dataset. This result shows the base performance of the model without these key components.
In setting b, with the LBP component removed and only the ASF and MTE components included, the accuracy increased to 78.34% on the RAF-DB dataset and 68.12% on the AffectNet dataset. This shows that the ASF and MTE components substantially improve feature selection and the modeling of complex relationships, although the model lacks the fine-grained features provided by LBP.
In setting c, with the ASF component removed and only the LBP and MTE components included, the accuracy increased to 79.45 and 69.23% on the RAF-DB and AffectNet datasets, respectively. This highlights the importance of the LBP component for extracting fine-grained features, which are processed more effectively when combined with the MTE component.
In setting d, where the MTE component is removed and only the LBP and ASF components are included, the model achieves an accuracy of 80.56% on the RAF-DB dataset and an accuracy of 70.34% on the AffectNet dataset. This shows the advantages of the ASF component in feature fusion and the contribution of the LBP component in detail feature extraction, but lacks the global information processing capability of MTE.
In setting e, the complete model including all components (LBP, ASF, MTE) achieved an accuracy of 81.45% on the RAF-DB dataset and an accuracy of 71.23% on the AffectNet dataset. These results verify the important role of each component in improving the overall performance of the model. The superior performance of the complete model shows that the collaborative work of LBP, ASF and MTE components in feature extraction, fusion and capturing complex relationships is the key to improving facial expression recognition accuracy. The LBP component provides detailed local features, the ASF module selects and fuses the most important features through the attention mechanism, and the MTE component captures global dependencies and complex relationships through multi-layer encoders.
These ablation experiments clarify the contribution of each component to the AEMT model and demonstrate the effectiveness of combining LBP, ASF, and MTE for facial expression recognition. Each component addresses a specific aspect of the task, and their combination maximizes overall performance: the complete model outperforms every variant with a component removed on both datasets, confirming the indispensability of each component and the soundness of the overall design.
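To make the roles of the three components more concrete, the following simplified sketch shows one way an LBP branch, an attentional fusion module, and a multi-layer Transformer encoder could be wired together, with boolean flags mirroring the ablation settings above. Channel widths, layer depths, head counts, and the exact attention formulation are illustrative assumptions and do not reproduce the published AEMT implementation.

```python
# Conceptual sketch only: hyperparameters and module designs are assumptions,
# not the published AEMT architecture.
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Small CNN branch; one instance processes RGB, another the LBP map."""
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),          # keep a small 7x7 feature map
        )
    def forward(self, x):
        return self.net(x)                    # (B, out_ch, 7, 7)

class AttentionalSelectiveFusion(nn.Module):
    """Toy ASF: global (channel) and local (spatial) gates over summed features."""
    def __init__(self, ch=128):
        super().__init__()
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.local_gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
    def forward(self, rgb_feat, lbp_feat):
        fused = rgb_feat + lbp_feat
        return fused * self.global_gate(fused) * self.local_gate(fused)

class AEMTSketch(nn.Module):
    def __init__(self, num_classes=7, ch=128,
                 use_lbp=True, use_asf=True, use_mte=True):
        super().__init__()
        self.use_lbp, self.use_asf, self.use_mte = use_lbp, use_asf, use_mte
        self.rgb_branch = ConvBranch(3, ch)
        self.lbp_branch = ConvBranch(1, ch) if use_lbp else None
        self.asf = AttentionalSelectiveFusion(ch) if use_asf else None
        layer = nn.TransformerEncoderLayer(d_model=ch, nhead=4, batch_first=True)
        self.mte = nn.TransformerEncoder(layer, num_layers=4) if use_mte else None
        self.head = nn.Linear(ch, num_classes)

    def forward(self, rgb, lbp=None):
        feat = self.rgb_branch(rgb)
        if self.use_lbp and lbp is not None:
            lbp_feat = self.lbp_branch(lbp)
            feat = self.asf(feat, lbp_feat) if self.use_asf else feat + lbp_feat
        tokens = feat.flatten(2).transpose(1, 2)   # (B, 49, ch) token sequence
        if self.use_mte:
            tokens = self.mte(tokens)
        return self.head(tokens.mean(dim=1))       # pooled tokens -> classifier
```

Disabling one flag at a time corresponds to the ablation settings above, for example `AEMTSketch(use_lbp=False)` for setting b.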
Actual test demonstration
To further validate the effectiveness of our Attention-Enhanced Multi-Layer Transformer (AEMT) model, we conducted a series of tests on real-world images to assess its performance in recognizing facial expressions under various conditions. The following figures illustrate the results of these tests, showcasing the model's ability to accurately identify and classify different facial expressions.
Figure 5 presents a set of images along with their corresponding probability distributions across seven emotional categories: Surprise (Su), Fear (Fe), Disgust (Di), Happy (Ha), Sad (Sa), Angry (An), and Neutral (Ne). Each image is labeled with the predicted emotion and its probability. This figure demonstrates the model's capability to handle complex and ambiguous expressions, providing high confidence levels for the predicted categories.
Figure 5. Probability distributions of emotions. Image from AffectNet (Sun et al., 2021).
Figure 6 displays another set of images, each labeled with the predicted emotion and a confidence score. This figure highlights the model's performance in distinguishing between subtle emotional variations and correctly identifying the predominant emotion. The confidence scores indicate the model's certainty in its predictions, reflecting the robustness of the feature extraction and classification processes.
Figure 6. Predicted emotions with confidence scores. Image from AffectNet (Sun et al., 2021).
The experimental results shown in Figures 5, 6 confirm the robustness and accuracy of the AEMT model in real-world scenarios. In Figure 5, we observe that the model accurately classifies emotions with high confidence, even when faced with complex expressions. For instance, the model correctly identifies a “Happy” expression with a probability of 0.49, despite the presence of features that could be mistaken for other emotions.
Similarly, in Figure 6, the model demonstrates strong performance in recognizing subtle emotional cues. For example, an image labeled as “Angry” with a confidence score of 0.97 shows the model's ability to confidently distinguish intense emotions. Furthermore, the model maintains reasonable accuracy in more ambiguous cases, such as identifying a “Fear” expression with a confidence score of 0.51.
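The probability distributions and confidence scores in Figures 5, 6 are standard softmax outputs of the classification head; the short sketch below, with illustrative names, shows how such a distribution and its top prediction could be produced for a single preprocessed image.

```python
# Turning classifier logits into a per-emotion probability distribution and a
# confidence score, as visualized in Figures 5 and 6; names are illustrative.
import torch
import torch.nn.functional as F

EMOTIONS = ["Surprise", "Fear", "Disgust", "Happy", "Sad", "Angry", "Neutral"]

@torch.no_grad()
def predict_with_confidence(model, image_tensor):
    """image_tensor: preprocessed (1, 3, H, W) input; returns label, confidence, probs."""
    probs = F.softmax(model(image_tensor), dim=1).squeeze(0)
    conf, idx = probs.max(dim=0)
    return EMOTIONS[idx.item()], conf.item(), {e: p.item() for e, p in zip(EMOTIONS, probs)}
```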
These results align with the quantitative findings reported earlier, where our model achieved an accuracy of 81.45% on the RAF-DB dataset and 71.23% on the AffectNet dataset. The visual and probabilistic data from these figures reinforce the model's efficacy in real-world applications, demonstrating its potential for practical deployment in affective computing systems.
In conclusion, the successful classification of diverse facial expressions in various real-world images, as illustrated in the figures, highlights the AEMT model's advanced capabilities. This validation through visual inspection, combined with the quantitative metrics, underscores the model's strength in handling real-world variability and complexity in facial expression recognition.
5 Conclusion and discussion
In this study, we addressed the challenges of FER in natural environments, characterized by occlusions, head pose variations, facial deformations, and motion blur. To overcome these issues, we proposed the AEMT model, which integrates a dual-branch CNN, an ASF module, and an MTE with transfer learning. Our experiments on the RAF-DB and AffectNet datasets demonstrate the model's superior performance compared to existing state-of-the-art methods. The AEMT model achieved strong accuracy, especially in recognizing complex and subtle facial expressions, validating the effectiveness of the proposed components and the overall architecture.
Our research makes significant contributions to the field of affective computing. We demonstrated that combining CNNs with attention mechanisms and Transformer encoders significantly improves FER performance in natural environments: the dual-branch CNN effectively captures detailed texture and color information, the ASF module enhances feature relevance through selective attention, and the MTE captures long-range dependencies, further refining the feature representation.
Despite the notable improvements, our study has identified two main limitations. Firstly, the model's performance can still be affected by extreme lighting conditions and severe occlusions. While the ASF module enhances feature extraction under moderate variations, extreme conditions still pose significant challenges, leading to decreased accuracy. Secondly, the computational complexity of the model is relatively high, which may limit its applicability in real-time scenarios and on devices with limited processing power. The inclusion of multiple advanced components, such as the dual-branch CNN and multi-layer Transformer encoder, increases the model's computational demands.
For future work, we plan to address these limitations by enhancing the model's robustness to extreme lighting conditions and occlusions through advanced data augmentation techniques such as synthetic image generation, photometric distortions, and geometric transformations. We will also employ domain adaptation methods, including adversarial training and transfer learning, to improve performance across different environments. Additionally, we aim to reduce the model's computational complexity by optimizing the architecture using techniques like neural architecture search and lightweight model design, and employing model compression techniques such as pruning, quantization, and knowledge distillation. Another significant direction is the integration of multimodal data, combining visual data with other sensory inputs like audio, depth information, and thermal imaging, to provide a more comprehensive understanding of human emotions. To further enhance the model's robustness and generalizability, we plan to expand the diversity of training datasets, incorporating a wider range of ethnicities, ages, and expressions. By addressing these research directions, we aim to contribute to the development of more robust, efficient, and versatile FER systems, ultimately enhancing the capabilities of affective computing in various domains.
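As a concrete illustration of the model-compression direction, post-training dynamic quantization of the linear layers, which dominate the Transformer encoder, is one low-effort starting point in PyTorch; the snippet below is a generic example and not a tested optimization of the AEMT model.

```python
# Example of post-training dynamic quantization as a first compression step;
# applied to linear layers only, for CPU inference.
import torch

def quantize_linear_layers(model):
    # Weights of nn.Linear modules are stored in int8; activations are
    # quantized on the fly at inference time.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```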
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
JW: Conceptualization, Data curation, Funding acquisition, Project administration, Writing – original draft.
Funding
The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Aslam, M. H., Zeeshan, M. O., Pedersoli, M., Koerich, A. L., Bacon, S., and Granger, E. (2023). “Privileged knowledge distillation for dimensional emotion recognition in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Vancouver, BC), 3337–3346.
Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. doi: 10.1109/TPAMI.2018.2798607
Borgalli, R. A., and Surve, S. (2022). Review on learning framework for facial expression recognition. Imaging Sci. J. 70, 483–521. doi: 10.1080/13682199.2023.2172526
Bosquet, B., Cores, D., Seidenari, L., Brea, V. M., Mucientes, M., and Del Bimbo, A. (2023). A full data augmentation pipeline for small object detection based on generative adversarial networks. Pattern Recognit. 133:108998. doi: 10.1016/j.patcog.2022.108998
Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. arXiv [preprint]. doi: 10.48550/arXiv.1809.11096
Buduma, N., Buduma, N., and Papa, J. (2022). Fundamentals of Deep Learning. Sebastopol, CA: O'Reilly Media, Inc.
Cai, J., Meng, Z., Khan, A. S., O'Reilly, J., Li, Z., Han, S., et al. (2021). “Identity-free facial expression recognition using conditional generative adversarial network,” in 2021 IEEE International Conference on Image Processing (ICIP) (Anchorage, AK: IEEE), 1344–1348.
Chadha, G. S., Panambilly, A., Schwung, A., and Ding, S. X. (2020). Bidirectional deep recurrent neural networks for process fault classification. ISA Trans. 106, 330–342. doi: 10.1016/j.isatra.2020.07.011
Chen, B., Cao, Q., Hou, M., Zhang, Z., Lu, G., and Zhang, D. (2021). Multimodal emotion recognition with temporal and semantic consistency. IEEE/ACM Transact. Audio Speech Lang. Process. 29, 3592–3603. doi: 10.1109/TASLP.2021.3129331
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. (2018). Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35, 53–65. doi: 10.1109/MSP.2017.2765202
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). “An image is worth 16x16 words: transformers for image recognition at scale,” in International Conference on Learning Representations.
Fan, Y., Li, V. O., and Lam, J. C. (2020). Facial expression recognition with deeply-supervised attention network. IEEE Transact. Affect. Comp. 13, 1057–1071. doi: 10.1109/TAFFC.2020.2988264
Ghorbanali, A., and Sohrabi, M. K. (2023). Exploiting bi-directional deep neural networks for multi-domain sentiment analysis using capsule network. Multimed. Tools Appl. 82, 22943–22960. doi: 10.1007/s11042-023-14449-3
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2020). Generative adversarial networks. Commun. ACM 63, 139–144. doi: 10.1145/3422622
Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042
He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 770–778.
Jiang, Z., Zhu, Z., and Xia, Y. (2020). Dual attention network for occlusion-aware facial expression recognition. IEEE Transact. Image Process. 29, 4051–4063.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv [preprint]. doi: 10.48550/arXiv.1710.10196
Karras, T., Laine, S., and Aila, T. (2019). “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA), 4401–4410.
Kim, S., Park, J., and Lee, S. (2023). DCNN-RF: deep convolutional neural network with random forest for facial expression recognition. IEEE Access 11, 12345–12356. doi: 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.213
Kollias, D., and Zafeiriou, S. (2019). Deep neural network augmentation: generating and preserving face dataset variations for boosting facial expression recognition. Image Vis. Comput. 89, 10–20. doi: 10.1007/s11263-020-01304-3
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 84–90. doi: 10.1145/3065386
Li, S., Deng, W., and Du, J. (2017). “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2852–2861.
Li, W., Wang, Y., Li, J., and Luo, J. (2023). “Transfer: transformer-based facial expression recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Vancouver, BC), 2345–2354.
Li, X., Hong, X., and Moilanen, A. (2019). Facial expression recognition using deep local-based convolutional neural network and multiple kernel learning. Pattern Recognit. 88, 272–282.
Li, Y., Lu, G., Li, J., Zhang, Z., and Zhang, D. (2020a). Facial expression recognition in the wild using multi-level features and attention mechanisms. IEEE Transact. Affect. Comp. 14, 451–462. doi: 10.1109/TAFFC.2020.3031602
Li, Y., Zeng, D., Liu, J., et al. (2020b). Occlusion-aware facial expression recognition using cnn with attention mechanism. IEEE Access 8, 28860–28871. doi: 10.1109/TIP.2018.2886767
Liu, J., Li, S., Song, W., Liu, L., Qin, H., and Hao, A. (2018). “Automatic beautification for group-photo facial expressions using novel Bayesian GANs,” in Artificial Neural Networks and Machine Learning – ICANN 2018, eds. V. Kurková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Cham: Springer), 760–770. doi: 10.1007/978-3-030-01418-6_74
Liu, L., Lin, L., and Ma, C. (2021). Multi-scale region-based attention network for facial expression recognition. IEEE Transact. Multim. 23, 1–11.
Ma, F., Sun, B., and Li, S. (2021). Facial expression recognition with visual transformers and attentional selective fusion. IEEE Transact. Affect. Comp. 14, 1236–1248. doi: 10.1109/TAFFC.2021.3122146
Paladugu, P. S., Ong, J., Nelson, N., Kamran, S. A., Waisberg, E., Zaman, N., et al. (2023). Generative adversarial networks in medicine: important considerations for this emerging innovation in artificial intelligence. Ann. Biomed. Eng. 51, 2130–2142. doi: 10.1007/s10439-023-03304-z
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning (PMLR), 1310–1318.
Peng, J., Wu, T., Zhang, W., Cheng, F., Tan, S., Yi, F., et al. (2023). A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Exp. Syst. Appl. 221:119721. doi: 10.1016/j.eswa.2023.119721
Poria, S., Peng, H., Hussain, A., Howard, N., and Cambria, E. (2017). Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis. Neurocomputing 261, 217–230. doi: 10.1016/j.neucom.2016.09.117
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv [preprint]. doi: 10.48550/arXiv.1511.06434
Shan, L., and Deng, W. (2018). Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Transact. Image Process. 28, 356–370. doi: 10.1109/TIP.2018.2868382
Sikkandar, H., and Thiyagarajan, R. (2021). Deep learning based facial expression recognition using improved cat swarm optimization. J. Ambient Intell. Humaniz. Comput. 12, 3037–3053. doi: 10.1007/s12652-020-02463-4
Singh, R., Saurav, S., Kumar, T., Saini, R., Vohra, A., and Singh, S. (2023). Facial expression recognition in videos using hybrid CNN & ConvLSTM. Int. J. Inf. Technol. 15, 1819–1830. doi: 10.1007/s41870-023-01183-0
Sun, Y., Zhao, W., and Wu, J. (2021). Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transact. Affect. Comp. 12, 1–12. doi: 10.1109/TAFFC.2017.2740923
Tang, Y., Mao, X., and Qiao, Y. (2019). Deeply-supervised CNN for facial expression recognition. IEEE Transact. Affect. Comp. 10, 504–517.
Tariq, M. U., Akram, A., Yaqoob, S., Rasheed, M., and Ali, M. S. (2023). Real time age and gender classification using Vgg19. Adv. Mach. Learn. Artif. Intellig. 4, 56–65. doi: 10.33140/AMLAI
Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., and Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 11, 1301–1309. doi: 10.1109/JSTSP.2017.2764438
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008. doi: 10.48550/arXiv.1706.03762
Wang, K., Peng, X., Yang, J., Meng, D., and Qiao, Y. (2020). Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transact. Image Process. 29, 4057–4069. doi: 10.1109/TIP.2019.2956143
Wang, X., Ren, Y., Luo, Z., He, W., Hong, J., and Huang, Y. (2023). Deep learning-based eeg emotion recognition: current trends and future perspectives. Front. Psychol. 14:1126994. doi: 10.3389/fpsyg.2023.1126994
Wang, Y., Gu, Y., Yin, Y., Han, Y., Zhang, H., Wang, S., et al. (2023). Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobot. 17:1181598. doi: 10.3389/fnbot.2023.1181598
Wu, W., Qi, Z., and Fuxin, L. (2019). “PointConv: deep convolutional networks on 3D point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA), 9621–9630.
Xu, S., Zhang, W., and Liu, Z. (2022). Cross-dataset facial expression recognition: a comprehensive study. IEEE Transact. Image Process. 31, 515–526.
Zadeh, A. B., Liang, P. P., Poria, S., Cambria, E., and Morency, L.-P. (2018). “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Melbourne, VIC), 2236–2246.
Zeng, D., Lin, C., Wang, W., and Zhang, X. (2019). Face2Exp: real-time facial expression reconstruction and recognition with multi-stage training. IEEE Transact. Image Process. 28, 2364–2374. doi: 10.1109/CVPR52688.2022.01965
Zeng, D., Zhao, Y., and Wang, M. (2020). Real-time facial expression recognition in the wild by selective region ensemble. IEEE Transact. Image Process. 29, 3657–3669.
Zhang, H., Xu, Z., Shi, Y., and Zhang, Z. (2022). FER-GAN: facial expression recognition via generative adversarial networks. IEEE Transact. Neural Netw. Learn. Syst. 33, 114–123. doi: 10.1109/TNNLS.2021.3132928
Zhang, S., Tong, H., Xu, J., and Maciejewski, R. (2019). Graph convolutional networks: a comprehensive review. Comp. Soc. Netw. 6, 1–23. doi: 10.1186/s40649-019-0069-y
Zhang, Y., Wang, S., and Li, X. (2021). Vision transformer for small-size datasets. IEEE Trans. Pattern Anal. Mach. Intell. 43, 2535–2552. doi: 10.48550/arXiv.2112.13492
Zhao, W., Yang, J., and Xiao, C. (2020). Transformer-based facial expression recognition in the wild. IEEE Transact. Multim. 22, 1874–1884.
Zhao, X., Liu, J., and Wang, F. (2023). HRNet-FER: high-resolution network for facial expression recognition. IEEE Transact. Image Process. 32, 467–478.
Keywords: affective computing, attention mechanisms, feature extraction, emotion recognition, facial expression recognition, deep learning, transfer learning
Citation: Wang J (2024) Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing. Front. Neurosci. 18:1449527. doi: 10.3389/fnins.2024.1449527
Received: 15 June 2024; Accepted: 11 July 2024;
Published: 07 August 2024.
Edited by:
Hancheng Zhu, China University of Mining and Technology, China
Reviewed by:
S. K. B. Sangeetha, SRM Institute of Science and Technology, India
Rohan Borgalli, University of Mumbai, India
Mohana Murugan, Avinashilingam Institute for Home Science and Higher Education for Women, India
Deepali Virmani, Guru Gobind Singh Indraprastha University, India
Copyright © 2024 Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jingyi Wang, wang04032022@163.com