- Sports Department of Zhejiang A&F University, Hangzhou, Zhejiang, China
Introduction: In recent years, with advancements in wearable devices and biosignal analysis technologies, sports performance analysis has become an increasingly popular research field, particularly due to the growing demand for real-time monitoring of athletes' conditions in sports training and competitive events. Traditional methods of sports performance analysis typically rely on video data or sensor data for motion recognition. However, unimodal data often fails to fully capture the neural state of athletes, leading to limitations in accuracy and real-time performance when dealing with complex movement patterns. Moreover, these methods struggle with multimodal data fusion, making it difficult to fully leverage the deep information from electroencephalogram (EEG) signals.
Methods: To address these challenges, this paper proposes a "Cerebral Transformer" model based on EEG signals and video data. By employing an adaptive attention mechanism and cross-modal fusion, the model effectively combines EEG signals and video streams to achieve precise recognition and analysis of athletes' movements. The model's effectiveness was validated through experiments on four datasets: SEED, DEAP, eSports Sensors, and MODA. The results show that the proposed model outperforms existing mainstream methods in terms of accuracy, recall, and F1 score, while also demonstrating high computational efficiency.
Results and discussion: The significance of this study lies in providing a more comprehensive and efficient solution for sports performance analysis. Through cross-modal data fusion, it not only improves the accuracy of complex movement recognition but also provides technical support for monitoring athletes' neural states, offering important applications in sports training and medical rehabilitation.
1 Introduction
The growing need for advanced athletic performance analysis has led to increased interest in leveraging Electroencephalography (EEG) data for real-time monitoring and performance enhancement (Cao and Li, 2021). EEG data not only reflects an athlete's neural state but also enables real-time tracking of focus, fatigue, and strategy adjustments during physical activities (Friesen and Park, 2022). Performance monitoring relies not only on external movement data but also on capturing internal neural dynamics, offering athletes a more comprehensive and personalized training regimen (Zhang and Jiang, 2020). Moreover, EEG data's real-time characteristics provide the potential for immediate feedback during physical activities, helping athletes optimize their techniques while preventing injuries (Rao and Zhang, 2023). Thus, using EEG data to enhance athletic performance is not only academically significant but also holds considerable potential in practical applications such as sports training and rehabilitation (Cote and Whelan, 2021).
To overcome the limitations of traditional athletic performance analysis methods that fail to effectively process EEG signals, early research relied on symbolic AI and knowledge representation. In these approaches, EEG signals were interpreted as symbolic information processed through predefined rules or logical reasoning (Wang and Song, 2021). These methods excelled in specific scenarios by leveraging structured knowledge, offering interpretability for certain athletic states. However, symbolic AI methods are heavily dependent on predefined knowledge bases, making them inadequate for handling the complex, nonlinear fluctuations found in EEG signals (Parihar and Acharya, 2021). Additionally, they struggled with the high-dimensional nature of EEG data, especially in contexts with individual athlete differences and diverse movement patterns (Fuentes and Gomez, 2022). To address the shortcomings of symbolic AI, researchers shifted toward data-driven methods (Lee and Kang, 2020).
As large-scale EEG datasets became available, data-driven and machine learning approaches began to dominate. These methods learn patterns from the data itself, without relying on predefined rules (Zhang and Zhao, 2021). Statistical models and traditional machine learning algorithms, such as Support Vector Machines (SVM), were employed to automatically extract features and classify EEG data (Li and Zhou, 2022). Compared to symbolic AI, data-driven methods significantly improved the handling of nonlinear EEG signals and complex athletic scenarios (Duan and Xiao, 2023). However, these methods relied on manual feature extraction, which did not fully capture all the rich information in EEG data, limiting performance when dealing with high-dimensional, noisy data (Li and Sun, 2021). Furthermore, machine learning methods often struggled with overfitting when data was limited or of lower quality (Sun and Gu, 2023). In response, deep learning became a promising solution to further automate feature extraction and improve accuracy (Gao and Li, 2023).
Deep learning revolutionized EEG signal analysis by providing automated feature extraction and modeling capabilities, particularly with Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) (Roy and Das, 2021). These models could automatically learn multi-layered features from large EEG datasets, greatly improving prediction accuracy (Zhang and Chen, 2023). Additionally, deep learning's end-to-end training capability allowed for direct learning from raw EEG signals to performance prediction, eliminating the need for complex manual feature design (Li and Wu, 2023). However, deep learning came with its own set of challenges, including high computational complexity and a strong dependence on large labeled datasets for training (Xu and Zhang, 2021). With the rise of pre-trained models, researchers began to leverage pre-trained deep learning models and apply transfer learning to EEG data, reducing the reliance on vast amounts of labeled data (Ma and Tang, 2022). While these methods enhanced automation and performance, they still faced challenges when processing multimodal data (such as EEG and video fusion), and computational complexity remained a barrier to real-time applications (Shah and Kumar, 2022).
To address these limitations, we propose the Cerebral Transformer model. This model leverages adaptive attention mechanisms and cross-modal fusion techniques to effectively integrate EEG signals with video data, overcoming the shortcomings of traditional deep learning methods in handling multimodal data. The model also introduces a pre-trained Transformer architecture, significantly reducing training complexity and making it more efficient when processing large-scale, high-dimensional EEG data.
• Cerebral Transformer integrates cross-modal attention mechanisms and efficiently fuses EEG and video data, excelling in multimodal data analysis.
• The method is highly versatile and efficient, suitable for multi-scenario athletic performance monitoring and capable of real-time processing of complex EEG and video data.
• Across multiple datasets, Cerebral Transformer outperforms existing methods in accuracy and recall while significantly reducing inference time, making it ideal for real-time applications.
2 Related work
2.1 EEG signals in sports performance
Electroencephalogram (EEG) signals, as a non-invasive tool for monitoring neural activity, have gained widespread attention in recent years in the field of sports performance analysis. EEG signals can reflect athletes' neural activities, helping to understand changes in focus, fatigue, and emotional states during physical activities. Early research mostly focused on using EEG signals in areas such as emotion analysis, fatigue detection, and neuro-rehabilitation (Wang et al., 2023). In sports performance analysis, researchers have begun to integrate EEG signals with motor control theory to study the relationship between neural activity and movement patterns. For example, some studies have analyzed athletes' EEG signals during competitions to reveal neural network activity patterns in the brain during complex movements (Neuwirth and Emenike, 2024). These studies suggest that EEG signals can be used to monitor athletes' neural states in real-time, providing insights for training adjustments and performance improvement (Zong et al., 2024). However, traditional EEG analysis methods often rely on handcrafted feature extraction, which is limited by high data dimensionality and significant noise interference, leading to poor model generalization. Recently, deep learning applications in EEG signal processing have increased, with methods like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) being used to extract spatiotemporal features from EEG data. However, these models still face challenges in fusing EEG data with other modalities (Pilacinski et al., 2024). Therefore, integrating EEG signals with other sports data to enhance the comprehensive understanding of sports performance has become an important research direction. Cheng C. et al. (2024) employ hierarchical spatiotemporal transformers to capture regional and global brain dynamics for emotion recognition, conceptually consistent with the adaptive attention strategy in our model; Ning et al. (2023) combine spatial, spectral, and temporal attention with meta-learning to enhance EEG emotion recognition, complementing the multi-scale fusion approach in our approach. Jia et al. (2024) introduce knowledge distillation techniques for heterogeneous multi-layer representations of sleep staging, inspiring our representation refinement approach. These studies provide a broader context for our work, highlighting the importance of powerful spatiotemporal attention and multimodal fusion strategies. By situating our model within these advances, we emphasize how our approach builds on and extends current approaches. The Introduction has been revised to reflect these discussions and to establish clearer connections to existing works, thereby enhancing the relevance and positioning of our contribution.
2.2 Video and motion sensor data in sports analysis
Traditional sports analysis methods mainly rely on video data or motion sensor data, which have been widely used in sports, human posture recognition, and health monitoring. The advantage of video data is its ability to capture dynamic movements, and deep learning methods like CNNs can extract key features such as posture and movement speed from videos (Minen et al., 2023). For instance, in human posture estimation, researchers can process video with multi-scale convolutional neural networks to efficiently identify key points of an athlete's body, such as elbows and knees, and calculate movement trajectories (Yu et al., 2022). However, video processing often requires significant computational resources, and performance can degrade when dealing with low-quality videos (Neuwirth and Whigham, 2023). Motion sensor data, such as accelerometers and gyroscopes, can provide more precise information on movement trajectories and acceleration, making them important for real-time motion monitoring. Traditional methods often use machine learning models based on statistical features to analyze these data, but these models typically struggle to capture complex spatiotemporal dependencies. With the rise of deep learning, methods like spatiotemporal CNNs (ST-GCNs) and temporal neural networks (e.g., LSTM) have increasingly been applied to motion sensor data, significantly improving recognition accuracy for complex movement patterns (Cheng S. et al., 2024). However, using video or sensor data alone often leads to information loss and cannot fully capture the athlete's internal neural state. Therefore, cross-modal data fusion has become a key research trend in this field (Pan J. et al., 2024).
2.3 Cross-modal fusion in sports analysis
As multimodal data becomes more accessible, cross-modal data fusion techniques have emerged as a crucial direction in sports performance analysis. Cross-modal fusion aims to effectively combine data from multiple sources (such as EEG signals, video, and motion sensor data) to provide a more comprehensive evaluation of sports performance (Yang et al., 2024). Traditional cross-modal fusion methods often employ early fusion or late fusion strategies. Early fusion merges data from different modalities at the input stage through simple concatenation or combination, while late fusion combines the predictions from independently trained models for each modality. However, these approaches can lead to information loss or modality inconsistency. Recently, attention-based cross-modal fusion methods have gained popularity (Neuwirth et al., 2023). Self-attention mechanisms can dynamically assign weights across different modalities, enabling efficient information integration. For example, some studies have introduced multimodal attention mechanisms in sports performance analysis to fuse EEG and video features, significantly improving the accuracy of action recognition (Pilacinski et al., 2024). Additionally, the Transformer model, known for its success in natural language processing, has been gradually applied to cross-modal data tasks. By incorporating global attention mechanisms, Transformers can capture long-range dependencies between different modalities, making them particularly suitable for handling spatiotemporally heterogeneous data like EEG signals and video. As cross-modal fusion technology continues to evolve, its application in sports performance analysis will help improve model accuracy and generalization, leading to comprehensive monitoring and precise analysis of athletes' conditions (Hu et al., 2021).
3 Methodology
3.1 Overview
The proposed model, referred to as the Cerebral Transformer for Athletic Performance, aims to enhance the recognition and analysis of complex athletic movements using EEG data and video inputs. This model builds upon advanced attention mechanisms, including multi-scale and hybrid attention, to effectively process and integrate the diverse temporal and spatial information present in athletic actions. By leveraging a transformer-based architecture, the model is capable of capturing intricate relationships within both the spatial dimensions of video inputs and the temporal sequences of EEG signals, thereby enabling a deeper understanding of athletic performance and related neural activities. The overall data flow of the model begins with preprocessing of raw EEG signals and video inputs, followed by feature extraction stages for both modalities. These extracted features are then passed through multiple attention layers designed to capture both local and global dependencies across the spatial-temporal domains. The attention mechanisms used include a hybrid of local self-attention and k-NN attention, allowing the model to focus on the most relevant segments of the input data while ignoring noisy or irrelevant information. Additionally, the model integrates a fusion mechanism to combine predictions from the separate EEG and video streams, resulting in more accurate and holistic action recognition (as shown in Figure 1).
Figure 1. Cerebral transformer architecture. The data flow starts with video frames processed by ResNet-MC to generate the Global GridToken (G), while the skeleton data generates a Joint Heat Map to extract Joint MapToken (J). These tokens are concatenated with the Class Token to form the MultiClass Token Z. After positional encoding, the tokens enter the Cerebral Transformer, which processes them using mechanisms like self-attention, k-NN attention, and Cross-Modal Attention. Finally, the output is passed through an MLP to produce the final classification result.
“Cross-modal fusion” refers to the process of integrating information from multiple modalities, such as EEG and fMRI data, to leverage complementary features from each modality. This fusion typically involves aligning and combining the spatial, temporal, and spectral features extracted from each modality to enhance model performance. By effectively integrating diverse types of information, cross-modal fusion can improve the robustness and accuracy of downstream tasks, such as emotion recognition or sleep staging, by capturing patterns that may not be discernible within a single modality alone. “Adaptive attention mechanism" is a dynamic technique that allows a model to focus on the most relevant features or regions of the input data during different stages of processing. Unlike static attention methods that assign fixed weights, adaptive attention dynamically adjusts its focus based on the data and task requirements, enabling the model to better capture complex spatial-temporal dependencies or modality-specific features. In our work, this mechanism is designed to prioritize features across modalities and time steps, allowing for more effective learning and generalization in EEG-based tasks.
In this section, we provide a detailed breakdown of the model's architecture and data flow. Section 3.2 describes the fundamental data preprocessing steps for EEG and video inputs, focusing on how raw signals are transformed into actionable features. Section 3.3 explores the core transformer components of the model, including the multi-scale attention mechanism designed to handle the varying durations of athletic actions. Finally, Section 3.4 covers the fusion strategy employed to combine EEG and video-based predictions for improved performance. These components are critical to the model's ability to adaptively process diverse types of input data and recognize complex athletic actions in real-time settings.
3.2 Preliminaries
In this work, we address the problem of recognizing and analyzing complex athletic performance using EEG signals and video data. Formally, let represent the input sequence of EEG signals recorded over time, where denotes the EEG data at time step t and d is the dimensionality of the EEG signal. Similarly, let denote the corresponding video frames, where represents the frame at time step t, with h, w, and c denoting the height, width, and number of color channels of the frame, respectively. The goal is to map these sequences to a set of actions or movement labels , where each corresponds to one of the possible action classes from a predefined set . To solve this problem, we define a model that learns a mapping , where the input consists of both EEG signals and video frames. The model must take into account both the spatial information present in the video frames and the temporal dependencies between consecutive frames and EEG signals. To do this, we utilize a transformer-based architecture that is well-suited for capturing both local and global dependencies across the input data. The core challenge lies in handling the high dimensionality and multimodal nature of the input. The EEG data provides temporal information about neural activity, while the video frames contain spatial and temporal information about the athlete's movement. Formally, the input can be represented as a joint distribution , where and are conditionally dependent on the latent state of the athlete's actions. The objective of the model is to maximize the likelihood of the observed labels, i.e.,
where θ denotes the model parameters.
To achieve this, the model employs a sequence of operations that include both attention mechanisms and feature extraction techniques to transform the raw EEG signals and video frames into a latent representation that is suitable for classification. Let and represent the hidden states for the EEG and video data, respectively, where dh and hv are the dimensionalities of the hidden states. These hidden representations are obtained through a series of linear transformations and attention-based layers.
At each time step t, the attention mechanism computes a context vector ct for both EEG and video data as follows:
where and represent the hidden states at time step j, and and are attention weights that indicate the relevance of the hidden states at time step j with respect to the current time step t.
The attention weights are computed using a scaled dot-product attention mechanism:
where , , , and are query and key vectors derived from the EEG and video hidden states, respectively.
The context vectors and are then passed through a final classification layer that outputs the predicted action labels for each time step. The overall loss function is defined as the cross-entropy between the predicted labels and the ground truth labels:
where yt, c is the ground truth label at time step t for class c, and is the predicted probability of class c at time step t.
Through this approach, the model is able to learn a joint representation of EEG and video data that captures both the neural activity and physical movements of the athlete, ultimately enabling accurate action recognition and performance analysis.
3.3 Multi-stream module
Building on the foundation laid in the preliminaries, the proposed model introduces a novel Adaptive Attention-based Multi-Stream Module to efficiently process the multimodal input data consisting of EEG signals and video frames. This module is designed to handle the complexity of both spatial and temporal dimensions, particularly in the context of recognizing athletic performance. The module integrates adaptive attention mechanisms and hierarchical feature extraction layers that are tailored for the unique characteristics of athletic movements and neural activity. The module is composed of two parallel streams—one for EEG signals and the other for video frames—with separate attention blocks dedicated to each modality. The adaptive attention mechanism dynamically adjusts the focus on relevant features based on the task at hand. This is achieved by employing both local and global attention layers to capture short-term and long-term dependencies within each modality, followed by a cross-modal attention block that fuses the features from both streams (as shown in Figure 2).
Figure 2. Illustration of the proposed Multi-Stream Module architecture and data flow. The model integrates EEG and video inputs through their respective encoders, followed by cross-modal fusion and adaptive attention mechanism. Key components such as Cross-Modal Attention Mechanism (CMAM), convolutional layers, GELU activations, and upsampling are used to process the extracted features and finally obtain the classification output. The figure gives a detailed overview of the hierarchical and modular structure of the proposed method.
3.3.1 EEG stream
The EEG stream processes raw neural signals through a series of attention layers designed to capture temporal dependencies in the data. Formally, let represent the hidden states of the EEG signal after passing through a temporal convolutional layer, where T is the number of time steps and denotes the hidden representation at time step t. We employ an adaptive attention mechanism that weighs the importance of different time steps based on the current state of the model. The attention weights are computed as:
where and are query and key vectors derived from the EEG hidden states. The resulting context vector is then computed as:
3.3.2 Video stream
Similarly, the video stream processes video frames through a spatial attention mechanism, followed by temporal attention to capture the dynamic nature of athletic performance. Let represent the hidden states of the video frames, where each is the hidden representation of the video frame at time step t. The spatial attention layer computes attention weights for each pixel within a frame, enabling the model to focus on the most relevant areas of the athlete's movement:
where and are query and key vectors derived from the video hidden states. The corresponding context vector for the video data is:
3.3.3 Cross-modal attention
To fully leverage the complementary nature of EEG signals and video data, we introduce a cross-modal attention block that fuses the information from both streams. This block is responsible for aligning the temporal sequences from EEG and video modalities and discovering cross-modal dependencies that are crucial for accurate performance analysis. The cross-modal attention weights are computed by combining the context vectors from both streams:
where and are the context vectors from the EEG and video streams, respectively. The final cross-modal context vector is then computed as:
3.3.4 Final prediction layer
The fused cross-modal representation is passed through a fully connected layer followed by a softmax operation to predict the final action class for each time step:
where WCross and bCross are the learned parameters of the final prediction layer.
This multi-stream architecture, powered by adaptive attention mechanisms, enables the model to dynamically adjust its focus based on the importance of various temporal segments and spatial regions, thus improving its ability to recognize complex athletic movements with high precision. By combining EEG and video inputs in this way, the model leverages the strengths of both data modalities, ultimately leading to more accurate and robust performance analysis.
3.4 Performance optimization and training strategy
To achieve optimal performance and efficiency, the model employs two critical strategies: a cyclic learning rate schedule and gradient clipping. These techniques ensure stability during training, enhance convergence speed, and prevent overfitting, allowing the model to generalize effectively across various athletic tasks.
3.4.1 Cyclic learning rate schedule
A cyclic learning rate schedule is used to accelerate convergence and avoid local minima during training. This schedule modulates the learning rate in a cyclical manner, enabling the model to explore different regions of the loss landscape early in training while settling into an optimal solution in the later stages. The learning rate ηt at time step t is given by:
where ηmin and ηmax represent the minimum and maximum learning rates, respectively, and Tcur and Tmax correspond to the current and total number of iterations in the training cycle. This schedule promotes better generalization by enabling the model to periodically adjust its learning rate, escaping local minima and progressively focusing on fine-tuning in the later training stages. The cyclical nature of the learning rate allows for more robust model training, especially in complex multimodal input scenarios such as athletic performance analysis, where high variability in the data can lead to unstable training. By regularly resetting the learning rate, the model avoids stagnation and maintains flexibility throughout the learning process.
3.4.2 Gradient clipping
To maintain stability in training deep transformer models, especially with multimodal inputs, gradient clipping is essential. This technique limits the magnitude of the gradient updates to prevent them from growing too large, which can lead to unstable training dynamics. The clipped gradient is computed as:
where gt is the original gradient at time step t, and τ is the clipping threshold. This method ensures that the gradients remain within a controlled range, stabilizing the learning process and preventing gradient explosion, which is particularly crucial when dealing with large-scale transformer architectures. Gradient clipping plays a key role in maintaining the smooth propagation of updates across multiple layers of the model, ensuring that the optimization process remains stable even when dealing with highly complex or noisy input data, such as EEG signals or fast-moving video frames in athletic performance scenarios. Without gradient clipping, the model could face divergence or overly aggressive updates, leading to suboptimal performance.
4 Experiment
4.1 Experimental details
To evaluate the performance of the proposed model, we conducted experiments on four publicly available datasets: SEED Dataset (Miller et al., 2014), DEAP Dataset (Tripathi et al., 2017), eSports Sensors Dataset (Smerdov et al., 2020), and MODA Dataset (Liu et al., 2023). The SEED dataset is widely used in emotion recognition research, containing EEG data from participants who watched various movie clips designed to elicit different emotions. The DEAP dataset focuses on emotion recognition based on physiological signals, including EEG and peripheral physiological signals, with a large number of subjects providing multimodal data. The eSports Sensors Dataset provides data related to professional gamers, offering EEG and physiological recordings captured during gameplay, making it highly relevant for real-time performance analysis. Lastly, the MODA dataset is a multimodal dataset designed for action recognition, comprising synchronized video and sensor data, including EEG recordings, offering a rich set of complex athletic activities for our task.
In our experiments, we meticulously designed the training and evaluation process to simulate a real-world application scenario, ensuring that the results would provide meaningful insights for practical deployment. Data from each dataset was split into distinct training, validation, and testing sets, with no overlap between subjects across these subsets to eliminate data leakage and ensure robust generalization. The SEED and DEAP datasets, which are relatively large and diverse, were divided into 70% training, 15% validation, and 15% testing to provide sufficient data for training while maintaining adequate samples for model evaluation. For the eSports Sensors and MODA datasets, the splits were adjusted to 60% training, 20% validation, and 20% testing to better accommodate the complexity of the tasks while ensuring enough samples for learning spatial and temporal dependencies critical for the model. For each dataset, we implemented a detailed and systematic hyperparameter tuning process to optimize model performance. The learning rate was initialized at 0.001, with a cyclic learning rate schedule employed to accelerate convergence during training by periodically varying the learning rate. The batch size was set to 64 for the SEED and DEAP datasets due to their smaller input dimensionality and lower computational demands, while it was reduced to 32 for the eSports Sensors and MODA datasets, which involve higher complexity and larger input dimensions. The training process was carried out for 100 epochs on the SEED and DEAP datasets, and extended to 150 epochs for the eSports Sensors and MODA datasets to allow the model to fully learn the intricate temporal and spatial relationships inherent to real-time athletic performance tasks. To mitigate overfitting, an early stopping strategy with a patience of 10 epochs was applied, whereby training was halted if no improvement in validation loss was observed for consecutive epochs. The training environment utilized the PyTorch framework, running on an NVIDIA A100 GPU, which provided the computational efficiency necessary to train the large-scale transformer architecture. For optimization, the Adam optimizer was employed with a weight decay set to 1 × 10−4, ensuring effective regularization to avoid overfitting. Additionally, gradient clipping with a threshold of 1.0 was implemented to stabilize updates and prevent exploding gradients during backpropagation. To further enhance generalization, dropout was applied with a probability of 0.5 across both the EEG and video streams, reducing reliance on specific features and improving robustness to unseen data. The evaluation of the model was carried out using a comprehensive set of metrics to capture both computational efficiency and predictive accuracy. Computational efficiency was quantified by measuring the training time (in seconds), inference time (in milliseconds), the total number of parameters (in millions), and floating point operations (FLOPs, in gigaflops). Accuracy metrics such as accuracy, recall, and F1 score were employed to assess the predictive power of the model across all tasks, ensuring a thorough evaluation of its ability to generalize. These metrics offered a detailed understanding of the model's strengths in various applications, including athletic performance analysis, emotion recognition, and real-time gaming scenarios. Such a robust evaluation framework ensures that the model not only excels in accuracy but also meets the computational demands of real-world deployment, providing a balance between performance and efficiency (Algorithm 1).
4.2 Experimental results and analysis
The results in Table 1 and Figure 3 demonstrate that our proposed model outperforms other SOTA methods across all metrics in both the SEED and DEAP datasets. On the SEED dataset, the model achieves an accuracy of 96.8%, which is higher than the best-performing baseline (CLIP with 95.67%). The proposed model also leads in recall, F1 score, and AUC, indicating its strong ability to capture the subtle patterns in the EEG data and video frames. The advantage of our model is especially prominent in terms of recall (95.45%), which implies that it is more effective in identifying correct instances of athletic performance actions or emotions compared to other models. On the DEAP dataset, a similar trend is observed. The model achieves a remarkable accuracy of 97.34%, which is significantly higher than both CLIP and Hybrid Transformer models. The high F1 score of 92.8% and AUC of 96.2% indicate the model's capacity to balance precision and recall while successfully distinguishing between various emotional states. These results can be attributed to the use of adaptive attention and efficient cross-modal fusion, which enables the model to better capture dependencies between EEG signals and video data. This fusion, in turn, improves the model's understanding of complex, real-time actions and emotions. The superior performance across both datasets supports the effectiveness of the proposed approach in handling multimodal data for athletic performance and emotion recognition.
Table 2 and Figure 4 presents the results on the eSports Sensors and MODA datasets, which are especially relevant for real-time performance analysis. Our model significantly outperforms other SOTA methods, achieving the lowest parameter count, FLOPs, inference time, and training time while maintaining high accuracy and recall scores. On the eSports Sensors dataset, the model achieves an accuracy of 89.45% with only 146.61 million parameters and 161.22 gigaflops, making it computationally efficient for real-time applications. This low computational complexity, combined with the high performance, demonstrates that our model is optimized for scenarios where real-time data processing is crucial, such as during gaming or athletic monitoring. Similarly, on the MODA dataset, our model achieves an accuracy of 97.13% while requiring fewer computational resources than any of the baselines. The lower inference time (140.80 ms) and training time (160.34 s) make our approach suitable for deployment in real-time action recognition systems. The efficiency is largely due to the adaptive attention mechanism and the reduction in redundant computations through efficient cross-modal fusion. By focusing on the most relevant parts of the input data, the model can minimize unnecessary processing, making it highly effective in real-time applications where both speed and accuracy are critical.
The ablation results on the SEED and DEAP datasets in Table 3 highlight the importance of the different components in the proposed model. When the cross-modal attention mechanism is removed, the accuracy drops significantly on both datasets (355.28 M parameters with 293.01 ms inference time on SEED), indicating that the ability to integrate information from both EEG and video modalities is crucial for achieving high performance. The drop in performance is especially noticeable in the recall and F1 score metrics, where the cross-modal attention helps in identifying complex patterns across different modalities. The removal of adaptive attention also leads to a notable decrease in performance, particularly in inference time and training time. The adaptive attention mechanism allows the model to dynamically focus on important parts of the data, improving efficiency. Without it, the model processes unnecessary information, resulting in higher computational costs and lower accuracy. Finally, the EEG stream component contributes significantly to performance, especially on the DEAP dataset, where removing it leads to increased inference time and a drop in accuracy. Overall, the full model, which combines all components, achieves the best results across all metrics, showing that each module plays a crucial role in optimizing both accuracy and computational efficiency.
Table 4 and Figure 5 presents the results of the ablation study on the eSports Sensors and MODA datasets, and the findings reinforce the importance of each model component. The cross-modal attention mechanism contributes significantly to the model's performance on both datasets, particularly in accuracy and recall. For example, removing cross-modal attention from the model leads to a drop in accuracy from 97.67 to 89.98% on the eSports Sensors dataset, and from 97.13 to 89.64% on the MODA dataset. This highlights the importance of integrating information from both the EEG and video streams to accurately recognize actions and states in real-time environments. The adaptive attention mechanism also plays a key role in optimizing the model's performance. Without it, the model's recall and F1 scores drop across both datasets, indicating that the ability to focus on the most important features in the data is crucial for accurate predictions. The EEG stream is particularly important for capturing the subtle neural patterns related to performance and actions in eSports and athletic datasets. Without the EEG stream, the model's accuracy and recall drop significantly, showing that the fusion of both EEG and video data is essential for capturing complex multimodal interactions. Overall, the full model outperforms all ablation variants, demonstrating that the combined effect of cross-modal attention, adaptive attention, and EEG stream fusion is necessary for achieving state-of-the-art performance in real-time applications.
Table 5 and Figure 6 shows the comparative results of our model with other SOTA methods on the Sleep-EDF and CWL EEG/fMRI datasets. The results clearly demonstrate the significant leading advantage of our model on all evaluation metrics, further validating its generalizability across different datasets. On the Sleep-EDF dataset, our model achieved an accuracy of 97.97%, a recall of 94.46%, an F1 score of 92.72%, and an AUC of 96.58%, all significantly superior to other methods. For example, compared to the CLIP model, which achieved an accuracy of 94.31%, our model shows an improvement of ~3.66%, proving its advantages in handling complex EEG signals. Moreover, the increases in recall and F1 score indicate a higher robustness of the model in sample balancing and fine-grained classification. On the CWL EEG/fMRI dataset, despite the dataset's challenge due to the heterogeneity of multimodal signals, our model still achieved an accuracy of 97.25% and a recall of 95.58%, which are 1 and 5.52% higher, respectively, than the BLIP model. Particularly, the AUC reached 96.68%, demonstrating comprehensive optimization in classification accuracy and stability. These experimental results verify the cross-dataset adaptability and multimodal data fusion capabilities of our model. Through adaptive attention mechanisms and efficient cross-modal feature extraction, our model not only surpasses existing methods in performance but also shows significant potential in computational efficiency and diverse data applications (Yuan et al., 2021).
Table 5. Comparison of SOTA methods on sleep-EDF (Korkalainen et al., 2019) and CWL EEG/fMRI datasets (Korkalainen et al., 2019).
We further validated the performance of our model by comparing it with six recently published state-of-the-art (SOTA) models, including AM-EEGNet, CareSleepNet, CoAtNet, CrossViT, EEG-Deformer, and DuA. These models are widely applied in the SEED and DEAP datasets and represent the latest advancements in the field. As shown in Table 6, our model significantly outperforms these comparative methods across all metrics. On the SEED dataset, our model achieved accuracy, recall, F1 score, and AUC of 97.23, 94.68, 93.89, and 95.74%, respectively, with the recall rate being 3.81% higher than the second-best model, DuA. On the DEAP dataset, our model achieved an accuracy of 98.43%, a recall of 93.9%, and an AUC of 96.24%, comprehensively surpassing other comparative models. These results demonstrate the significant advantages of our model in classification performance, robustness, and multimodal signal processing capabilities, highlighting its exceptional performance in complex data analysis tasks.
5 Conclusion and discussion
The primary goal of this study is to address the complex challenges in sports performance analysis, particularly in real-time monitoring and recognition of movements by integrating electroencephalogram (EEG) signals and video data. Traditional methods often fall short when handling multimodal data, especially in capturing cross-modal dependencies and ensuring real-time processing. To this end, we propose a novel EEG-driven model called the “Cerebral Transformer." This model effectively integrates EEG signals and video data through adaptive attention mechanisms and cross-modal fusion for precise analysis of sports performance. In our experiments, we validated the model using the SEED, DEAP, eSports Sensors, and MODA datasets. The results showed that our model outperformed six state-of-the-art (SOTA) models in terms of accuracy, recall, and F1 score. Additionally, ablation studies revealed that the cross-modal attention mechanism and adaptive attention mechanism significantly impact the model's performance, especially in efficiently processing the fusion of EEG signals and video data. Our approach achieved faster inference and training times, maintaining low parameter count and minimal floating-point operations, making it suitable for real-time sports monitoring scenarios.
Despite the significant experimental results, the model still has some limitations. We acknowledge several limitations in our study that warrant further discussion. First, the datasets used in our experiments, such as SEED, DEAP, eSports Sensors, and MODA, may introduce inherent biases due to their specific experimental setups and participant demographics. These biases could limit the generalizability of our model to broader populations or diverse real-world scenarios. For instance, the SEED and DEAP datasets primarily include controlled laboratory settings, which may not fully capture the variability of real-world conditions. Second, while our model is designed to process multimodal data effectively, real-time application poses significant challenges. These include the need for low-latency data processing, robust handling of noisy or incomplete signals, and ensuring computational efficiency on resource-constrained devices. Although we evaluated computational metrics such as inference time and FLOPs, further work is needed to optimize the model for real-time deployment without compromising accuracy. Lastly, while we demonstrated the efficacy of our model on a range of tasks, additional evaluation on larger and more diverse datasets, as well as under real-world conditions, is necessary to confirm its robustness and reliability. Future work could address these limitations by incorporating more diverse datasets, exploring domain adaptation techniques, and optimizing the model for deployment on edge devices.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
QS: Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Al-qaness, M. A., Dahou, A., Trouba, N. T., Abd Elaziz, M., and Helmi, A. M. (2024). TCN-inception: temporal convolutional network and inception modules for sensor-based human activity recognition. Fut. Gen. Comp. Syst. 160, 375–388. doi: 10.1016/j.future.2024.06.016
Cao, Y., and Li, X. (2021). EEG-based neural dynamics for athlete performance monitoring: a review. J. Sports Sci. Med. 20, 501–516. Available at: https://ieeexplore.ieee.org/abstract/document/9328561
Cheng, C., Liu, W., Feng, L., and Jia, Z. (2024). Emotion recognition using hierarchical spatial-temporal learning transformer from regional to global brain. Neural Netw. 179:106624. doi: 10.1016/j.neunet.2024.106624
Cheng, S., Jin, Y., Liang, Y., Jiang, L., and Wang, H. (2024). An efficient grasping shared control architecture for unpredictable and unspecified tasks. Front. Neurorobot. 18:1429952. doi: 10.3389/fnbot.2024.1429952
Cote, S., and Whelan, M. (2021). EEG-based performance optimization in sports rehabilitation: techniques and trends. IEEE Access 9, 21365–21375. Available at: https://ieeexplore.ieee.org/abstract/document/9328561
Dao, F., Zeng, Y., and Qian, J. (2024). Fault diagnosis of hydro-turbine via the incorporation of bayesian algorithm optimized cnn-lstm neural network. Energy 290:130326. doi: 10.1016/j.energy.2024.130326
Ding, Y., Li, Y., Sun, H., Liu, R., Tong, C., Liu, C., et al. (2024). EEG-deformer: a dense convolutional transformer for brain-computer interfaces. arXiv [preprint]. doi: 10.1109/JBHI.2024.3504604
Duan, J., and Xiao, T. (2023). Deep learning for EEG-based prediction of athlete performance: a survey. Expert Syst. Appl. 211:118452. doi: 10.1016/j.eswa.2022.118548
Friesen, M., and Park, H. (2022). Monitoring athlete focus and fatigue through eeg data: challenges and advancements. Front. Neurosci. 16:810022. Available at: https://iopscience.iop.org/article/10.1088/1741-2552/ac1982/meta
Fuentes, M., and Gomez, L. (2022). Symbolic methods for EEG data analysis in complex athletic environments. Cogn. Syst. Res. 75, 1–16. doi: 10.1016/j.cogsys.2022.05.002
Hu, S., Lei, B., Wang, S., Wang, Y., Feng, Z., and Shen, Y. (2021). Bidirectional mapping generative adversarial networks for brain MR to PET synthesis. IEEE Trans. Med. Imaging 41, 145–157. doi: 10.1109/TMI.2021.3107013
Gao, Y., and Li, Y. (2023). Deep learning for sports EEG data: challenges and applications. Neural Netw. 160, 107–122. Available at: https://link.springer.com/article/10.1007/s00521-021-06118-z
Jia, Z., Liang, H., Liu, Y., Wang, H., and Jiang, T. (2024). Distillsleepnet: heterogeneous multi-level knowledge distillation via teacher assistant for sleep staging. IEEE Transact. Big Data. doi: 10.1109/TBDATA.2024.3453763
Korkalainen, H., Aakko, J., Nikkonen, S., Kainulainen, S., Leino, A., Duce, B., et al. (2019). Accurate deep learning-based sleep staging in a clinical population with suspected obstructive sleep apnea. IEEE J. Biomed. Health Inf. 24, 2073–2081. doi: 10.1109/JBHI.2019.2951346
Lee, J., and Kang, S. (2020). Data-driven models for EEG-based athletic performance monitoring. IEEE Transact. Neural Netw. Learn. Syst. 31, 3308–3319.
Li, H., and Zhou, W. (2022). EEG classification in sports using support vector machines: a performance comparison. Comput. Intell. Neurosci. 2022. Available at: https://www.tandfonline.com/doi/full/10.1080/08839514.2021.2008612
Li, M., and Wu, J. (2023). EEG-based athlete performance prediction using pre-trained deep learning models. Sensors 23. Available at: https://www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2023.1194751/full
Li, Q., and Sun, P. (2021). Feature extraction methods for eeg data in sports training analysis. IEEE Access 9, 110235–110245.
Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., et al. (2024). Jamba: a hybrid transformer-mamba language model. arXiv [preprint]. doi: 10.48550/arXiv.2403.19887
Lin, P.-J., Li, W., Zhai, X., Sun, J., Pan, Y., Ji, L., et al. (2024). Am-EEGNET: an advanced multi-input deep learning framework for classifying stroke patient eeg task states. Neurocomputing 585:127622. doi: 10.1016/j.neucom.2024.127622
Liu, Y., Lin, L., Yu, F., Zhou, C., and Li, Y. (2023). “Moda: mapping-once audio-driven portrait animation with dual attentions”, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 23020–23029. Available at: https://openaccess.thecvf.com/content/ICCV2023/html/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.html
Ma, H., and Tang, L. (2022). Transfer learning for EEG-based performance prediction: a comprehensive review. Neurocomputing 503, 199–215.
Miller, C., Nagy, Z., and Schlueter, A. (2014). “A seed dataset for a public, temporal data repository for energy informatics research on commercial building performance,” in Proceedings of the 3rd Conf. on Future Energy Business & Energy Informatics (Rotterdam).
Minen, M. T., Lebowitz, N., Ekhtman, J., Oza, K., Yusaf, I., Katara, A., et al. (2023). A critical systematic review of k-12 neurology/neuroscience pipeline programs. Front. Med. 10:1281578. doi: 10.3389/fmed.2023.1281578
Neuwirth, L. S., and Emenike, B. U. (2024). Comment on “neurotoxicity and outcomes from developmental lead exposure: persistent or permanent?”. Environ. Health Perspect. 132:048001. doi: 10.1289/EHP14809
Neuwirth, L. S., Gökhan, N., Kaye, S., and Meehan, E. F. (2023). Taurine supplementation for 48-months improved glucose tolerance and changed atp-related enzymes in avians. Pharmacology 108, 599–606. doi: 10.1159/000533538
Neuwirth, L. S., and Whigham, K. (2023). “Only time will tell”: the underexplored impacts of lead poisoning and covid-19 on pre-existing aces in New York. Youth 3, 1212–1224. doi: 10.3390/youth3040077
Ning, X., Wang, J., Lin, Y., Cai, X., Chen, H., Gou, H., et al. (2023). Metaemotionnet: spatial-spectral-temporal based attention 3d dense network with meta-learning for EEG emotion recognition. IEEE Trans. Instrum. Meas. doi: 10.1109/TIM.2023.3338676
Pan, J., Zuo, Q., Wang, B., Chen, C. P., Lei, B., and Wang, S. (2024). Decgan: decoupling generative adversarial network for detecting abnormal neural circuits in Alzheimer's disease. IEEE Transact. Artif. Intell. doi: 10.1109/TAI.2024.3416420
Pan, Y., Liu, Q., Liu, Q., Zhang, L., Huang, G., Chen, X., et al. (2024). Dua: dual attentive transformer in long-term continuous eeg emotion analysis. arXiv [preprint]. doi: 10.48550/arXiv.2407.20519
Pang, R. Y., Santy, S., Just, R., and Reinecke, K. (2024). “Blip: facilitating the exploration of undesirable consequences of digital technologies” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 1–18. Available at: https://dl.acm.org/doi/full/10.1145/3613904.3642054
Panyarak, W., Suttapak, W., Mahasantipiya, P., Charuakkra, A., Boonsong, N., Wantanajittikul, K., et al. (2024). CrossViT with ECAP: enhanced deep learning for jaw lesion classification. Int. J. Med. Inf. 193:105666. doi: 10.1016/j.ijmedinf.2024.105666
Parihar, R., and Acharya, R. (2021). Knowledge-based methods for EEG data interpretation in sports training. J. Artif. Intell. Res. 71, 115–137. Available at: https://www.frontiersin.org/articles/10.3389/fnins.2021.690633/full
Pilacinski, A., Christ, L., Boshoff, M., Iossifidis, I., Adler, P., Miro, M., et al. (2024). Human in the collaborative loop: a strategy for integrating human activity recognition and non-invasive brain-machine interfaces to control collaborative robots. Front. Neurorobot. 18:1383089. doi: 10.3389/fnbot.2024.1383089
Rao, H., and Zhang, H. (2023). EEG-based real-time feedback systems for injury prevention in athletes. Sensors 23, 250–266. Available at: https://www.mdpi.com/2079-6374/13/10/930
Roy, S., and Das, S. (2021). Deep learning approaches for real-time EEG analysis in sports training. IEEE Transact. Biomed. Eng. 68, 523–534.
Shah, P., and Kumar, R. (2022). Multimodal eeg and video data fusion for comprehensive athlete performance analysis. Pattern Recognit. 129. Available at: https://www.sciencedirect.com/science/article/pii/S1084804519303078
Smerdov, A., Zhou, B., Lukowicz, P., and Somov, A. (2020). Collection and validation of psychophysiological data from professional and amateur players: a multimodal esports dataset. arXiv [preprint]. doi: 10.48550/arXiv.2011.00958
Sun, W., and Gu, Y. (2023). Mitigating overfitting in EEG-based performance prediction models through data augmentation. Pattern Recognit. Lett. 161, 88–95.
Sun, Z., Fang, Y., Wu, T., Zhang, P., Zang, Y., Kong, S., et al. (2024). “Alpha-clip: a clip model focusing on wherever you want,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13019–13029. Available at: https://openaccess.thecvf.com/content/CVPR2024/html/Sun_Alpha-CLIP_A_CLIP_Model_Focusing_on_Wherever_You_Want_CVPR_2024_paper.html
Tripathi, S., Acharya, S., Sharma, R., Mittal, S., and Bhattacharya, S. (2017). “Using deep and convolutional neural networks for accurate emotion classification on deap data,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 4746–4752. Available at: https://ojs.aaai.org/index.php/AAAI/article/view/19105
Wang, J., Zhao, S., Jiang, H., Zhou, Y., Yu, Z., Li, T., et al. (2024). CareSleepNet: a hybrid deep learning network for automatic sleep staging. IEEE J. Biomed. Health Inf . 1–14. doi: 10.1109/JBHI.2024.3426939
Wang, Y., and Song, L. (2021). Symbolic ai in eeg signal processing for athletic performance analysis: limitations and opportunities. Neurocomputing 447, 312–324.
Wang, Y., Zhu, C., Zuo, D., Liu, J., and Liu, D. (2023). The effect of emotional motivation on strategy flexibility: the moderating role of task load. Front. Psychol. 14:1241131. doi: 10.3389/fpsyg.2023.1241131
Xu, F., and Zhang, Y. (2021). Complexity challenges in deep learning for real-time EEG applications in sports. IEEE Transact. Comp. Soc. Syst. 8, 1445–1455. Available at: https://dl.acm.org/doi/abs/10.1145/3648469
Yang, S., Yin, Y., Qiu, Z., and Meng, Q. (2024). Research application of session-rpe in monitoring the training load of elite endurance athletes. Front. Neurosci. 18:1341972. doi: 10.3389/fnins.2024.1341972
You, S., Qin, Y., Yan, L., Zhang, H., Zhu, J., Yu, S., et al. (2024). “An improved method for pneumonia image classification based on coatnet,” in 2024 5th International Conference on Computer Engineering and Application (ICCEA) (IEEE), 1052–1056. Available at: https://ieeexplore.ieee.org/abstract/document/10603452
Yu, W., Lei, B., Wang, S., Liu, Y., Feng, Z., Hu, Y., et al. (2022). Morphological feature visualization of Alzheimer's disease via multidirectional perception gan. IEEE Transact. Neural Netw. Learn. Syst. 34, 4401–4415. doi: 10.1109/TNNLS.2021.3118369
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., et al. (2021). “Tokens-to-token vit: training vision transformers from scratch on imagenet,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 558–567. Available at: https://openaccess.thecvf.com/content/ICCV2023/html/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.html
Zhang, H., and Chen, L. (2023). Automatic feature extraction from eeg data using deep neural networks for athlete performance monitoring. IEEE Access 11, 3208–3221.
Zhang, L., and Jiang, M. (2020). Neural dynamics of fatigue detection in sports using EEG signals. IEEE Transact. Neural Syst. Rehabil. Eng. 28, 1562–1570.
Zhang, M., and Zhao, L. (2021). Machine learning applications in EEG-based athletic performance prediction. Neural Comp. Appl. 33, 10345–10358. Available at: https://www.mdpi.com/2076-3425/11/6/698
Keywords: EEG signals, sports performance analysis, cross-modal fusion, attention mechanism, transformer
Citation: Sun Q (2024) EEG-powered cerebral transformer for athletic performance. Front. Neurorobot. 18:1499734. doi: 10.3389/fnbot.2024.1499734
Received: 21 September 2024; Accepted: 22 November 2024;
Published: 20 December 2024.
Edited by:
Andrea Slézia, Hungarian Research Network, HungaryReviewed by:
Shuqiang Wang, Chinese Academy of Sciences (CAS), ChinaZiyu Jia, Chinese Academy of Sciences (CAS), China
Copyright © 2024 Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Qikai Sun, eWluc2ViZW5jaGkmI3gwMDA0MDsxNjMuY29t