AUTHOR=Wang Haiyan
TITLE=Multimodal audio-visual robot fusing 3D CNN and CRNN for player behavior recognition and prediction in basketball matches
JOURNAL=Frontiers in Neurorobotics
VOLUME=18
YEAR=2024
URL=https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2024.1284175
DOI=10.3389/fnbot.2024.1284175
ISSN=1662-5218
ABSTRACT=In the field of deep learning, the analysis of multimodal data has gained significant attention. This study introduces a multimodal audio-visual robotic framework that combines a 3D CNN, a CRNN, and an LSTM to recognize and predict player behavior in basketball games. For visual analysis, the 3D CNN captures spatiotemporal information from video frames, enabling the identification of player movements and positions. In the auditory domain, the CRNN analyzes real-time speech, providing contextual insight into the game. The two modalities are integrated through a multimodal fusion layer for comprehensive analysis. For action recognition and prediction, the LSTM plays a pivotal role: the model classifies player actions and learns feature representations, and the LSTM then models historical action sequences to predict future ones. The model is efficient, with training times of 800 seconds on NBA PTD and 700 seconds on SD, and inference times of 4 milliseconds on NBA PTD and 3.5 milliseconds on SD. It outperforms competing models across both datasets in RMSE, MAPE, MAE, and R2, demonstrating robustness and generalizability; in particular, its R2 consistently exceeds that of other models, indicating strong agreement between the model's predictions and actual player behavior. This study provides a powerful approach for real-time basketball game analysis and offers insight into the potential of multimodal data in deep learning.
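The abstract describes a pipeline: a 3D CNN branch for video, a CRNN branch for audio, a fusion layer that combines the two, and an LSTM over the fused sequence. The pure-Python sketch below illustrates only that data flow; the encoders here are hypothetical mean-pooling and energy stubs standing in for the 3D CNN and CRNN, fusion is plain concatenation, and all dimensions and weights are illustrative assumptions, not the paper's actual architecture.

```python
import math
import random

random.seed(42)

V_DIM, A_DIM, H_DIM = 4, 3, 5  # visual, audio, hidden sizes (illustrative only)
IN_DIM = V_DIM + A_DIM         # size of the fused feature vector

def encode_video(frames):
    """Stand-in for the 3D CNN branch: mean-pool each frame's pixels,
    then pad/trim to a fixed V_DIM-length feature vector."""
    pooled = [sum(f) / len(f) for f in frames]
    return (pooled + [0.0] * V_DIM)[:V_DIM]

def encode_audio(samples):
    """Stand-in for the CRNN branch: crude per-chunk energy features."""
    chunk = max(1, len(samples) // A_DIM)
    return [sum(abs(s) for s in samples[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(A_DIM)]

def fuse(v_feat, a_feat):
    """Multimodal fusion layer, reduced here to feature concatenation."""
    return v_feat + a_feat

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# One set of LSTM gate weights over the concatenation [x; h] (biases omitted).
W_i, W_f, W_o, W_g = (rand_matrix(H_DIM, IN_DIM + H_DIM) for _ in range(4))

def lstm_step(x, h, c):
    """A single LSTM cell update over the fused feature vector x."""
    z = x + h  # concatenate input and previous hidden state
    i = [sigmoid(v) for v in matvec(W_i, z)]       # input gate
    f = [sigmoid(v) for v in matvec(W_f, z)]       # forget gate
    o = [sigmoid(v) for v in matvec(W_o, z)]       # output gate
    g = [math.tanh(v) for v in matvec(W_g, z)]     # candidate cell state
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

# Toy sequence: three timesteps, each with a video clip and an audio clip.
h, c = [0.0] * H_DIM, [0.0] * H_DIM
for _ in range(3):
    frames = [[random.random() for _ in range(8)] for _ in range(V_DIM)]
    samples = [random.uniform(-1, 1) for _ in range(30)]
    x = fuse(encode_video(frames), encode_audio(samples))
    h, c = lstm_step(x, h, c)

print(len(h))  # → 5; final hidden state summarising the fused history
```

In the paper's full model, the stub encoders would be replaced by trained 3D CNN and CRNN feature extractors, and the final hidden state would feed a classifier over player actions.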