ORIGINAL RESEARCH article

Front. Robot. AI
Sec. Biomedical Robotics
Volume 11 - 2024 | doi: 10.3389/frobt.2024.1407519
This article is part of the Research Topic Advancing Soft, Tactile and Haptic Technologies: Recent Developments for Healthcare Applications

Bridging Vision and Touch: Advancing Robotic Interaction Prediction with Self-Supervised Multimodal Learning

Provisionally accepted
  • University College London, London, England, United Kingdom

The final, formatted version of the article will be published soon.

    Predicting the consequences of an agent's actions on its environment is a pivotal challenge in robotic learning and plays a key role in developing higher cognitive skills for intelligent robots. While current methods have predominantly relied on vision and motion data to generate predicted videos, more comprehensive sensory perception is required for complex physical interactions such as contact-rich manipulation or highly dynamic tasks. In this work, we investigate the interdependence between vision and tactile sensation in the scenario of dynamic robotic interaction. A multi-modal fusion mechanism is introduced into the action-conditioned video prediction model to forecast future scenes, enriching the single-modality prototype with a compressed latent representation of multiple sensory inputs. Additionally, to realize this interactive setting, we built a robotic interaction system equipped with both web cameras and vision-based tactile sensors to collect a dataset of vision-tactile sequences and the corresponding robot action data. Finally, through a series of qualitative and quantitative comparative studies of different prediction architectures and tasks, we present an analysis of the cross-modal influence among vision, touch, and action, revealing the asymmetric impact these sensations have when contributing to the interpretation of environmental information. This opens possibilities for more adaptive and efficient robotic control in complex environments, with implications for dexterous manipulation and human-robot interaction.
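
    To make the fusion idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch. It is not the authors' published architecture; the module name, layer sizes, image resolution, and the 7-DoF action dimension are all assumptions. It only shows one generic way to encode camera and tactile image streams separately and compress them, together with the robot action, into a single latent vector that could condition a video prediction backbone.

    import torch
    import torch.nn as nn

    # Illustrative sketch only; layer sizes and names are assumptions,
    # not the architecture described in the paper.
    class VisuoTactileFusion(nn.Module):
        def __init__(self, latent_dim=128, action_dim=7):
            super().__init__()
            # Separate convolutional encoders for the camera and tactile images.
            self.vision_enc = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.tactile_enc = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            # Compress the concatenated sensory features and the action command
            # into one latent vector for an action-conditioned predictor.
            self.fuse = nn.Sequential(
                nn.Linear(64 + 64 + action_dim, 256), nn.ReLU(),
                nn.Linear(256, latent_dim),
            )

        def forward(self, rgb, tactile, action):
            z = torch.cat(
                [self.vision_enc(rgb), self.tactile_enc(tactile), action], dim=-1
            )
            return self.fuse(z)

    # Example usage with dummy tensors: batch of 2, 64x64 images, 7-DoF action.
    fusion = VisuoTactileFusion()
    latent = fusion(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64), torch.randn(2, 7))
    print(latent.shape)  # torch.Size([2, 128])

    In such a setup, dropping either encoder recovers a single-modality baseline, which is one simple way the cross-modal comparisons described in the abstract could be framed.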

    Keywords: predictive learning, self-supervised learning, physical robotic interaction, information fusion and compression, multi-modal sensing

    Received: 26 Mar 2024; Accepted: 10 Sep 2024.

    Copyright: © 2024 Li and George Thuruthel. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Thomas George Thuruthel, University College London, London, WC1E 6BT, England, United Kingdom

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.