ORIGINAL RESEARCH article

Front. Neurorobot.
Volume 18 - 2024 | doi: 10.3389/fnbot.2024.1513354
This article is part of the Research Topic Multi-modal Learning with Large-scale Models

NavBLIP: A Visual-Language Model for Enhancing Unmanned Aerial Vehicles Navigation and Object Detection

Provisionally accepted
  Rufeng Chen • Baotou Iron and Steel Vocational Technical College, Baotou, China

The final, formatted version of the article will be published soon.

    In recent years, Unmanned Aerial Vehicles (UAVs) have been deployed in a growing range of applications, including autonomous navigation, surveillance, and object detection. Traditional methods for UAV navigation and object detection rely on either handcrafted features or unimodal deep learning approaches. While these methods have seen some success, they often break down in dynamic environments, where robustness and computational efficiency are critical for real-time performance, and they frequently fail to integrate multimodal inputs effectively, which restricts their adaptability and generalization in complex and diverse scenarios. To address these challenges, we introduce NavBLIP, a novel visual-language model designed to enhance UAV navigation and object detection by exploiting multimodal data. NavBLIP combines transfer learning techniques with a Nuisance-Invariant Multimodal Feature Extraction (NIMFE) module. The NIMFE module disentangles task-relevant features from complex visual and environmental inputs, allowing UAVs to adapt quickly to new environments and improving object detection accuracy. NavBLIP also employs a multimodal control strategy that dynamically selects context-specific features to optimize real-time performance, ensuring efficiency in high-stakes operations. Extensive experiments on benchmark datasets such as RefCOCO, CC12M, and OpenImages show that NavBLIP outperforms existing state-of-the-art models in accuracy, recall, and computational efficiency. An ablation study further confirms the contributions of the NIMFE and transfer learning components, underscoring NavBLIP's potential for real-time UAV applications where adaptability and computational efficiency are paramount.
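    To make the abstract's architectural description concrete, the sketch below illustrates one way a nuisance-invariant multimodal feature extractor with dynamic, context-specific feature selection could be implemented. It is a minimal sketch under stated assumptions, not the paper's published architecture: the layer sizes, the sigmoid gating mechanism, and the class name NIMFESketch are all illustrative choices introduced here.

```python
import torch
import torch.nn as nn

class NIMFESketch(nn.Module):
    """Hypothetical sketch of a nuisance-invariant multimodal feature
    extractor: project visual and textual inputs into a shared space,
    then gate the fused features so context-relevant dimensions dominate
    and nuisance (environment-specific) components are suppressed.
    All dimensions and the gating design are assumptions for exposition."""

    def __init__(self, vis_dim=2048, txt_dim=768, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)  # visual encoder output -> shared space
        self.txt_proj = nn.Linear(txt_dim, hid_dim)  # language encoder output -> shared space
        # Per-dimension relevance gate over the concatenated features;
        # low scores damp nuisance components in the fused representation.
        self.gate = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.Sigmoid())
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, vis_feat, txt_feat):
        v = torch.relu(self.vis_proj(vis_feat))
        t = torch.relu(self.txt_proj(txt_feat))
        joint = torch.cat([v, t], dim=-1)
        g = self.gate(joint)                      # relevance scores in [0, 1]
        return g * torch.relu(self.fuse(joint))  # gated, context-selected features

# Usage with dummy inputs: pooled image features and instruction embeddings.
if __name__ == "__main__":
    model = NIMFESketch()
    vis = torch.randn(4, 2048)  # e.g., pooled CNN/ViT features
    txt = torch.randn(4, 768)   # e.g., pooled BERT-style embeddings
    out = model(vis, txt)
    print(out.shape)            # torch.Size([4, 512])
```

    In a navigation pipeline, gated features of this kind would feed downstream detection and control heads; the gating is what lets the fused representation shift with context rather than remaining fixed across environments.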

    Keywords: UAV navigation, object detection, multimodal learning, transfer learning, computational efficiency

    Received: 18 Oct 2024; Accepted: 27 Nov 2024.

    Copyright: © 2024 Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Rufeng Chen, Baotou Iron and Steel Vocational Technical College, Baotou, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.