The final, formatted version of the article will be published soon.
ORIGINAL RESEARCH article
Front. Comput. Sci.
Sec. Computer Vision
Volume 7 - 2025
doi: 10.3389/fcomp.2025.1510252
This article is part of the Research Topic Foundation Models for Healthcare: Innovations in Generative AI, Computer Vision, Language Models, and Multimodal Systems
MoNetViT: An Efficient Fusion of CNN and Transformer Technologies for Visual Navigation Assistance with Multi Query Attention
Provisionally accepted
- 1 Politeknik Negeri Semarang, Semarang, Indonesia
- 2 Diponegoro University, Semarang, Central Java, Indonesia
ArUco markers are crucial for navigation in complex indoor environments, especially for people with visual impairments. Traditional CNNs handle image segmentation well, whereas transformers excel at capturing long-range dependencies, which are essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model that combines CNNs and MobileViT in a dual-path encoder to capture both global and spatial image details. This design reduces complexity and improves segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show that MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting ArUco markers, making it a promising tool for improving navigation aids for the visually impaired.
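For readers unfamiliar with multi-query attention, the following is a minimal NumPy sketch of the generic formulation, in which several query heads share a single key/value projection. It is not the authors' MQA module, whose multi-scale design may differ; the function name, weight shapes, and dimensions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-query attention (illustrative, not the paper's module).

    x:   (n_tokens, d_model) input features
    w_q: (d_model, num_heads * d_head) per-head query projections
    w_k: (d_model, d_head) single key projection shared by all heads
    w_v: (d_model, d_head) single value projection shared by all heads
    w_o: (num_heads * d_head, d_model) output projection
    """
    n_tokens = x.shape[0]
    d_head = w_k.shape[1]

    # Queries: one set per head -> (num_heads, n_tokens, d_head)
    q = (x @ w_q).reshape(n_tokens, num_heads, d_head).transpose(1, 0, 2)
    # Keys and values: computed once and reused by every head
    k = x @ w_k  # (n_tokens, d_head)
    v = x @ w_v  # (n_tokens, d_head)

    # Scaled dot-product scores; k.T broadcasts across the head axis
    scores = q @ k.T / np.sqrt(d_head)  # (num_heads, n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n_tokens, d_head)

    # Concatenate heads and project back to d_model
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, num_heads * d_head)
    return concat @ w_o, weights


# Hypothetical sizes: 16 tokens (e.g. flattened image patches), 4 heads
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads, d_head = 16, 32, 4, 8
x = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_model, num_heads * d_head)) * 0.1
w_k = rng.standard_normal((d_model, d_head)) * 0.1
w_v = rng.standard_normal((d_model, d_head)) * 0.1
w_o = rng.standard_normal((num_heads * d_head, d_model)) * 0.1

out, attn = multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)
```

Because the key and value projections are shared, their parameter count and cache size do not grow with the number of heads, which is one reason this family of attention is often used in lightweight models.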
Keywords: indoor navigation, computer vision, markers, assistive technology, mobile devices
Received: 12 Oct 2024; Accepted: 21 Jan 2025.
Copyright: © 2025 Triyono, Gernowo and Prayitno. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Liliek Triyono, Politeknik Negeri Semarang, Semarang, Indonesia
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.