A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision

Alexey Magay, Dhurba Tripathi, Yu Hao, Yi Fang*
Embodied AI and Robotics (AIR) Lab, New York University Abu Dhabi, UAE
*Corresponding Author
ACVR 2024

Overview


The lightweight camera mounts on glasses, while the fine-tuned MLLM provides spatially aware feedback to users.

Abstract

People with blindness and low vision (pBLV) face significant challenges in navigating environments and locating objects because of limited visual cues. Spatial reasoning is crucial for these individuals: it enables them to understand and interpret the spatial relationships in their surroundings, allowing them to navigate and interact more safely and independently. Current multimodal large language models (MLLMs) aimed at people with low vision lack the spatial reasoning capabilities needed to assist effectively in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to effectively perceive and interact with their surrounding environment. In this paper, we propose a novel spatially enhanced MLLM-based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning capabilities, our method significantly improves the understanding of environmental context, which is critical for navigation and object recognition. The innovation extends to a hardware component designed as an attachment for glasses, ensuring increased accessibility and ease of use. This integration leverages advanced vision-language models to interpret visual data and provide real-time, spatially aware feedback to the user. Our approach aims to bridge the gap between advanced machine learning models and practical, user-friendly assistive devices, offering a robust solution that lets visually impaired users navigate their surroundings more effectively and independently. We present an in-depth evaluation on the VizWiz dataset, together with a comprehensive dataset we design to evaluate the method in real-world situations; both demonstrate substantial improvements in accuracy and user experience.
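To make the fine-tuning step concrete, the minimal sketch below shows one way such spatial-reasoning tuning could be set up, assuming a LLaVA-1.5 backbone with LoRA adapters from the PEFT library; the model identifier, hyperparameters, prompt template, and data format are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch: LoRA fine-tuning of a LLaVA-style MLLM on spatial VQA triples.
# Model name, hyperparameters, and prompt format are assumptions for illustration,
# not the configuration used in the paper.
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"          # assumed backbone
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Attach lightweight LoRA adapters so only a small fraction of parameters
# is updated during spatial-reasoning tuning.
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def training_step(image, question, answer):
    """One supervised step on an (image, spatial question, answer) triple."""
    prompt = f"USER: <image>\n{question}\nASSISTANT: {answer}"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    labels = inputs["input_ids"].clone()
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    return outputs.loss.item()

In practice this step would sit inside a normal training loop with an optimizer and a dataloader over image-question-answer triples; only the single supervised step is shown here.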

System Pipeline


System flow with wearable camera, question input, and LLaVA-based answer generation for spatially grounded interaction.
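As a rough illustration of this flow, the sketch below grabs a frame from the glasses-mounted camera, pairs it with the user's question, and queries a LLaVA-style model for a spatially grounded answer. The camera index, model name, and prompt template are assumptions rather than the deployed system's exact settings.

# Minimal sketch of the run-time loop implied by the pipeline figure:
# capture a frame from the wearable camera, combine it with the user's
# question, and generate a spatially aware answer with a LLaVA-style model.
import cv2
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"          # assumed backbone
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

def answer_spatial_question(question: str, camera_index: int = 0) -> str:
    # Capture a single frame from the glasses-mounted camera.
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("Could not read a frame from the camera")
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # Ask the fine-tuned MLLM for a spatially grounded answer.
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

# Example of a query a pBLV user might ask:
# print(answer_spatial_question("Is there a chair to my left, and how far is it?"))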

Results


Examples from the LVSQA dataset covering spatial navigation, distance estimation, and spatial-relationship questions.


The app interface supports audio and text queries and returns responses from the fine-tuned model.
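The audio query path could be sketched as follows, reusing answer_spatial_question() from the pipeline sketch above. The page does not describe the app's actual speech stack, so the speech_recognition and pyttsx3 libraries here are purely illustrative assumptions.

# Minimal sketch of the audio query path shown in the app screenshot:
# transcribe a spoken question, query the fine-tuned model, and read the
# answer back to the user. The speech libraries are assumptions.
import speech_recognition as sr
import pyttsx3

def handle_audio_query(wav_path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:        # recorded user question
        audio = recognizer.record(source)
    question = recognizer.recognize_google(audio) # speech-to-text

    answer = answer_spatial_question(question)    # defined in the sketch above

    tts = pyttsx3.init()                          # spoken feedback to the user
    tts.say(answer)
    tts.runAndWait()
    return answer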

BibTeX


        @inproceedings{magay2024lv,
          title={A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision},
          author={Magay, Alexey and Tripathi, Dhurba and Hao, Yu and Fang, Yi},
          booktitle={International Workshop on Assistive Computer Vision and Robotics (ACVR)},
          year={2024},
          organization={ECCV}
        }