OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping

Tsinghua University · Central South University · Inspur Yunzhou Industrial Internet Co., Ltd
ACM MM 2025

OpenMap Overview

OpenMap constructs an open-vocabulary visual-language map. (a) OpenMap performs fine-grained, instance-level semantic mapping on navigation scenes from Matterport3D. (b) OpenMap accurately grounds generic instructions to their target instances; darker regions in the heatmaps indicate stronger alignment between the instruction and the predicted instance.

Abstract

Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fail to align free-form language commands with specific scene instances, owing to limitations in both cross-view instance-level semantic consistency and instruction interpretation.

We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose an LLM-assisted Instruction-to-Instance Grounding module that enables fine-grained instance selection by incorporating spatial context and expressive target descriptions.
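To make the consensus idea concrete, below is a minimal sketch, assuming each per-view instance fragment carries a 3D point set and an L2-normalized vision-language feature. The voxel-IoU overlap proxy, the thresholds, and all function names are illustrative assumptions, not the paper's exact formulation; the point is only that a merge requires geometry and semantics to agree.

import numpy as np

def voxel_iou(points_a: np.ndarray, points_b: np.ndarray, voxel: float = 0.05) -> float:
    """Approximate geometric overlap of two (N, 3) point sets via shared voxels."""
    va = {tuple(v) for v in np.floor(points_a / voxel).astype(int)}
    vb = {tuple(v) for v in np.floor(points_b / voxel).astype(int)}
    union = len(va | vb)
    return len(va & vb) / union if union else 0.0

def semantic_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine similarity of (assumed L2-normalized) vision-language features."""
    denom = np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8
    return float(feat_a @ feat_b / denom)

def consensus_merge(frag_a: dict, frag_b: dict,
                    geo_thresh: float = 0.3, sem_thresh: float = 0.8) -> bool:
    """Merge two fragments only when structure AND semantics both agree."""
    return (voxel_iou(frag_a["points"], frag_b["points"]) >= geo_thresh
            and semantic_similarity(frag_a["feature"], frag_b["feature"]) >= sem_thresh)

Requiring both signals guards against the failure modes of either one alone: nearby but distinct objects can overlap geometrically, and different views of different objects can look semantically similar.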

We evaluate OpenMap on ScanNet200 and Matterport3D, covering both semantic mapping and instruction-to-target retrieval tasks. Experimental results show that OpenMap outperforms state-of-the-art baselines in zero-shot settings, demonstrating the effectiveness of our method in bridging free-form language and 3D perception for embodied navigation.
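For intuition on the retrieval side, the following hedged sketch ranks mapped instances against an instruction by cosine similarity in a shared vision-language embedding space. It assumes the free-form instruction has already been rewritten by an LLM into an expressive target description and embedded with a matching text encoder (both steps elided); all names here are hypothetical.

import numpy as np

def ground_instruction(instr_embedding: np.ndarray,
                       instance_features: np.ndarray,
                       top_k: int = 3) -> np.ndarray:
    """Rank mapped instances by alignment with the instruction embedding.

    instr_embedding:   (D,) text feature for the LLM-expanded description.
    instance_features: (N, D) one fused feature per mapped 3D instance.
    Returns the indices of the top_k best-aligned instances.
    """
    q = instr_embedding / (np.linalg.norm(instr_embedding) + 1e-8)
    x = instance_features / (np.linalg.norm(instance_features, axis=1, keepdims=True) + 1e-8)
    scores = x @ q                       # cosine similarity per instance
    return np.argsort(-scores)[:top_k]   # most-aligned instances first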

Presentation

BibTeX

@inproceedings{li2025openmap,
  title={OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping},
  author={Li, Danyang and Yang, Zenghui and Qi, Guangpeng and Pang, Songtao and Shang, Guangyong and Ma, Qiang and Yang, Zheng},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={7444--7452},
  year={2025}
}