Abstract
Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short of aligning free-form language commands with specific scene instances, due to limitations in both instance-level semantic consistency and instruction interpretation.
We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose an LLM-assisted Instruction-to-Instance Grounding module that enables fine-grained instance selection by incorporating spatial context and expressive target descriptions.
We evaluate OpenMap on ScanNet200 and Matterport3D, covering both semantic mapping and instruction-to-target retrieval tasks. Experimental results show that OpenMap outperforms state-of-the-art baselines in zero-shot settings, demonstrating the effectiveness of our method in bridging free-form language and 3D perception for embodied navigation.
Presentation
Semantic mapping results of OpenMap on Matterport3D.
Semantic mapping results of OpenMap on ScanNet200.
Semantic mapping and instruction grounding result of OpenMap on ScanNet200.
OpenMap Overview. OpenMap takes RGB-D inputs from multiple viewpoints and applies pretrained models to predict 2D masks and extract open-vocabulary features. During semantic mapping (§3.2), it iteratively aggregates 2D masks into 3D instances using a structural-semantic consensus constraint. During instruction grounding (§3.3), an LLM selects the target instance by reasoning over candidate proposals and scene context provided by OpenMap.
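As a rough illustration of the aggregation step in §3.2, the Python sketch below merges a back-projected 2D mask into the growing set of 3D instances only when geometric overlap and vision-language similarity agree. The thresholds, the voxel-overlap proxy, and the helper names are illustrative assumptions, not the released implementation.

import numpy as np

GEO_THRESH = 0.25   # assumed geometric-overlap threshold
SEM_THRESH = 0.75   # assumed feature-similarity threshold

def point_overlap(pts_a, pts_b, voxel=0.05):
    # Fraction of pts_a's occupied voxels also covered by pts_b (coarse 3D-overlap proxy).
    va = {tuple(p) for p in np.floor(pts_a / voxel).astype(int)}
    vb = {tuple(p) for p in np.floor(pts_b / voxel).astype(int)}
    return len(va & vb) / max(len(va), 1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def aggregate(instances, mask_points, mask_feat):
    # Merge one back-projected 2D mask (3D points + open-vocabulary feature) into the map.
    for inst in instances:
        geo = point_overlap(mask_points, inst["points"])
        sem = cosine(mask_feat, inst["feat"])
        if geo > GEO_THRESH and sem > SEM_THRESH:      # consensus: structure AND semantics
            inst["points"] = np.vstack([inst["points"], mask_points])
            inst["feat"] = (inst["feat"] + mask_feat) / 2.0   # running feature average
            return instances
    instances.append({"points": mask_points, "feat": mask_feat})  # otherwise start a new 3D instance
    return instances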
Instruction-to-Instance Grounding pipeline. In the first round, the LLM generates a precise description of the navigation target and retrieves candidate instances from OpenMap. In the second round, it reasons over the candidates and their surrounding context to infer the final target instance.
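The two-round grounding loop in §3.3 can be sketched as follows. The prompts, the generic llm() helper, the text_encoder call, and the candidate format are assumptions made for illustration and do not reflect the paper's exact interface.

import numpy as np

def ground_instruction(instruction, instances, text_encoder, llm, top_k=5):
    # Round 1: ask the LLM for an expressive description of the navigation target,
    # then retrieve candidate instances by vision-language similarity.
    target_desc = llm(
        f"Describe the object the user wants to reach in: '{instruction}'. "
        "Answer with a short noun phrase.")
    query = text_encoder(target_desc)                       # CLIP-style text embedding
    scored = sorted(instances,
                    key=lambda inst: float(query @ inst["feat"]),
                    reverse=True)[:top_k]

    # Round 2: let the LLM reason over the candidates and their spatial context
    # (here just labels and centroids) to pick the final target instance.
    context = "\n".join(
        f"[{i}] {inst.get('label', 'object')} at {np.round(inst['centroid'], 2).tolist()}"
        for i, inst in enumerate(scored))
    answer = llm(
        f"Instruction: {instruction}\nCandidate instances:\n{context}\n"
        "Reply with the index of the candidate that best matches the instruction.")
    return scored[int(answer.strip())]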
BibTeX
@inproceedings{li2025openmap,
  title={OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping},
  author={Li, Danyang and Yang, Zenghui and Qi, Guangpeng and Pang, Songtao and Shang, Guangyong and Ma, Qiang and Yang, Zheng},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={7444--7452},
  year={2025}
}