MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu1,2,*   Jiawei Chen1,*   Jiazhao Zhang1,2,*   Wenyao Zhang2,3   Zekun Qi2,4   Minghan Li2   Jiahang Liu1,2   Lu Yue1,2  Zhizheng Zhang2,†   He Wang1,2,†
1Peking University       2Galbot       3Shanghai Jiao Tong University       4Tsinghua University
*Joint First Author      †Corresponding Author
European Conference on Computer Vision (ECCV 2026)

Summary Video

We introduce MM-Nav, a 7B-parameter multi-view VLA model for robust visual navigation with 360° observation and 7 Hz inference. MM-Nav learns from 1.5M expert demonstrations collected from three privileged RL teachers, each specialized in one capability (reaching, squeezing, or avoiding), and is co-trained on large-scale real-world VQA data to reduce the sim-to-real gap. On the InternVLA-N1 System-1 point-goal navigation benchmark, MM-Nav achieves an 88.1% success rate (the mean of its Home and Commercial results, 86.3% and 89.9%).

MM-Nav overview, real-world deployment, and hardware setup

Method

Our pipeline comprises two stages: (1) training multiple capability-specific RL teachers and
performing initial VLA fine-tuning, and (2) iterative teacher-student refinement with capability-balanced data aggregation.

MM-Nav training pipeline

Stage 1

We first collect simulated expert trajectories and real-world VQA data
to initialize the VLA navigation policy.

Expert and VQA data composition

Stage 2

We then deploy the student in simulation, query RL teachers online,
and update the data mixture with capability balancing.
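
The online collection step can be illustrated with a minimal sketch. It assumes a DAgger-style scheme in which the student's actions drive the rollout while the teacher's actions are stored as labels; the source does not spell out the exact procedure. The 1-D state, the `Sample` record, and the toy teacher and student policies are all hypothetical stand-ins for the real observations and models.

```python
from dataclasses import dataclass
from typing import Callable, List

State = float   # toy 1-D robot position (hypothetical stand-in for camera observations)
Action = float  # toy 1-D velocity command

Policy = Callable[[State, State], Action]  # maps (state, goal) to an action


@dataclass
class Sample:
    state: State
    goal: State
    teacher_action: Action  # label provided by the RL teacher
    capability: str         # e.g. "reaching", "squeezing", "avoiding"


def collect_online(student: Policy, teacher: Policy, capability: str,
                   start: State, goal: State, steps: int = 10) -> List[Sample]:
    """Roll out the student, storing the teacher's action at every visited state."""
    samples: List[Sample] = []
    state = start
    for _ in range(steps):
        # Query the teacher at the state the student actually reached.
        samples.append(Sample(state, goal, teacher(state, goal), capability))
        # Advance the toy simulation with the student's own action.
        state = state + student(state, goal)
    return samples


def toy_teacher(state: State, goal: State) -> Action:
    """Hypothetical expert: move toward the goal at a speed of at most 1.0."""
    return max(-1.0, min(1.0, goal - state))


def toy_student(state: State, goal: State) -> Action:
    """Hypothetical imperfect student: half as fast as the teacher."""
    return 0.5 * toy_teacher(state, goal)


batch = collect_online(toy_student, toy_teacher, "reaching",
                       start=0.0, goal=4.0, steps=3)
```

Because the labels are gathered at states the student visits, the aggregated data covers the student's own mistakes rather than only the teacher's trajectories.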

Iterative expert data collection stages


Stage 2 is repeated iteratively, and the proportion of training data per capability is rebalanced dynamically at each iteration.
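
One way to picture capability balancing is sketched below. The weighting rule is an assumption made for illustration, not the method described in the source: each capability is weighted by the student's current failure rate, with a small floor so that no capability disappears from the mixture. The function name `balanced_mixture`, the `floor` parameter, and the example success rates are hypothetical.

```python
from typing import Dict


def balanced_mixture(success_rate: Dict[str, float],
                     floor: float = 0.05) -> Dict[str, float]:
    """Return per-capability sampling proportions that sum to 1.

    Assumed rule (not taken from the source): weight each capability by its
    failure rate, clamp to a minimum of `floor`, then normalize.
    """
    if not success_rate:
        raise ValueError("need at least one capability")
    raw = {name: max(1.0 - sr, floor) for name, sr in success_rate.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical per-capability success rates measured after one iteration.
mix = balanced_mixture({"reaching": 0.9, "squeezing": 0.6, "avoiding": 0.7})
```

Under this rule, the capability on which the student currently fails most often receives the largest share of newly aggregated data in the next iteration.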

Success rate across online training iterations
Capability-balanced data proportion across iterations

Public Benchmark Results

Results on the InternVLA-N1 System-1 point-goal navigation benchmark.
We report success rate (SR) and success weighted by path length (SPL).

Methods               Obs.    Home              Commercial
                              SR ↑     SPL ↑    SR ↑     SPL ↑
DD-PPO                RGB-D   0.4      0.4      5.3      5.2
iPlanner              Depth   43.0     40.6     54.6     52.8
ViPlanner             RGB-D   45.0     43.2     63.7     61.9
LoGoPlanner           RGB-D   57.3     52.4     67.1     63.9
InternVLA-N1 (S1)     RGB-D   60.0     55.6     71.4     68.2
NavDP                 RGB-D   60.3     54.7     74.1     70.5
SIDP                  RGB-D   63.2     56.5     81.2     73.4
Mixed-RL              Depth   22.0     10.4     26.9     22.8
Ours (without VQA)    RGB     76.4     73.7     73.9     72.6
Ours (single view)    RGB     79.9     77.6     86.7     84.7
Ours                  RGB     86.3     81.1     89.9     85.5
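
For readers unfamiliar with the two metrics, the sketch below computes them under the commonly used definition of SPL: each successful episode contributes the ratio of the shortest path length to the larger of the actual and shortest path lengths, and failed episodes contribute zero. Whether the benchmark uses exactly this formula is an assumption, as is reporting both values as percentages. The episode tuples are made-up illustrative data, not results from the paper.

```python
from typing import List, Tuple

# (success, shortest_path_length, actual_path_length); lengths in meters
Episode = Tuple[bool, float, float]


def sr_and_spl(episodes: List[Episode]) -> Tuple[float, float]:
    """Return (SR, SPL) as percentages over a list of episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    successes = 0
    spl_sum = 0.0
    for success, shortest, actual in episodes:
        if not success:
            continue  # failed episodes add nothing to either metric
        successes += 1
        denom = max(actual, shortest)
        if denom > 0:
            # Path efficiency in [0, 1]; equals 1 when the agent takes the shortest path.
            spl_sum += shortest / denom
    return 100.0 * successes / n, 100.0 * spl_sum / n


# Illustrative episodes only.
episodes = [
    (True, 10.0, 10.0),   # optimal path
    (True, 10.0, 20.0),   # success, but twice the shortest length
    (False, 10.0, 5.0),   # failure
    (True, 8.0, 8.0),     # optimal path
]
sr, spl = sr_and_spl(episodes)
```

SPL can never exceed SR, which is consistent with every row of the table above.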

Simulation Scenes

For each expert, we run 64 parallel robots in that expert's simulation environment,
recording the observations, goals, and corresponding expert actions.

Reaching Scene

Squeezing Scene

Avoiding Scene

Real-world Experiments

MM-Nav transfers zero-shot to diverse real-world scenes, including thin wire avoidance,
cluttered corridors, pedestrian avoidance, and outdoor alleys.

MM-Nav real-world experiments and VQA ablation

VQA Co-training Analysis

Adding real-world VQA data progressively aligns simulated and real-world latent distributions,
reducing the gap-axis distance from 1.86 to 0.38.

Latent distribution analysis for VQA co-training ratios

BibTeX

@article{xu2025mm,
  title={MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning},
  author={Xu, Tianyu and Chen, Jiawei and Zhang, Jiazhao and Zhang, Wenyao and Qi, Zekun and Li, Minghan and Liu, Jiahang and Yue, Lu and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2510.03142},
  year={2025}
}