MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu^1,2,* Jiawei Chen^1,* Jiazhao Zhang^1,2,* Wenyao Zhang^2,3 Zekun Qi^2,4 Minghan Li^2 Jiahang Liu^1,2 Lu Yue^1,2 Zhizheng Zhang^2,† He Wang^1,2,†

¹Peking University ²Galbot ³Shanghai Jiao Tong University ⁴Tsinghua University

*Joint First Author †Corresponding Author

European Conference on Computer Vision (ECCV 2026)


We introduce MM-Nav, a 7B multi-view VLA model for robust visual navigation with 360° observation and 7 Hz inference. MM-Nav learns from 1.5M expert demonstrations collected from three privileged RL teachers specialized in reaching, squeezing, and avoiding, and co-trains on large-scale real-world VQA data to reduce the sim-to-real gap. On the InternVLA-N1 System-1 point-goal navigation benchmark, MM-Nav achieves an 88.1% success rate averaged over the Home and Commercial scenes.

MM-Nav overview, real-world deployment, and hardware setup

Thin Wire Avoidance

Dynamic Obstacle Avoidance

Cluttered Scenes Navigation

Outdoor Alley Navigation

Method

Our pipeline comprises two stages: (1) training multiple capability-specific RL teachers and
fine-tuning the initial VLA policy, and (2) iterative teacher-student refinement with capability-balanced data aggregation.

Stage 1

We first collect simulated expert trajectories and real-world VQA data
to initialize the VLA navigation policy.

Stage 2

We then deploy the student in simulation, query RL teachers online,
and update the data mixture with capability balancing.

The second stage is repeated over several iterations, with the training data ratio across capabilities rebalanced dynamically.
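
The capability balancing in Stage 2 can be pictured with a small sketch. The page does not state the balancing rule, so the weighting below (sampling each capability in proportion to the student's current failure rate on it) is an assumption for illustration only, as are the function name `balance_proportions`, the `floor` parameter, and the example success rates.

```python
from typing import Dict


def balance_proportions(success_rate: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Return a data-mixture proportion per capability.

    Hypothetical rule (not taken from the paper): weight each capability by
    the student's failure rate on it, apply a small floor so that no
    capability drops out of the mixture, then normalize to sum to 1.
    """
    weights = {cap: max(1.0 - sr, floor) for cap, sr in success_rate.items()}
    total = sum(weights.values())
    return {cap: w / total for cap, w in weights.items()}


# Made-up per-capability success rates of the student after one iteration.
student_sr = {"reaching": 0.9, "squeezing": 0.6, "avoiding": 0.7}
proportions = balance_proportions(student_sr)
```

Under this assumed rule, the capability the student currently handles worst (squeezing in the toy numbers) receives the largest share of the next iteration's data.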

Success rate across online training iterations

Capability-balanced data proportion across iterations

Public Benchmark Results

Results on the InternVLA-N1 System-1 point-goal navigation benchmark.
We report success rate (SR) and success weighted by path length (SPL), both in percent; higher is better.

Methods	Obs.	Home SR ↑	Home SPL ↑	Commercial SR ↑	Commercial SPL ↑
DD-PPO	RGB-D	0.4	0.4	5.3	5.2
iPlanner	Depth	43.0	40.6	54.6	52.8
ViPlanner	RGB-D	45.0	43.2	63.7	61.9
LoGoPlanner	RGB-D	57.3	52.4	67.1	63.9
InternVLA-N1(S1)	RGB-D	60.0	55.6	71.4	68.2
NavDP	RGB-D	60.3	54.7	74.1	70.5
SIDP	RGB-D	63.2	56.5	81.2	73.4
Mixed-RL	Depth	22.0	10.4	26.9	22.8
Ours (without VQA)	RGB	76.4	73.7	73.9	72.6
Ours (single view)	RGB	79.9	77.6	86.7	84.7
Ours	RGB	86.3	81.1	89.9	85.5
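
For readers unfamiliar with the two metrics in the table, the sketch below computes SR and SPL using the commonly used SPL definition, in which each successful episode is credited with the ratio of the shortest path length to the length of the path actually taken. The benchmark's own evaluation code is not shown on this page, so the function name `sr_and_spl`, its arguments, and the toy episodes are assumptions for illustration.

```python
from typing import Sequence, Tuple


def sr_and_spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> Tuple[float, float]:
    """Compute SR and SPL in percent over a set of episodes.

    Assumes every shortest path length is strictly positive.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("need at least one episode")
    sr = sum(successes) / n
    # A successful episode scores l / max(p, l); a failed episode scores 0.
    spl = sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ) / n
    return 100.0 * sr, 100.0 * spl


# Toy episodes (made-up numbers): two successes and one failure.
sr, spl = sr_and_spl(
    successes=[True, True, False],
    shortest_lengths=[10.0, 8.0, 5.0],
    path_lengths=[10.0, 16.0, 7.0],
)
```

SPL can never exceed SR, because each successful episode contributes at most 1; a small gap between the two columns indicates that successful paths are close to the shortest ones.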

Simulation Scenes

For each expert, we run 64 parallel robots in its dedicated simulation environment,
recording the observations, goals, and corresponding expert actions.
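
The recording step can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the `Sample` record, the `record_step` helper, the 2D goal layout, and the stand-in `toy_expert` are all hypothetical, and a real privileged RL teacher would act on simulator state rather than on the placeholder observation used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

NUM_ROBOTS = 64  # parallel robots per expert, as stated above


@dataclass
class Sample:
    observation: Sequence[float]  # placeholder for the robot's camera views
    goal: Tuple[float, float]     # point goal; the 2D layout is an assumption
    action: Sequence[float]       # action chosen by the expert


def record_step(
    observations: List[Sequence[float]],
    goals: List[Tuple[float, float]],
    expert: Callable[[Sequence[float], Tuple[float, float]], Sequence[float]],
) -> List[Sample]:
    """Query the expert once per robot and store one sample for each."""
    return [
        Sample(obs, goal, expert(obs, goal))
        for obs, goal in zip(observations, goals)
    ]


def toy_expert(obs, goal):
    """Stand-in expert that heads straight for the goal (not an RL policy)."""
    return [goal[0], goal[1]]


observations = [[0.0] * 4 for _ in range(NUM_ROBOTS)]
goals = [(float(i), 1.0) for i in range(NUM_ROBOTS)]
dataset = record_step(observations, goals, toy_expert)
```

Each call yields one (observation, goal, action) triple per robot, so a single simulation step across the 64 robots adds 64 samples to that expert's demonstration set.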

Reaching Scene

Squeezing Scene

Avoiding Scene

Indoor VQA Ablation

Outdoor VQA Ablation

Error Recovery

Narrow Passage Squeezing

Real-world Experiments

MM-Nav transfers zero-shot to diverse real-world scenes, including thin wire avoidance,
cluttered corridors, pedestrian avoidance, and outdoor alleys.

VQA Co-training Analysis

Adding real-world VQA data progressively aligns simulated and real-world latent distributions,
reducing the gap-axis distance from 1.86 to 0.38.

Latent distribution analysis for VQA co-training ratios

BibTeX

@article{xu2025mm,
  title={MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning},
  author={Xu, Tianyu and Chen, Jiawei and Zhang, Jiazhao and Zhang, Wenyao and Qi, Zekun and Li, Minghan and Liu, Jiahang and Yue, Lu and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2510.03142},
  year={2025}
}