Can VLMs predict how each camera move changes the view, and plan many such moves ahead?
We call this capability view planning: it requires (1) understanding how a single
action transforms the view, and (2) composing many such transformations across
multi-turn plans to identify a target view.
We probe both abilities in ViewSuite, a 3D point-cloud environment built on real ScanNet scenes.
Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail
to compose it across multi-turn plans, with the gap widening as viewpoint distance grows.
To close this gap, we propose an iterative framework that alternates self-exploration
with view graph distillation. The key insight is that even failed trajectories
encode valid view transitions: moving from viewpoint A to B is useful supervision regardless of the
original goal. This improves Qwen2.5-VL-7B from 2.5% → 47.8% on interactive view
planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).
Three Diagnostic Tasks
ViewSuite probes view planning along two coupled axes:
understanding single-step view transitions, and
composing them across multi-turn plans.
P2V
Path‑to‑View
Single-turn · 4-way MCQ · forward simulation
Given an initial view, a top-down reference, and an action sequence, the model must predict the
resulting view from four options. Tests whether the model can mentally simulate viewpoint transitions.
Action: [turn_right × 5] ·
Step size: 0.5 m / 30° ·
GPT-5.4 Pro answer: C (incorrect)
[Figure: initial view, top-down reference, and candidate views A–D; GPT-5.4 Pro picks C]
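For intuition, forward simulation amounts to composing per-step pose updates. Below is a toy sketch under a ground-plane (yaw-only) simplification; the step constants match the example above, but the action set and pose format are illustrative, not ViewSuite's actual implementation.

import math

# Toy forward simulation on the ground plane; vertical actions
# (look_up, move_up) are omitted for brevity.
STEP_M, STEP_DEG = 0.5, 30.0   # the 0.5 m / 30° step convention

def simulate(pose, actions):
    """pose = (x, y, yaw_deg); return the pose after executing the actions."""
    x, y, yaw = pose
    for a in actions:
        if a == "turn_left":
            yaw = (yaw + STEP_DEG) % 360.0
        elif a == "turn_right":
            yaw = (yaw - STEP_DEG) % 360.0
        elif a in ("move_forward", "move_left", "move_right"):
            offset = {"move_forward": 0.0, "move_left": 90.0, "move_right": -90.0}[a]
            x += STEP_M * math.cos(math.radians(yaw + offset))
            y += STEP_M * math.sin(math.radians(yaw + offset))
    return (x, y, yaw)

# simulate((0, 0, 0), ["turn_right"] * 5) -> (0, 0, 210.0): position unchanged,
# camera rotated 150 degrees clockwise, so the model must pick the matching view.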
V2P
View‑to‑Path
Single-turn · 4-way MCQ · inverse reasoning
Given initial and target views plus a top-down view, identify which action sequence was executed,
again from four options. P2V and V2P together probe view-action understanding in both directions.
Options:
A. [look_up, move_forward, move_left]
B. [turn_left × 5, move_left]
C. [turn_right × 2, move_forward, move_left × 5, move_up]
D. [turn_left × 2]
GPT-5.4 Pro answer: B (correct)
[Figure: initial view, top-down reference, and target view]
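In principle, the inverse direction can be brute-forced with the same toy simulator from the P2V card: simulate each candidate sequence and pick the one whose end pose matches the target. The sketch below is illustrative only (it reuses simulate() from above and ignores vertical actions); the benchmark itself scores the VLM's multiple-choice answer.

def identify_path(initial_pose, target_pose, options, tol=1e-6):
    """Return the label of the option whose simulated end pose matches the target."""
    tx, ty, tyaw = target_pose
    for label, actions in options.items():
        x, y, yaw = simulate(initial_pose, actions)
        dyaw = abs(yaw - tyaw) % 360.0
        if abs(x - tx) < tol and abs(y - ty) < tol and min(dyaw, 360.0 - dyaw) < tol:
            return label
    return None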
IVP
Interactive View Planning
Multi-turn · 6-DoF estimate · the composition stress-test
Given initial, target, and top-down views, the agent issues camera-control actions per turn,
observes the resulting view, and within a turn budget submits a 6-DoF estimate of where the target
view was taken. Unlike single-turn P2V/V2P, IVP requires planning a sequence of view changes.
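To make the protocol concrete, here is a minimal sketch of one IVP episode against a hypothetical Gym-style wrapper; the method names (reset, step, submit) and observation keys are assumptions for illustration, not ViewSuite's actual API.

# Minimal sketch of the IVP interaction loop (hypothetical API).
def run_ivp_episode(agent, env, max_turns=20):
    # Assumed observation: {'initial': img, 'target': img, 'top_down': img, 'current': img}
    obs = env.reset()
    for _ in range(max_turns):
        out = agent.act(obs)           # an action string or a 6-DoF pose dict
        if isinstance(out, dict):      # submitting an estimate ends the episode
            return env.submit(out)     # assumed keys: x, y, z, yaw, pitch, roll
        obs = env.step(out)            # e.g. "move_forward", "turn_left"
    # Turn budget exhausted: force a final estimate.
    return env.submit(agent.final_estimate(obs))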
Method
Each iteration of our framework alternates two stages. In the self-exploration stage, the agent
interacts with ViewSuite environments and its trajectories are incrementally compressed into a
view graph. In the view graph distillation stage, paths are sampled
from this graph and reformulated into diverse view-planning demonstrations used to fine-tune the
policy. The resulting model initializes the next self-exploration stage. Concretely, one iteration
comprises the four components below.
1. RL Stage
The agent runs IVP rollouts on ViewSuite environments with PPO. Reward is sparse:
+1 when the submitted target estimate is within 0.5 m / 30° of
the ground truth, plus a small format reward. Even with a success rate near 2.5%,
every rollout is useful, since it streams into the graph builder.
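For concreteness, a minimal sketch of this reward, assuming poses are xyz positions (meters) plus Euler angles (degrees); the format-reward magnitude here is an arbitrary placeholder.

import numpy as np

POS_THRESH_M = 0.5     # position tolerance (meters)
ROT_THRESH_DEG = 30.0  # orientation tolerance (degrees)

def ivp_reward(pred_pos, pred_rot, gt_pos, gt_rot, well_formed, format_bonus=0.1):
    """+1 if the 6-DoF estimate is within 0.5 m / 30 deg of the target pose,
    plus a small bonus when the response parses correctly."""
    pos_err = np.linalg.norm(np.asarray(pred_pos) - np.asarray(gt_pos))
    # Wrap angular differences into [-180, 180] degrees before taking magnitudes.
    ang_err = np.abs((np.asarray(pred_rot) - np.asarray(gt_rot) + 180.0) % 360.0 - 180.0)
    success = pos_err <= POS_THRESH_M and np.all(ang_err <= ROT_THRESH_DEG)
    return (1.0 if success else 0.0) + (format_bonus if well_formed else 0.0)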
2. Graph Construction
A background process incrementally compresses every completed trajectory into a
view graph. Nodes are viewpoints (with their rendered views);
edges are actions between viewpoints. Nodes and
edges are deduplicated via viewpoint similarity, so successful and failed episodes
alike contribute to one shared structured graph.
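An illustrative sketch of the incremental builder using networkx; the deduplication thresholds and the [x, y, z, yaw, pitch, roll] pose layout are assumptions, and a real implementation would likely replace the linear node scan with a spatial index.

import numpy as np
import networkx as nx

class ViewGraph:
    def __init__(self, pos_tol=0.25, rot_tol=15.0):
        self.g = nx.DiGraph()
        self.pos_tol, self.rot_tol = pos_tol, rot_tol

    def _find_node(self, pose):
        # Deduplicate: reuse an existing node whose pose is within tolerance.
        for n, d in self.g.nodes(data=True):
            if (np.linalg.norm(d["pose"][:3] - pose[:3]) <= self.pos_tol and
                np.abs((d["pose"][3:] - pose[3:] + 180) % 360 - 180).max() <= self.rot_tol):
                return n
        return None

    def add_trajectory(self, poses, views, actions):
        """Compress one trajectory (successful or failed) into the shared graph."""
        prev = None
        for pose, view, act in zip(poses, views, [None] + list(actions)):
            pose = np.asarray(pose, dtype=float)
            node = self._find_node(pose)
            if node is None:
                node = self.g.number_of_nodes()
                self.g.add_node(node, pose=pose, view=view)
            if prev is not None and act is not None and not self.g.has_edge(prev, node):
                self.g.add_edge(prev, node, action=act)
            prev = node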
3. Task Reformulation
Any path P = (v₀, a₁, v₁, …, a_K, v_K) in the graph yields a
valid IVP demonstration regardless of whether the original episode succeeded: end node →
target, start node → initial view, action chain → labeled plan. This is the lever that
lets us learn from failed episodes.
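A sketch of this relabeling over the ViewGraph sketch above; the demonstration schema and path-length limits are illustrative.

import random

def sample_ivp_demo(vg, min_edges=2, max_edges=8):
    """Random-walk a path (v0, a1, v1, ..., a_K, v_K) through the view graph and
    relabel it: start node -> initial view, end node -> target, actions -> plan."""
    if vg.g.number_of_nodes() == 0:
        return None
    path = [random.choice(list(vg.g.nodes))]
    for _ in range(random.randint(min_edges, max_edges)):
        succ = list(vg.g.successors(path[-1]))
        if not succ:
            break
        path.append(random.choice(succ))
    if len(path) <= min_edges:   # too short to be a useful demonstration
        return None
    return {
        "initial_view": vg.g.nodes[path[0]]["view"],
        "target_view":  vg.g.nodes[path[-1]]["view"],
        "target_pose":  vg.g.nodes[path[-1]]["pose"],   # 6-DoF label
        "plan": [vg.g.edges[u, v]["action"] for u, v in zip(path, path[1:])],
    }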
4. SFT Stage
Sampled paths are reformulated into supervised view-planning demonstrations and used to
fine-tune the policy with standard cross-entropy. The resulting model initializes the
next RL stage, kicking off the next iteration, so training alternates RL → SFT → RL → SFT.
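Putting the pieces together, a high-level sketch of the alternation; explore_fn and sft_fn are stand-ins for the PPO rollout stage and the cross-entropy fine-tuning stage above, not actual training code.

def train(policy, explore_fn, sft_fn, n_iterations=3, n_demo_samples=5000):
    vg = ViewGraph()
    for _ in range(n_iterations):
        # Self-exploration: every rollout, pass or fail, streams into the graph.
        for traj in explore_fn(policy):
            vg.add_trajectory(traj["poses"], traj["views"], traj["actions"])
        # Distillation: sampled paths become supervised demonstrations.
        demos = [d for d in (sample_ivp_demo(vg) for _ in range(n_demo_samples)) if d]
        policy = sft_fn(policy, demos)   # standard cross-entropy SFT
    return policy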
Results
Frontier VLM benchmark on ViewSuite-5K test (530 view pairs)
Accuracy / Success Rate (%) on Short (d < 3) and Long (d ≥ 3) viewpoint-distance splits.
Best in each column is marked with * (IVP = Interactive View Planning).

Model            |   Path-to-View    |   View-to-Path    |        IVP        | Overall
                 | Short  Long   All | Short  Long   All | Short  Long   All |
-----------------+-------------------+-------------------+-------------------+--------
Random Response  |  20.7  24.6  23.3 |  24.3  26.5  25.7 |   2.2   0.0   0.8 |   16.6
Proprietary Models
GPT-5.4 Pro      | *70.7 *43.8 *53.1 | *72.4  39.0 *50.7 |  32.6  11.0  18.5 |  *40.8
Gemini 3.1 Pro   |  63.6  40.9  48.8 |  53.0 *47.7  49.5 |  28.8 *17.4 *21.4 |   39.9
GPT-5.4          |  57.1  42.9  47.8 |  60.5  37.5  45.6 | *33.7   7.5  16.6 |   36.7
Grok 4.20 Beta   |  61.4  38.0  46.1 |  44.9  44.5  44.6 |  17.4   2.9   7.9 |   32.9
GPT-5.1          |  60.3  35.1  43.9 |  52.4  33.4  40.1 |  12.0   3.2   6.2 |   30.1
Claude Opus 4.6  |  46.7  28.4  34.8 |  47.6  38.4  41.6 |  23.9   3.8  10.8 |   29.0
Gemini 3 Pro     |  50.5  31.0  37.8 |  44.9  35.5  38.8 |  13.6   7.0   9.3 |   28.6
Open-Weight Models
Qwen3.5-397B     |  57.6  30.1  39.7 |  44.3  30.8  35.5 |  12.5   0.0   4.3 |   26.5
GLM-4.6V         |  36.4  23.2  27.8 |  31.4  29.7  30.2 |   9.2   1.2   4.0 |   20.7
Qwen2.5-VL-72B   |  28.3  29.3  28.9 |  35.7  29.9  31.9 |   2.2   0.6   1.1 |   20.7
Qwen3-VL-32B     |  27.2  27.5  27.4 |  41.1  28.5  32.9 |   4.3   0.0   1.5 |   20.6
Kimi K2.5        |  35.9  24.6  28.5 |  18.4  29.4  25.5 |   4.9   1.2   2.5 |   18.8
Qwen2.5-VL-7B    |  23.9  32.5  29.5 |  27.0  22.7  24.2 |   7.1   0.0   2.5 |   18.7
Training results on IVP — Qwen2.5-VL-7B base
Success rate (%) under the calibrated 0.5 m / 30° threshold. Our framework lifts a 7B model
from 2.5% → 47.8%, beating every proprietary VLM evaluated.
1. Single-turn understanding ≫ multi-turn planning. The best VLMs reach ~70% on short-horizon P2V/V2P
but collapse to ~21% at best on Interactive View Planning. Most models score below 10%;
on long-horizon samples most fall below 3%.
P2V/V2P accuracy degrades primarily with rotation distance (cumulative rotations are hard to simulate
mentally). IVP reverses this: success collapses with position distance, since 3D translation requires
spatial-layout understanding and path planning beyond simple orientation control.
2. Failed trajectories still teach view transitions
Direct PPO plateaus at 3.2%; GRPO with reward-variance filtering reaches only
5.2%; even iterating PPO with SFT on the small set of successful trajectories
(Success-Only Bootstrapping) gets to 6.2%. The breakthrough comes from
recognizing that even failed trajectories encode valid view transitions: moving from viewpoint
A to B is meaningful supervision regardless of the original goal.
3. Self-Exploration + View Graph Distillation closes the gap
Our iterative framework alternates self-exploration with view graph distillation. Condensing all
exploration trajectories (including failures) into a structured graph, then reformulating sampled
paths into supervised view-planning demonstrations, takes Qwen2.5-VL-7B from
2.5% → 47.8% on IVP, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).
Random-graph ablation collapses to 13.0%, confirming on-policy graph construction is critical.
4. The trained agent learns to explore, then approach
Tracked 3D point-cloud coverage reveals a clean two-phase strategy: scene coverage grows rapidly
in early turns as the agent looks around, then plateaus while target intersection ratio
accelerates in the middle turns as the agent moves toward the target. Base and frontier models
show flat or erratic target coverage instead.
5. View-planning priors transfer to other view-related tasks
Under identical GRPO post-training, our trained model beats its base counterpart by
8 to 12 points on P2V/V2P (within ViewSuite) and by ~10 points on
the external MindCube benchmark (no shared scenes / actions / rendering). Interactive view planning
is not a narrow skill: the learned spatial priors strengthen view-dependent reasoning both within
and beyond ViewSuite.
Cite
If you use ViewSuite or its trained models, please cite our paper.
@article{wang2026viewsuite,
  title  = {VLMs Walk the Scene: View Planning via Scene Self-Exploration},
  author = {Wang, Kangrui and Li, Linjie and Yang, Zhengyuan and Chen, Shiqi and Wang, Zihan and Fei-Fei, Li and Wu, Jiajun and Guibas, Leonidas and Wang, Lijuan and Li, Manling},
  year   = {2026}
}