ViewSuite

VLMs Walk the Scene:
View Planning via Scene Self-Exploration

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view.

We probe both abilities in ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows.

To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that even failed trajectories encode valid view transitions: moving from viewpoint A to B is useful supervision regardless of the original goal. This improves Qwen2.5-VL-7B from 2.5% → 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).

ViewSuite environment overview

Three Diagnostic Tasks

ViewSuite probes view planning along two coupled axes: understanding single-step view transitions, and composing them across multi-turn plans.

P2V

Path‑to‑View

Single-turn · 4-way MCQ · forward simulation

Given an initial view, a top-down reference, and an action sequence, the model must predict the resulting view from four options. Tests whether the model can mentally simulate viewpoint transitions.

Action: [turn_right × 5]  ·  Step: 0.5 m / 30°  ·  GPT-5.4 Pro answer: C (incorrect)

[Figure: initial view, top-down view, and candidate result views A–D; GPT-5.4 Pro picked C (incorrect).]
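To make the forward simulation concrete, here is a minimal pose-update sketch. The step sizes and the action names come from the examples on this page (turn_right, move_forward, move_left, look_up, …); the exact 12-action vocabulary and the camera convention are our assumptions, not the released environment API.

```python
import math

STEP_M, STEP_DEG = 0.5, 30.0  # translation / rotation step from the task spec

def apply_action(pose, action):
    """Apply one camera action to pose = {x, y, z, yaw, pitch} (angles in degrees).
    Covers the actions named on this page; the full 12-action set is assumed."""
    x, y, z, yaw, pitch = (pose[k] for k in ("x", "y", "z", "yaw", "pitch"))
    fwd = math.radians(yaw)  # heading in the ground plane
    if action == "move_forward":
        x += STEP_M * math.cos(fwd); y += STEP_M * math.sin(fwd)
    elif action == "move_back":
        x -= STEP_M * math.cos(fwd); y -= STEP_M * math.sin(fwd)
    elif action == "move_left":   # strafe 90° counter-clockwise of heading
        x -= STEP_M * math.sin(fwd); y += STEP_M * math.cos(fwd)
    elif action == "move_right":
        x += STEP_M * math.sin(fwd); y -= STEP_M * math.cos(fwd)
    elif action == "move_up":
        z += STEP_M
    elif action == "move_down":
        z -= STEP_M
    elif action == "turn_left":
        yaw = (yaw + STEP_DEG) % 360
    elif action == "turn_right":
        yaw = (yaw - STEP_DEG) % 360
    elif action == "look_up":
        pitch = min(pitch + STEP_DEG, 90.0)
    elif action == "look_down":
        pitch = max(pitch - STEP_DEG, -90.0)
    else:
        raise ValueError(f"unknown action: {action}")
    return {"x": x, "y": y, "z": z, "yaw": yaw, "pitch": pitch}

def simulate(pose, actions):
    for a in actions:
        pose = apply_action(pose, a)
    return pose

# [turn_right × 5] rotates the camera 150° in place, as in the example above.
print(simulate({"x": 0, "y": 0, "z": 1.5, "yaw": 0, "pitch": 0}, ["turn_right"] * 5))
```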
V2P

View‑to‑Path

Single-turn · 4-way MCQ · inverse reasoning

Given initial and target views plus a top-down view, identify which action sequence was executed, again from four options. P2V and V2P together probe view-action understanding in both directions.

Options: A. [look_up, move_forward, move_left]  ·  B. [turn_left × 5, move_left]  ·  C. [turn_right × 2, move_forward, move_left × 5, move_up]  ·  D. [turn_left × 2]
GPT-5.4 Pro answer: B (correct)

[Figure: initial, top-down, and target views.]
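In pose space (as opposed to pixel space, where the model must actually reason), V2P reduces to running the forward simulator on each candidate sequence and picking the one whose end pose matches the target. A sketch reusing `simulate` from the P2V block above; the 1 m ≈ 30° weighting is a heuristic of ours, not from the paper.

```python
import math

def pose_error(p, q):
    """Position (m) and yaw (deg) error between two poses."""
    d = math.dist((p["x"], p["y"], p["z"]), (q["x"], q["y"], q["z"]))
    dyaw = abs((p["yaw"] - q["yaw"] + 180) % 360 - 180)  # wrap to [-180, 180]
    return d, dyaw

def rank_options(initial_pose, target_pose, options):
    """options: {'A': [...actions], ...}. Returns the label whose simulated
    end pose lands closest to the target pose."""
    def score(actions):
        d, dyaw = pose_error(simulate(dict(initial_pose), actions), target_pose)
        return d + dyaw / 30.0  # trade 30° of rotation against 1 m (heuristic)
    return min(options, key=lambda label: score(options[label]))
```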
IVP

Interactive View Planning

Multi-turn · 6-DoF estimate · the composition stress-test

Given initial, target, and top-down views, the agent issues camera-control actions per turn, observes the resulting view, and within a turn budget submits a 6-DoF estimate of where the target view was taken. Unlike single-turn P2V/V2P, IVP requires planning a sequence of view changes.

Scene: scene0474_00  ·  Threshold: 0.5 m / 30°
Trained Qwen2.5-VL-7B plan: step 1 turn_right · step 2 turn_right × 2 · step 3 turn_right | look_down · step 4 move_left · step 5 move_forward · step 6 submit answer. Final error 0.061 m / 0°. Success.

[Figure: target, initial, and top-down views, then the view after each of steps 1–5; step 5 reaches the target (success).]
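The interaction protocol, as we read it, fits in a short loop. `env` and `agent` below are hypothetical stand-ins, not the released ViewSuite API.

```python
def run_ivp_episode(env, agent, max_turns=10):
    """One Interactive View Planning episode (hypothetical interfaces)."""
    obs = env.reset()                    # {'initial', 'target', 'top_down'} views
    history = [obs["initial"]]
    for _ in range(max_turns):
        move = agent.act(obs["target"], obs["top_down"], history)
        if move["type"] == "submit":     # 6-DoF estimate of the target viewpoint
            return env.check(move["pose"], pos_thresh=0.5, rot_thresh=30.0)
        history.append(env.step(move["actions"]))  # act, then observe the new view
    return False                         # turn budget exhausted
```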
286 ScanNet scenes · ~55K view pairs · ~165K task instances · 12 actions (6-DoF control) · 0.5 m / 30° IVP success threshold

Self-Exploration with View Graph Distillation

Each iteration alternates two stages. In the self-exploration stage, the agent interacts with ViewSuite environments and its trajectories are incrementally compressed into a view graph. In the view graph distillation stage, paths are sampled from this graph and reformulated into diverse view-planning demonstrations used to fine-tune the policy. The resulting model initializes the next self-exploration stage.

Iterative training pipeline (self-exploration + view graph distillation)
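At pseudocode level, one training run is the loop below, previewing the four stages that follow; every helper is an illustrative stand-in for the component it names.

```python
def train(policy, envs, iterations=3):
    """Alternate self-exploration (RL + graph building) with view graph
    distillation (path sampling + SFT). All helpers are stand-ins."""
    graph = ViewGraph()
    for _ in range(iterations):
        # Self-exploration: PPO rollouts; every trajectory feeds the graph.
        for traj in run_ppo(policy, envs):
            graph.add_trajectory(traj)              # success and failure alike
        # View graph distillation: sample paths, reformulate, fine-tune.
        demos = [reformulate(p) for p in graph.sample_paths()]
        policy = sft(policy, demos)                 # cross-entropy SFT
    return policy
```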

1. RL Stage

The agent runs IVP rollouts in ViewSuite environments with PPO. The reward is sparse: +1 when the submitted target estimate falls within 0.5 m / 30° of the ground truth, plus a small format reward. Even with a success rate near 2.5%, every rollout is useful, since each one streams into the graph builder.
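The reward is simple enough to state in code. A sketch reusing `pose_error` from the V2P block above; the format-reward weight is our assumption.

```python
def ivp_reward(submitted_pose, gt_pose, well_formatted,
               pos_thresh=0.5, rot_thresh=30.0, format_bonus=0.1):
    """Sparse IVP reward: +1 within 0.5 m / 30° of ground truth,
    plus a small format reward (weight assumed)."""
    d_pos, d_rot = pose_error(submitted_pose, gt_pose)
    success = 1.0 if (d_pos <= pos_thresh and d_rot <= rot_thresh) else 0.0
    return success + (format_bonus if well_formatted else 0.0)
```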

2. Graph Construction

A background process incrementally compresses every completed trajectory into a view graph. Nodes are viewpoints (with their rendered views); edges are actions between viewpoints. Nodes and edges are deduplicated via viewpoint similarity, so success and failure alike contribute to one shared structured graph.
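A minimal in-memory version of the graph builder (`pose_error` from the V2P sketch); the similarity thresholds used for dedup are assumptions, not the paper's exact criterion.

```python
class ViewGraph:
    """Compress trajectories into a view graph: nodes are viewpoints
    (with their rendered views), edges are actions. Dedup thresholds assumed."""
    def __init__(self, pos_eps=0.25, rot_eps=15.0):
        self.nodes = []          # deduplicated viewpoint poses
        self.edges = {}          # (u, v) -> action
        self.pos_eps, self.rot_eps = pos_eps, rot_eps

    def _find_or_add(self, pose):
        for i, p in enumerate(self.nodes):   # merge near-duplicate viewpoints
            d, dyaw = pose_error(p, pose)
            if d <= self.pos_eps and dyaw <= self.rot_eps:
                return i
        self.nodes.append(pose)
        return len(self.nodes) - 1

    def add_trajectory(self, traj):
        """traj: [(pose, action, next_pose), ...]; failed episodes contribute equally."""
        for pose, action, next_pose in traj:
            u, v = self._find_or_add(pose), self._find_or_add(next_pose)
            self.edges.setdefault((u, v), action)
```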

3. Task Reformulation

Any path P = (v₀, a₁, v₁, …, a_K, v_K) in the graph yields a valid IVP demonstration regardless of whether the original episode succeeded: end node → target view, start node → initial view, action chain → labeled plan. This is the lever that lets us learn from failed episodes (see the sketch below).
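Concretely, reformulation is just relabeling, which is why it costs nothing extra per path. The field names below are hypothetical:

```python
def reformulate(path):
    """Turn a graph path P = (v_0, a_1, v_1, ..., a_K, v_K) into an IVP demo:
    start node -> initial view, end node -> target view, actions -> labeled plan.
    Valid whether or not the episode that produced the path succeeded."""
    return {
        "initial_view": path["views"][0],
        "target_view":  path["views"][-1],
        "plan":         path["actions"],      # supervised action sequence
        "target_pose":  path["poses"][-1],    # ground truth for the 6-DoF submission
    }
```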

4. SFT Stage

Sampled paths are reformulated into supervised view-planning demonstrations and used to fine-tune the policy with standard cross-entropy, as in the sketch below. The resulting model initializes the next RL stage, closing the loop: training alternates RL → SFT → RL → SFT.
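The SFT stage itself is standard. A generic PyTorch sketch; `policy.encode` and `policy.forward_logits` are placeholders for a VLM's preprocessing and forward pass.

```python
import torch
import torch.nn.functional as F

def sft(policy, demos, lr=1e-5, epochs=1):
    """Cross-entropy fine-tuning on reformulated demonstrations (sketch)."""
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for demo in demos:
            tokens, labels = policy.encode(demo)    # views + prompt -> tokens
            logits = policy.forward_logits(tokens)  # (seq_len, vocab_size)
            loss = F.cross_entropy(logits, labels, ignore_index=-100)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```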

Results

Frontier VLM benchmark on ViewSuite-5K test (530 view pairs)

Accuracy / Success Rate (%) on Short (d < 3) and Long (d ≥ 3) splits. Best in each column is bold.

| Model | P2V Short | P2V Long | P2V All | V2P Short | V2P Long | V2P All | IVP Short | IVP Long | IVP All | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Random Response | 20.7 | 24.6 | 23.3 | 24.3 | 26.5 | 25.7 | 2.2 | 0.0 | 0.8 | 16.6 |
| *Proprietary Models* |  |  |  |  |  |  |  |  |  |  |
| GPT-5.4 Pro | **70.7** | **43.8** | **53.1** | **72.4** | 39.0 | **50.7** | 32.6 | 11.0 | 18.5 | **40.8** |
| Gemini 3.1 Pro | 63.6 | 40.9 | 48.8 | 53.0 | **47.7** | 49.5 | 28.8 | **17.4** | **21.4** | 39.9 |
| GPT-5.4 | 57.1 | 42.9 | 47.8 | 60.5 | 37.5 | 45.6 | **33.7** | 7.5 | 16.6 | 36.7 |
| Grok 4.20 Beta | 61.4 | 38.0 | 46.1 | 44.9 | 44.5 | 44.6 | 17.4 | 2.9 | 7.9 | 32.9 |
| GPT-5.1 | 60.3 | 35.1 | 43.9 | 52.4 | 33.4 | 40.1 | 12.0 | 3.2 | 6.2 | 30.1 |
| Claude Opus 4.6 | 46.7 | 28.4 | 34.8 | 47.6 | 38.4 | 41.6 | 23.9 | 3.8 | 10.8 | 29.0 |
| Gemini 3 Pro | 50.5 | 31.0 | 37.8 | 44.9 | 35.5 | 38.8 | 13.6 | 7.0 | 9.3 | 28.6 |
| *Open-Weight Models* |  |  |  |  |  |  |  |  |  |  |
| Qwen3.5-397B | 57.6 | 30.1 | 39.7 | 44.3 | 30.8 | 35.5 | 12.5 | 0.0 | 4.3 | 26.5 |
| GLM-4.6V | 36.4 | 23.2 | 27.8 | 31.4 | 29.7 | 30.2 | 9.2 | 1.2 | 4.0 | 20.7 |
| Qwen2.5-VL-72B | 28.3 | 29.3 | 28.9 | 35.7 | 29.9 | 31.9 | 2.2 | 0.6 | 1.1 | 20.7 |
| Qwen3-VL-32B | 27.2 | 27.5 | 27.4 | 41.1 | 28.5 | 32.9 | 4.3 | 0.0 | 1.5 | 20.6 |
| Kimi K2.5 | 35.9 | 24.6 | 28.5 | 18.4 | 29.4 | 25.5 | 4.9 | 1.2 | 2.5 | 18.8 |
| Qwen2.5-VL-7B | 23.9 | 32.5 | 29.5 | 27.0 | 22.7 | 24.2 | 7.1 | 0.0 | 2.5 | 18.7 |

Training results on IVP — Qwen2.5-VL-7B base

Success rate (%) under the calibrated 0.5 m / 30° threshold. Our framework lifts a 7B model from 2.5% → 47.8%, beating every proprietary VLM evaluated.

| Method | Short | Long | All |
|---|---|---|---|
| *Prompting Baselines* |  |  |  |
| Qwen2.5-VL-7B-Instruct | 7.1 | 0.0 | 2.5 |
| GPT-5.4 Pro | 32.6 | 11.0 | 18.5 |
| Gemini 3.1 Pro | 28.8 | 17.4 | 21.4 |
| *Training Baselines* |  |  |  |
| Direct PPO | 7.0 | 1.2 | 3.2 |
| Direct GRPO (filter) | 10.8 | 2.2 | 5.2 |
| Success-Only Bootstrapping | 14.0 | 2.0 | 6.2 |
| *Ablations of Our Framework* |  |  |  |
| Random-graph | 20.0 | 4.6 | 10.0 |
| 1 iter + RL | 24.3 | 5.4 | 12.0 |
| 2 iter + RL | 52.4 | 22.6 | 33.0 |
| *Ours (Self-Exploration + View Graph Distillation, 3 iters)* |  |  |  |
| Qwen2.5-VL-7B-Instruct | 67.2 | 36.9 | 47.8 |
| Qwen3-VL-8B-Instruct | 67.5 | 37.5 | 48.0 |

Key Findings

1. Frontier VLMs hit a planning gap

Single-turn understanding ≫ multi-turn planning. The best VLMs reach ~70% on short-horizon P2V/V2P but collapse to ≤21% on Interactive View Planning. Most models score below 10%; on long-horizon samples most fall below 3%.

Dual axis: P2V/V2P degrade with rotation, IVP collapses with translation

P2V/V2P degrade primarily with rotation distance (cumulative rotations are hard to simulate mentally). IVP reverses this: success collapses with position distance, since 3D translation requires spatial layout understanding and path planning beyond simple orientation control.

2. Failed trajectories still teach view transitions

Direct PPO plateaus at 3.2%; GRPO with reward-variance filtering reaches only 5.2%; even iterating PPO with SFT on the small set of successful trajectories (Success-Only Bootstrapping) gets to 6.2%. The breakthrough comes from recognizing that even failed trajectories encode valid view transitions: moving from viewpoint A to B is meaningful supervision regardless of the original goal.

3. Self-Exploration + View Graph Distillation closes the gap

Our iterative framework alternates self-exploration with view graph distillation. Condensing all exploration trajectories (including failures) into a structured graph, then reformulating sampled paths into supervised view-planning demonstrations, takes Qwen2.5-VL-7B from 2.5% → 47.8% on IVP, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Random-graph ablation collapses to 13.0%, confirming on-policy graph construction is critical.

4. The trained agent learns to explore then approach

Tracked 3D point-cloud coverage reveals a clean two-phase strategy: scene coverage grows rapidly in early turns as the agent looks around, then plateaus while target intersection ratio accelerates in the middle turns as the agent moves toward the target. Base and frontier models show flat or erratic target coverage instead.

Two-phase exploration: scene coverage then target intersection
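The two curves in the figure can be computed from per-point visibility masks over the scene point cloud; the definitions below are our reading of the figure, not the paper's exact formulas.

```python
import numpy as np

def coverage_stats(seen_mask, target_mask):
    """seen_mask: scene points observed so far; target_mask: points visible
    from the target view (both boolean arrays over the scene point cloud)."""
    scene_coverage = seen_mask.mean()            # fraction of the scene seen
    target_intersection = ((seen_mask & target_mask).sum()
                           / max(int(target_mask.sum()), 1))
    return scene_coverage, target_intersection
```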

5. View-planning priors transfer to other view-related tasks

Under identical GRPO post-training, our trained model beats its base counterpart by 8 to 12 points on P2V/V2P (within ViewSuite) and by ~10 points on the external MindCube benchmark (no shared scenes / actions / rendering). Interactive view planning is not a narrow skill: the learned spatial priors strengthen view-dependent reasoning both within and beyond ViewSuite.

Cite

If you use ViewSuite or its trained models, please cite our paper.

@article{wang2026viewsuite,
  title   = {VLMs Walk the Scene: View Planning via Scene Self-Exploration},
  author  = {Wang, Kangrui and Li, Linjie and Yang, Zhengyuan and Chen, Shiqi and Wang, Zihan and Fei-Fei, Li and Wu, Jiajun and Guibas, Leonidas and Wang, Lijuan and Li, Manling},
  year    = {2026}
}