VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang*1   Max Ku*1   Ka-Hei Hui2   Ping Nie3   Wenhu Chen1
1University of Waterloo   2Autodesk AI Lab   3Independent   *Equal contribution

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. Because the model must produce runnable code, its inferred world representation is directly inspectable, editable, and falsifiable, which separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos for 97.7% of benchmark scenes. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.

VisPhyWorld

Overview

We propose VisPhyWorld, a framework that uses MLLMs to interpret raw video frames and generate executable simulation code for predicting future motion. To our knowledge, this is the first paradigm that evaluates physical reasoning in MLLMs through code reconstruction and re-simulation. By making object states and dynamics explicit, VisPhyWorld provides a direct and interpretable view of a model’s physical understanding.
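To make the protocol concrete, the minimal Python sketch below outlines the reconstruction loop. The helpers sample_frames, query_mllm, and execute_simulation are hypothetical placeholders rather than the released pipeline API; the actual system targets engines such as Three.js and P5.js.

# Minimal sketch of the code-driven reconstruction loop (illustrative only).
# sample_frames, query_mllm, and execute_simulation are hypothetical stubs,
# not the project's actual API.
from pathlib import Path
from typing import List

def sample_frames(video_path: Path, num_frames: int = 8) -> List[bytes]:
    """Uniformly sample raw frames from the observed clip (stub)."""
    raise NotImplementedError  # e.g., decode with OpenCV/ffmpeg in practice

def query_mllm(frames: List[bytes], engine: str) -> str:
    """Ask the MLLM to emit executable simulator code for the observed scene (stub)."""
    raise NotImplementedError  # the prompt names the target engine and includes the frames

def execute_simulation(code: str, engine: str, num_frames: int) -> Path:
    """Run the generated program in a sandbox and render a rollout video (stub)."""
    raise NotImplementedError

def reconstruct(video_path: Path, engine: str = "three.js") -> Path:
    """Frames -> explicit, inspectable world hypothesis (code) -> re-simulated video."""
    frames = sample_frames(video_path)
    code = query_mllm(frames, engine)
    return execute_simulation(code, engine, num_frames=len(frames))

The re-simulated video is then compared against the ground-truth clip using the metrics described in VisPhyBench below.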

VisPhyWorld framework overview

Engine choice

Engine choice matters for physical fidelity. Physics-native backends such as Three.js and P5.js let the generated program offload gravity, collisions, friction, and contact constraints to a rigid-body solver, producing more physically consistent rollouts. Non-physics renderers (e.g., SVG/Manim), by contrast, often degrade into heuristic motion scripting, with artifacts such as static objects or interpenetration.
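The toy one-dimensional Python example below illustrates this gap; it is not the Three.js/P5.js backends themselves, and the time step and restitution are assumed values. A solver-style rollout resolves the ground contact, while a heuristic motion script lets the object sink through the floor.

# Illustrative 1-D contrast (not the actual Three.js/P5.js backends): a solver-style
# rollout enforces a ground-contact constraint, while a heuristic motion script
# merely interpolates positions and interpenetrates the floor.
from typing import List

GRAVITY = -9.81   # m/s^2
DT = 1.0 / 60.0   # simulation time step (assumed)
RADIUS = 0.5      # ball radius, so contact occurs at y == RADIUS

def solver_rollout(y0: float, vy0: float, steps: int) -> List[float]:
    """Semi-implicit Euler with contact resolution and restitution (assumed 0.6)."""
    y, vy, traj = y0, vy0, []
    for _ in range(steps):
        vy += GRAVITY * DT
        y += vy * DT
        if y < RADIUS:        # contact: push out of penetration and bounce
            y = RADIUS
            vy = -0.6 * vy
        traj.append(y)
    return traj

def scripted_rollout(y0: float, steps: int) -> List[float]:
    """Heuristic 'animation' that ignores contacts entirely."""
    return [y0 - 0.05 * t for t in range(steps)]

if __name__ == "__main__":
    print(min(solver_rollout(3.0, 0.0, 240)))   # stays at/above the contact height
    print(min(scripted_rollout(3.0, 240)))      # drops below the ground plane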

Engine evaluation example: physics-native backends vs non-physics renderers

VisPhyBench

Overview

To evaluate how well models reconstruct appearance and reproduce physically plausible motion, we introduce VisPhyBench, a unified evaluation protocol comprising 209 scenes derived from 108 physical templates. It assesses physical understanding through the lens of code-driven re-simulation in both 2D and 3D settings, integrating metrics that cover complementary aspects of appearance and motion. Each scene is also annotated with a coarse difficulty label (easy/medium/hard).
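A hypothetical per-scene record is sketched below in Python; the field names and example values are illustrative placeholders, not the released benchmark schema.

# Hypothetical per-scene record for VisPhyBench. Field names and example values
# are illustrative placeholders, not the released schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class BenchScene:
    scene_id: str                                  # e.g., "task10063_000"
    template_id: str                               # one of the 108 physical templates
    dimensionality: Literal["2d", "3d"]
    difficulty: Literal["easy", "medium", "hard"]  # coarse difficulty label
    gt_video: str                                  # path to the ground-truth clip

# 209 such records make up the benchmark; each is re-simulated from
# model-generated code and scored on appearance and motion against gt_video.
example = BenchScene(
    scene_id="scene_000",          # placeholder id
    template_id="tmpl_000",        # placeholder template
    dimensionality="2d",
    difficulty="easy",
    gt_video="videos/scene_000.mp4",
)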

Teaser figure

VisPhyBench Advantages

Existing work has evaluated physical reasoning using video prediction benchmarks that test whether predicted future dynamics remain consistent, and Violation-of-Expectation paradigms that measure whether physically impossible events trigger greater predictive surprise. These approaches fit generative world models with explicit prediction objectives, but they do not transfer cleanly to MLLMs, which primarily generate text rather than predictive distributions or future videos. More recent benchmarks extend this setting to generative video models, while MLLM-based evaluators typically frame assessment as recognition tasks such as visual question answering. Although effective for probing high-level physical knowledge, recognition-style protocols can conflate true physical reasoning with appearance-based heuristics and dataset bias. In contrast, our framework improves interpretability by requiring models to output explicit, executable physical hypotheses that are validated through simulation, making success and failure easier to attribute.

Table 1

Experiment Results

Main Results

Table 2 summarizes performance across five metric families on VisPhyBench. Overall, most models achieve strong reconstruction and perceptual scores and maintain reasonable visual-semantic consistency. These results support our central claim: once the task is cast as executable hypotheses under a fixed physics engine, most modern MLLMs can reconstruct synthetic physical events with high fidelity, and the remaining gaps become diagnosable rather than opaque.

Table 2

Motion and physical plausibility.

Assessing physical plausibility requires balancing motion statistics with perceptual coherence, since any single metric can be misleading. RAFT-EPE captures optical-flow agreement, yet some models achieve deceptively good flow by producing static or low-information outputs, while others match appearance well but violate event logic. We therefore adopt a joint evaluation strategy: models demonstrate valid physical understanding only when they exhibit both correct motion dynamics (low flow error) and holistic physical coherence under the Gemini rubric. Our case studies support this criterion: GPT-5 in Three.js faithfully simulates collision-driven interactions, whereas pixel-space baselines can look semantically plausible yet hallucinate dynamics; conversely, Qwen3-VL-Plus can appear favorable under flow alone but is penalized by holistic judgment. The same dissociation persists in perspective-rendered 3D scenes with depth-dependent contacts and occlusions, where semantic similarity fails to separate physically incorrect reconstructions from correct ones. Overall, credible physical understanding is evidenced only when motion fidelity and holistic visual plausibility are simultaneously satisfied, and 3D scenes are necessary to stress-test reconstruction-based physical reasoning.
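As a sketch, the joint acceptance rule can be written as the conjunction below; the thresholds and score scales are assumptions for illustration, not the benchmark's calibrated values.

# Sketch of the joint criterion: a reconstruction counts as physically credible
# only if it shows BOTH low optical-flow error (RAFT-EPE vs. the ground-truth
# clip) AND a passing holistic score from the rubric-based judge. Thresholds and
# score scales are assumed for illustration.
from dataclasses import dataclass

@dataclass
class SceneScores:
    raft_epe: float       # mean end-point error of predicted vs. ground-truth flow
    rubric_score: float   # holistic physical-coherence score from the Gemini judge

def is_physically_credible(scores: SceneScores,
                           epe_threshold: float = 2.0,
                           rubric_threshold: float = 0.7) -> bool:
    """Both criteria must hold: either alone can be gamed (static outputs fool
    flow error; plausible-looking appearance fools the holistic judge)."""
    return scores.raft_epe <= epe_threshold and scores.rubric_score >= rubric_threshold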

Case study: task10063_000
Case study: dataset3D 14_tmpl_14

Case Study Examples

BibTeX

@misc{visphyworld2026,
  title   = {VisPhyWorld},
  author  = {VisPhyWorld Team},
  year    = {2026},
  note    = {Project page},
  url     = {https://github.com/TIGER-AI-Lab/VisPhyWorld}
}