WorldModelBench: Benchmarking World Models for Robot Learning
CoRL 2025 Workshop
Introduction
The rapid emergence of video-based world models has opened exciting new frontiers in robot learning, where such models are increasingly used as predictive simulators for perception, planning, and control. These world models aim to capture the physics and dynamics of real-world environments, enabling embodied agents to reason about the future, perform counterfactual simulations, and learn from visual experience. Despite recent breakthroughs in model design and scalability, evaluating the capabilities of these world models remains an open challenge. Recent benchmarks such as WorldScore, WorldSimBench, VideoPhy, Physics-IQ, and VBench have made important strides in evaluating world models across key dimensions, including controllability, dynamics, physical plausibility, and alignment with prompts and goals. These efforts reveal that while modern models can generate visually compelling videos, many still struggle with physical commonsense, temporal coherence, or robot-action consistency, which are crucial for downstream robotic applications.
This workshop at CoRL 2025 seeks to bring together the community to explore how world models can be systematically evaluated and improved to serve as robust backbones for robotic perception, planning, and interaction. The workshop will facilitate discussions on benchmark design, evaluation protocols, and task-driven validation, with the overarching goal of making world models more actionable and trustworthy for real-world robot learning. Invited presenters and panelists will be drawn from leading groups working on video generation, simulation, and video foundation models, with representation from both academia (e.g., Stanford, HKU, NTU) and industry (e.g., Google, Meta, NVIDIA).
Format and Schedule
The workshop will be structured as a half-day event (approximately 4 hours), featuring a mix of invited talks, a panel discussion, and interactive audience engagement. We will begin with a series of invited talks (20-30 minutes each) from leading researchers in world model development, video generation, and robot learning. These talks will highlight recent advances, emerging benchmarks, and use cases in robotics. Following the talks, we will host a moderated panel discussion with a diverse set of experts from both academia and industry to debate key open questions around benchmarking, generalization, physical realism, and deployment of world models in embodied agents. To ensure active participation, we will solicit questions from the community in advance, allowing audience-driven topics to guide part of the panel. Time will also be allocated for open Q&A and discussion.
Invited Speakers
Ziwei Liu is an Associate Professor at MMLab@NTU, College of Computing and Data Science, Nanyang Technological University, Singapore.
Xihui Liu is an Assistant Professor in the Department of Electrical and Electronic Engineering and the Institute of Data Science (IDS) at The University of Hong Kong.
Ruimao Zhang is a Tenure-Track Associate Professor in the Spatial Artificial Intelligence Lab at the School of Electronics and Communication Engineering, Sun Yat-sen University.
Jiajun Wu is an Assistant Professor of Computer Science and, by courtesy, of Psychology at Stanford University.
Robert Geirhos is a Senior Research Scientist at Google DeepMind (formerly Google Brain), based in Toronto.
Jinwei Gu is a Principal Research Scientist at NVIDIA, working on deep generative models, vision foundation models, world models, and the broader fields of computer vision, computer graphics, and machine learning.
Nicklas Hansen is a PhD candidate at UC San Diego, advised by Prof. Xiaolong Wang and Prof. Hao Su, and is interested in building scalable, robust, and open-source algorithms for decision-making.
Organizers
The workshop is organized by a team of four researchers from NVIDIA and one from Georgia Tech.
Contact
To contact the organizers, please email worldmodelbench2@gmail.com.
Acknowledgments
Thanks to languagefor3dscenes for the webpage template.