Evaluating Gemini Robotics Policies in a
Veo World Simulator

Gemini Robotics Team
Authors (alphabetical):
Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou

We demonstrate a world model for the full suite of policy evaluation applications in robotics: from in-distribution evaluations, to out-of-distribution generalization, to safety.


Abstract

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.

Method overview

Overview. We present a generative evaluation system built upon a frontier video foundation model (Veo). The base video model is fine-tuned to support robot action conditioning and multi-view video generation, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization.


Action Conditioning. We fine-tune the pretrained Veo model on a large-scale robotics dataset of diverse tasks that cover a broad range of manipulation skills across many scenes. The resulting robotic video generation model is conditioned on the current image observation of the scene and a sequence of future robot poses, and predicts the sequence of future images that corresponds to those poses. The video above shows an example of rendered poses overlaid on the video generated with these poses as conditioning.
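The conditioning interface described above suggests a simple rollout loop in which a policy and the video model alternate. The following is a minimal sketch under stated assumptions, not the system's actual implementation: the function names, the chunked calling pattern, and the placeholder types for frames and poses are all hypothetical, and the toy policy and world model exist only to make the loop runnable.

```python
from typing import Callable, List, Sequence

Frame = str   # placeholder type standing in for an image observation
Pose = float  # placeholder type standing in for a robot pose


def rollout(
    policy: Callable[[Frame, str], Sequence[Pose]],
    world_model: Callable[[Frame, Sequence[Pose]], Sequence[Frame]],
    first_frame: Frame,
    instruction: str,
    num_chunks: int,
) -> List[Frame]:
    """Alternate between policy and video model (hypothetical interface).

    The policy proposes future poses from the latest frame, and the
    action-conditioned video model predicts the frames those poses produce.
    """
    frames: List[Frame] = [first_frame]
    for _ in range(num_chunks):
        poses = policy(frames[-1], instruction)
        predicted = world_model(frames[-1], poses)
        if not predicted:
            break  # nothing generated, so stop the rollout early
        frames.extend(predicted)
    return frames


# Toy stand-ins so the loop can be executed; they model nothing real.
def toy_policy(frame: Frame, instruction: str) -> Sequence[Pose]:
    return [1.0, 2.0]


def toy_world_model(frame: Frame, poses: Sequence[Pose]) -> Sequence[Frame]:
    return [f"{frame}>{pose}" for pose in poses]


video = rollout(toy_policy, toy_world_model, "f0", "pick up the block", num_chunks=2)
```

How a generated rollout is scored as a success or failure is not covered by this sketch.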


Multi-view generation. To mitigate the effect of partial observability, we tile the observations from the four cameras in our setup: the top-down view, the side view, and the left and right wrist views. We fine-tune Veo to generate the tiled future frames conditioned on the pose images described above. The video below shows an example of multi-view generation.
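As a rough illustration of the tiling step, the sketch below arranges four equally sized camera frames into a single 2x2 frame. The camera names and the grid layout are assumptions made for this example, since the text lists the four views but not how they are arranged. Nested lists stand in for image arrays to keep the sketch dependency-free.

```python
from typing import Dict, List

Frame = List[List[int]]  # H x W grid of pixel values; stands in for an image array

# Hypothetical camera names and layout; the arrangement used by the real
# system is not specified in the text.
GRID = [["top_down", "side"], ["left_wrist", "right_wrist"]]


def tile_views(views: Dict[str, Frame]) -> Frame:
    """Arrange four equally sized camera frames into one 2x2 tiled frame."""
    tiled: Frame = []
    for row_names in GRID:
        frames = [views[name] for name in row_names]
        height = len(frames[0])
        if any(len(frame) != height for frame in frames):
            raise ValueError("all views in a grid row must have the same height")
        for y in range(height):
            # Concatenate the y-th pixel row of each view in this grid row.
            tiled.append([pixel for frame in frames for pixel in frame[y]])
    return tiled


def make_frame(value: int, height: int = 2, width: int = 3) -> Frame:
    """Build a constant-valued dummy frame for the usage example."""
    return [[value] * width for _ in range(height)]


views = {
    "top_down": make_frame(0),
    "side": make_frame(1),
    "left_wrist": make_frame(2),
    "right_wrist": make_frame(3),
}
tiled = tile_views(views)  # 4 rows by 6 columns for 2x3 input frames
```

Generating one tiled frame, rather than four independent videos, means all views are produced jointly by a single model call.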

Experiments

We perform 1600+ real-world evaluations with eight generalist policy checkpoints and five tasks in order to demonstrate the following capabilities: (i) accurate prediction of the relative performance and rankings of robot policies in pick-and-place tasks that are within the domain of the system’s training data, (ii) accurate prediction of the relative degradation caused by different axes of OOD generalization (objects, visual background, distractors) for a given policy, as well as of the relative performance of different checkpoints, and (iii) predictive red teaming for safety, in which rolling out policies in edited scenes that involve safety-critical elements lets us discover potential vulnerabilities without hardware evaluations.


Evaluations in nominal scenarios

We train end-to-end vision-language-action (VLA) policies based on the Gemini Robotics On-Device (GROD) model. We then use the fine-tuned Veo (Robotics) model for evaluating policies in nominal (i.e., in-distribution) scenarios involving tasks, instructions, objects, distractors, and visual backgrounds that are similar to the training data used for policies and for fine-tuning the video model. Videos below show examples of video rollouts alongside real-world executions.

Policy comparison

We compare predictions made by Veo (Robotics) for eight VLA policy checkpoints with real-world evaluations. The videos below show examples of Veo (Robotics) rollouts for two policies.

Strong policy

Weak policy

The plot below compares predictions from video rollouts with real-world success rates. We observe that Veo (Robotics) is able to rank the different policies by their performance. We also observe a strong correlation between predicted and actual success rates.
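One way to quantify the two observations above is to compute a rank correlation (do the predicted and real rankings agree?) and a linear correlation (do the success rates track each other?). The sketch below uses Spearman and Pearson coefficients as one reasonable choice; the text does not state which statistics were used. The success rates in the example are made-up placeholders, not results from this work.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder success rates for four hypothetical checkpoints (illustrative only).
predicted = [0.2, 0.5, 0.4, 0.9]  # from video rollouts
actual = [0.1, 0.6, 0.3, 0.8]     # from real-world trials

rank_agreement = spearman(predicted, actual)
linear_agreement = pearson(predicted, actual)
```

A Spearman coefficient near 1 indicates that the simulator orders the checkpoints the same way the real-world trials do, even if the absolute success rates differ.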

Policy Comparison Plot

Evaluating out-of-distribution generalization

We utilize generative image editing to create realistic and diverse variations of real-world scenes for probing OOD generalization. The videos below show examples of Veo (Robotics) rollouts alongside real-world executions in OOD scenarios.


The plot below compares predicted success rates for different OOD conditions with real-world success rates (for a particular policy). We observe that our evaluation method is able to rank the conditions by their impact on performance. We also observe a strong correlation between predicted and actual success rates.
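To make "impact on performance" concrete, the sketch below measures the fractional drop in success rate from the nominal condition for each OOD axis and ranks the axes by that drop. Defining degradation as a fractional drop is an assumption made for this example, and every number in it is a made-up placeholder rather than a result from this work.

```python
from typing import Dict, List


def relative_degradation(nominal: float, ood: Dict[str, float]) -> Dict[str, float]:
    """Fractional drop in success rate from nominal, per OOD axis."""
    if nominal <= 0:
        raise ValueError("nominal success rate must be positive")
    return {axis: (nominal - rate) / nominal for axis, rate in ood.items()}


def rank_axes(degradation: Dict[str, float]) -> List[str]:
    """Axes ordered from most to least damaging."""
    return sorted(degradation, key=lambda axis: degradation[axis], reverse=True)


# Placeholder success rates for a single hypothetical policy (illustrative only).
predicted_drop = relative_degradation(
    nominal=0.8,
    ood={"object": 0.4, "background": 0.6, "distractor": 0.7},
)
real_drop = relative_degradation(
    nominal=0.9,
    ood={"object": 0.5, "background": 0.7, "distractor": 0.8},
)

# The simulator is useful for this purpose if both rankings agree.
same_ranking = rank_axes(predicted_drop) == rank_axes(real_drop)
```

Comparing rankings rather than raw success rates tolerates a systematic offset between simulated and real performance, as long as the ordering of the conditions is preserved.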

OOD Comparison Plot

Evaluating Safety

We demonstrate that the Veo (Robotics) world model allows us to perform red teaming for safety. The videos below show examples of potentially unsafe behaviors found using the world model, and corresponding real-world experiments conducted using props.

Discussion

We are still in the early days of video modeling for robotics. The videos below demonstrate a number of challenges that remain to be addressed, including improved multi-view consistency and more realistic physical interactions.


Multi-view inconsistency

Object appearing

Object duplication

Unrealistic physical interaction

We expect that continued improvements in architectures and robotics-focused data will address many of these challenges. Looking forward, video models hold tantalizing potential: the ability to evaluate robots in an infinitely rich and varied proxy of the world would have a transformative impact on the path towards generalist embodied agents that operate usefully, capably, and safely in real-world environments.

BibTeX


@misc{veorobotics2025,
      title={Evaluating Gemini Robotics Policies in a Veo World Simulator}, 
      author={Gemini Robotics Team and Coline Devin and Yilun Du and Debidatta Dwibedi and Ruiqi Gao and Abhishek Jindal and Thomas Kipf and Sean Kirmani and Fangchen Liu and Anirudha Majumdar and Andrew Marmon and Carolina Parada and Yulia Rubanova and Dhruv Shah and Vikas Sindhwani and Jie Tan and Fei Xia and Ted Xiao and Sherry Yang and Wenhao Yu and Allan Zhou},
      year={2025},
      eprint={2512.10675},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.10675}, 
}
        
