Sergio Arnaud

World Models @ Waymo · prev AI Research @ FAIR

About

I work on multimodal models that learn predictive representations of the physical world — how machines build internal models that support prediction, planning, and generalization. I'm currently on the World Models team at Waymo, and previously spent ~3 years as an AI researcher at Meta FAIR.

My work spans self-supervised video representation learning, 3D vision-language grounding, and large-scale multimodal training. I'm increasingly interested in interpretability — understanding what these predictive models actually learn internally.

Before research, I led the AI efforts at deep dive and majored in Applied Mathematics.

Background

Senior ML Engineer

Waymo · March 2026 – Present · Mountain View, CA

World models and multimodal foundation models for autonomous driving

Senior Research Engineer

Meta FAIR · February 2024 – February 2026 · Menlo Park, CA

World models for robotics, 3D vision-language grounding for robotic manipulation, and physical world modeling

AI Resident

Meta FAIR · September 2022 – September 2023 · Menlo Park, CA

Visual representations for robot control, language models for planning, and embodied AI research

Tech Lead (AI)

deep dive (dive.ai) · January 2018 – July 2022 · Mexico City, Mexico

Computer Vision and Natural Language Processing systems

BSc Applied Mathematics

Instituto Tecnológico Autónomo de México (ITAM) · 2020 · Mexico City, Mexico

Graduated with highest honors (Magna Cum Laude), top 3% of students

Featured Publications

World Modeling

Learning predictive models of the world for planning and decision making

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran*, A. Bardes*, D. Fan*, Q. Garrido*, R. Howes*, M. Komeili*, M. Muckley*, A. Rizvi*, C. Roberts*, K. Sinha*, A. Zholus*, Sergio Arnaud*, et al.

arXiv 2025

Human-level Learning of Complex Novel Tasks as Theory-Based Modeling, Exploration, and Planning

P.A. Tsividis, J. Loula, J. Burga, J.P. Rodriguez, Sergio Arnaud, N. Foss, A. Campero, A. Subramanian, T. Pouncy, S.J. Gershman, J.B. Tenenbaum

Philosophical Transactions of the Royal Society A

Visuo-Tactile World Models

Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, Franziska Meier

In Press

DreamSteer: Latent World Models Can Steer VLA Policies During Deployment

H. Cui, Sergio Arnaud, A. Majumdar, D. Dugas, E. Aljalbout, K. Desingh, K.M. Jatavallabhula, F. Meier

In Press

Beyond Latents: Planning with Motion Cues in World Models

S. Yenamandra, Sergio Arnaud, H. Huang, T.-Y. Yang, E. Aljalbout, A. Majumdar, D. Sadigh, H. Bharadhwaj, F. Meier

In Press

Heterogeneous World Models for Cross-Embodiment Transfer

Sergio Arnaud, et al.

In Progress

3D Vision & Spatial Reasoning

Grounding language in 3D space for embodied understanding

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Sergio Arnaud*, P. McVay*, A. Martin*, A. Majumdar, K.M. Jatavallabhula, P. Thomas, R. Partsey, D. Dugas, A. Gejji, A. Sax, et al.

ICML 2025 · Spotlight (top 2.6%)

From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation (LiftGS)

A. Cao, Sergio Arnaud, O. Maksymets, J. Yang, A. Jain, S. Yenamandra, A. Martin, V.-P. Berges, P. McVay, R. Partsey, et al.

ICML 2025

Unifying 2D and 3D Vision-Language Understanding (UniVLG)

A. Jain, A. Swerdlow, Y. Wang, Sergio Arnaud, A. Martin, A. Sax, F. Meier, K. Fragkiadaki

ICML 2025

OpenEQA: Embodied Question Answering in the Era of Foundation Models

A. Majumdar*, A. Ajay*, X. Zhang*, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. McVay, O. Maksymets, Sergio Arnaud, K. Yadav, Q. Li, B. Newman, et al.

CVPR 2024

Representation Learning

Learning visual representations that transfer across embodiments and tasks

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? (VC-1)

A. Majumdar*, K. Yadav*, Sergio Arnaud*, Y.J. Ma, C. Chen, S. Silwal, A. Jain, V.-P. Berges, T. Wu, J. Vakil, et al.

NeurIPS 2023

What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?

S. Silwal*, K. Yadav*, T. Wu*, J. Vakil*, A. Majumdar*, Sergio Arnaud*, C. Chen, V.-P. Berges, D. Batra, A. Rajeswaran, et al.

ICRA 2024

Robot Planning & Skill Coordination

Enabling robots to chain skills and plan complex behaviors

ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation

N. Yokoyama, A. Clegg, J. Truong, E. Undersander, T.-Y. Yang, Sergio Arnaud, S. Ha, D. Batra, A. Rai

IEEE RA-L 2025

Language-Guided Skill Coordination (LSC)

Sergio Arnaud, et al.

CVPR 2024 Demo

Contact