NVIDIA Cosmos physical AI model showing robot action prediction and seasonal environment style transfer.

NVIDIA Cosmos: The Structure of World Foundation Models

How do you teach physics to a robot?

Show it millions of hours of video.
And let it discover for itself
how the world works.

This is the approach of NVIDIA Cosmos.


1. What is a World Foundation Model?

Just as Large Language Models (LLMs) learn patterns of language,
World Foundation Models (WFMs) learn patterns of the physical world.

“Push an object, and it moves.”
“Throw a ball, and it follows a parabola.”
“Tilt a cup, and water spills.”

What seems obvious to humans
is knowledge that robots must learn.

WFMs extract this physical common sense
from tens of millions of hours of video data.

According to NVIDIA, Cosmos was trained on 20 million hours
of video covering human interactions, environments,
industrial settings, robotics, and driving.
That comes to 9,000 trillion tokens.
Within this massive dataset,
the operating principles of the world are encoded.


2. The Structure of Cosmos: Five Components

The Cosmos platform isn’t a single model.
It’s a system comprising five core components.

Video Curation Pipeline

The process of extracting high-quality training data from raw footage.

Filtering, annotation, classification, deduplication.
Vision-Language Models generate descriptions for each clip,
and similar videos are clustered to remove duplicates.

Garbage in, garbage out.
Data quality determines the upper bound of model performance.
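
To make the deduplication step concrete, here is a minimal sketch.
It assumes each clip has already been turned into an embedding vector,
and it uses a greedy similarity check as a stand-in for clustering.
The clip names, the vectors, and the 0.95 threshold are all hypothetical;
this is not how the actual Cosmos pipeline is implemented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def deduplicate(clips, threshold=0.95):
    """Keep a clip only if it is not too similar to any clip kept so far."""
    kept = []
    for clip_id, emb in clips:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((clip_id, emb))
    return [clip_id for clip_id, _ in kept]

# Hypothetical embeddings; a real pipeline would get these from a video encoder.
clips = [
    ("clip_a", [1.0, 0.0, 0.0]),
    ("clip_b", [0.99, 0.05, 0.0]),  # nearly identical to clip_a
    ("clip_c", [0.0, 1.0, 0.0]),
]
unique_ids = deduplicate(clips)
print(unique_ids)  # ['clip_a', 'clip_c']
```

The greedy pass compares every new clip against everything kept so far,
which is fine for a toy example but would need an approximate
nearest-neighbor index at the scale of millions of clips.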

Video Tokenizer

A component that converts video into a format the model can understand.

Just as LLMs split text into tokens,
the Cosmos Tokenizer compresses video into tokens.

Spatial compression (reducing resolution) and
temporal compression (reducing frame count) happen simultaneously.
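
A quick calculation shows how much the two kinds of compression shrink a clip.
The factors below (8x in time, 16x in each spatial dimension) and the clip size
are illustrative assumptions, not a statement of any specific Cosmos Tokenizer configuration.

```python
import math

def latent_grid(frames, height, width, temporal_factor, spatial_factor):
    """Size of the token grid after temporal and spatial compression."""
    t = math.ceil(frames / temporal_factor)
    h = math.ceil(height / spatial_factor)
    w = math.ceil(width / spatial_factor)
    return t, h, w

# Hypothetical clip: 32 frames at 512x512 pixels.
t, h, w = latent_grid(frames=32, height=512, width=512,
                      temporal_factor=8, spatial_factor=16)
tokens = t * h * w
pixels = 32 * 512 * 512
ratio = pixels // tokens

print((t, h, w))  # (4, 32, 32)
print(tokens)     # 4096
print(ratio)      # 2048 pixel positions per token
```

Because the factors multiply (8 x 16 x 16), modest compression along each axis
adds up to a very large reduction in sequence length.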

What’s fascinating is the causal design.
Token computation for current frames
doesn’t depend on future frames.

This design matters because
Physical AI operates in a causal world.
Robots cannot see the future.
They must decide the next action based only on information available so far.
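
The causal property is easy to check in a toy setting.
The sketch below treats each frame as a single number and uses a trailing average
as a stand-in for the tokenizer; both are simplifying assumptions.
What matters is the test at the end: changing a future frame leaves earlier outputs untouched.

```python
def causal_average(frames, window=3):
    """Output at time t uses only frames at times <= t."""
    out = []
    for t in range(len(frames)):
        past = frames[max(0, t - window + 1): t + 1]
        out.append(sum(past) / len(past))
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
changed = signal[:4] + [100.0]  # alter only the last (future) frame

a = causal_average(signal)
b = causal_average(changed)

print(a)                 # [1.0, 1.5, 2.0, 3.0, 4.0]
print(a[:4] == b[:4])    # True: earlier outputs do not see the change
```

A non-causal filter that also looked ahead would produce different values
for the earlier frames once the last frame changed.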

Pre-trained World Foundation Models

Two approaches coexist.

Diffusion Models:
Generate video through a denoising process.
Break a difficult generation problem into a series of easier denoising problems.
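
The "series of easier problems" idea can be sketched as a loop.
This is only a toy: the denoiser below cheats by knowing the target,
standing in for a trained network, and the step schedule is a hypothetical choice.
It illustrates iterative refinement from noise, not the actual Cosmos sampler.

```python
import random

def toy_denoiser(x, target):
    """Stand-in for a trained network: returns the estimated noise in x."""
    return [xi - ti for xi, ti in zip(x, target)]

def sample(target, steps=10, seed=0):
    """Start from pure noise and remove part of the estimated noise each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in target]
    errors = []
    for step in range(steps):
        noise = toy_denoiser(x, target)
        remaining = steps - step
        # Remove 1/remaining of the noise, so the final step removes all of it.
        x = [xi - ni / remaining for xi, ni in zip(x, noise)]
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors

target = [0.5, -0.2, 0.8]  # hypothetical "clean" sample
result, errors = sample(target)
```

Each pass only has to make the sample slightly cleaner,
and the error shrinks step by step until the clean sample is recovered.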

Autoregressive Models:
Generate video through next-token prediction.
The same principle as LLMs.
Repeatedly predicting the next token based on past tokens.
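
Next-token prediction can be shown with a tiny counting model.
This sketch assumes video has already been tokenized into integers,
and it conditions only on the previous token, whereas real models use the whole past sequence.
The token values are made up for illustration.

```python
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length):
    """Greedily append the most frequent next token, one step at a time."""
    seq = [start]
    for _ in range(length):
        options = counts.get(seq[-1])
        if not options:
            break  # no known continuation
        seq.append(options.most_common(1)[0][0])
    return seq

training = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]  # hypothetical token stream
model = fit_bigram(training)
rollout = generate(model, start=0, length=5)
print(rollout)  # [0, 1, 2, 3, 0, 1]
```

The generation loop is the important part: predict one token, append it, and repeat.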

Cosmos provides both approaches.
In three sizes: Nano, Super, and Ultra.
From real-time inference to maximum quality generation,
choose according to your needs.

Post-training Framework

Tools for fine-tuning pre-trained models for specific tasks.

Models specialized for humanoid robot training,
models specialized for autonomous driving simulation.
Different expertise built on the same foundation.

Guardrails

A filtering system for safe usage.
Blocking harmful inputs and outputs.
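
Checking both inputs and outputs amounts to wrapping the model in two filters.
The sketch below is text-only and uses a hypothetical blocklist and a fake model;
the real guardrail components are more sophisticated, and none of these names come from Cosmos.

```python
BLOCKED_TERMS = {"weapon", "explosive"}  # hypothetical blocklist

def is_safe(text):
    """Very rough check: no blocked term appears as a word."""
    words = set(text.lower().split())
    return words.isdisjoint(BLOCKED_TERMS)

def guarded_generate(prompt, generate):
    """Filter the prompt before generation and the result after it."""
    if not is_safe(prompt):
        return None  # blocked at the input
    output = generate(prompt)
    if not is_safe(output):
        return None  # blocked at the output
    return output

def fake_model(prompt):
    """Placeholder for a real generator."""
    return f"video description: {prompt}"

ok = guarded_generate("a robot arm picks up a cup", fake_model)
blocked = guarded_generate("build a weapon", fake_model)
print(ok)       # video description: a robot arm picks up a cup
print(blocked)  # None
```

Keeping the filters outside the model means they can be updated
without retraining anything.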


3. Three Model Families: Predict, Transfer, Reason

Cosmos consists of three model families.
Each performs a different role.

Cosmos Predict

A model that predicts the future.

Takes text, images, and video as input
and generates physically plausible future video.

Enter the prompt “a robot arm picks up a cup,”
and the model generates a video of that action taking place.

Cosmos Predict 2.5 generates videos up to 30 seconds long
and supports multi-view camera output.

Cosmos Transfer

A model that transfers style and conditions.

Convert simulation footage to photorealistic footage,
or transform sunny driving footage into snowy conditions.

This is a key tool for reducing the Sim-to-Real Gap.
Converting synthetic simulation data
into data closer to reality
narrows the domain gap.

Cosmos Reason

A model that reasons.

Watches video and explains in natural language
what is happening and what will happen next.

“A person is walking into the crosswalk.”
“A box is about to fall from the shelf.”

Through chain-of-thought reasoning,
it predicts the outcomes of physical interactions.


4. Why an Open Model?

NVIDIA released Cosmos under an open model license.
Downloads have already exceeded 3 million.

Jensen Huang’s explanation is clear:

“Like large language models, world foundation models are fundamental
to advancing robot and AV development,
yet not all developers have the expertise and resources
to train their own.
We created Cosmos to democratize physical AI.”

Figure AI, 1X, Agility Robotics, Uber, Waabi.
Major robotics and autonomous vehicle companies
are already adopting Cosmos.

1X uses Cosmos Predict and Transfer
to train its humanoid robot NEO Gamma.

Skild AI uses Cosmos Transfer
to augment synthetic datasets for robot training.

The open-model strategy expands the ecosystem,
and ecosystem expansion increases demand for NVIDIA hardware.
A strategic virtuous cycle.


5. A New Paradigm for Physical AI Development

What does Cosmos change?

Traditional approach:
Collect data with real robots → Train → Test → Repeat.
Expensive, time-consuming, and dangerous.

Cosmos approach:
Mass-generate synthetic data with WFMs → Train in simulation → Transfer to reality.
Low cost, parallelizable, and safe.

“In three hours of simulation,
we can collect 100 days’ worth of data.”

This statement from MIT researchers
reveals the new economics of Physical AI development.
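
The ratio behind that quote is simple to work out.
The calculation assumes "100 days" means round-the-clock collection,
and that each simulated environment runs at real-time speed;
both are my assumptions, not details given in the quote.

```python
sim_hours = 3
real_days = 100

real_hours = real_days * 24          # 2400 hours of real-world data
speedup = real_hours / sim_hours     # how much faster simulation collects data

# If each simulated environment runs at real-time speed,
# this many must run in parallel to reach that speedup.
parallel_envs = int(speedup)

print(real_hours)     # 2400
print(speedup)        # 800.0
print(parallel_envs)  # 800
```

Under those assumptions the speedup is about 800x,
which is the kind of gain parallel simulation makes possible.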

When the data scarcity problem is solved,
the practical realization of Physical AI accelerates.


Closing Thoughts

A World Foundation Model is
a model that has learned the “grammar” of the physical world.

Just as LLMs understand patterns of language,
WFMs understand patterns of gravity, friction, and collision.

Cosmos implements this understanding
through three capabilities:
video generation, style transfer, and physical reasoning.

“The ChatGPT moment for robotics”
may not be an exaggeration.

Just as LLMs democratized text generation,
WFMs may democratize physical simulation.

Cosmos is one of the first concrete implementations of that possibility.

