Author: Luca
-

Integration of Senses
A robot picks up a hot cup. Eyes locate it. Hands feel the heat. Ears hear water pouring. All senses work together—this is Embodied AI. From Tesla Optimus to Boston Dynamics Atlas, humanoid robots are fusing vision, touch, and proprioception. The final chapter: how multimodal AI understands the world like humans do.
-

Text Guides Image
Noise has no direction. Without text, it stays noise. “A cat flying through space”—this sentence guides the generation. The image asks: what should I become? Text answers through cross-attention. How Stable Diffusion uses Query, Key, Value to turn prompts into pixels.
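
As a rough illustration of the Query, Key, Value idea, here is a minimal sketch of cross-attention in plain Python. It assumes the image side supplies the queries and the text side supplies the keys and values. The function names are hypothetical, and the learned projection matrices that a real model applies before this step are left out.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per image position.
    keys, values: one vector each per text token."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Score each text token against this image position (scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the text values according to the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out

# One image position that matches the first of two text tokens.
result = cross_attention([[1.0, 0.0]],
                         [[10.0, 0.0], [0.0, 10.0]],
                         [[1.0, 0.0], [0.0, 1.0]])
```

Each image position ends up with a blend of text values, weighted toward the tokens it matches best.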
-

Language Models That Read Images
Language models process text. Images are pixels. How can GPT-4V ‘understand’ photos? The answer: three components. A vision encoder converts images to tokens, a projection layer bridges dimensions, and an LLM reasons over both. The architecture behind Vision-Language Models—and why they still hallucinate.
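
The three components can be sketched as a toy pipeline. Everything here is a stand-in chosen for illustration: the "vision encoder" just flattens image patches, the projection is a single hand-written weight matrix, and the LLM itself is not shown. It only demonstrates how image features end up in the same sequence as text embeddings.

```python
def vision_encoder(image, patch=2):
    """Stand-in encoder: flatten each patch x patch block into one feature vector.
    Assumes the image height and width are multiples of `patch`."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            feats.append([image[r + i][c + j]
                          for i in range(patch) for j in range(patch)])
    return feats

def project(feats, weight):
    """Linear projection from the vision feature width to the LLM embedding width."""
    return [[sum(f[k] * weight[k][j] for k in range(len(f)))
             for j in range(len(weight[0]))]
            for f in feats]

def build_llm_input(image, text_embeddings, weight):
    """Image tokens first, then text tokens: one sequence for the LLM to reason over."""
    image_tokens = project(vision_encoder(image), weight)
    return image_tokens + text_embeddings

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]   # 4x4 "photo"
weight = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.1]]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                          # two text tokens
sequence = build_llm_input(image, text, weight)
```

The projection layer is what lets vectors of one width sit beside vectors of another.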
-

Learning from Human Feedback
November 2022. ChatGPT launched. 100 million users in about two months. But GPT-3, with 175 billion parameters, had existed since 2020. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.
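
One common way to turn preferences into a reward signal is a pairwise loss: a reward model scores two answers, and the loss is small when the answer humans preferred scores higher. The sketch below assumes that formulation. The function name and the example scores are hypothetical.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).
    Small when the human-preferred answer gets the higher reward."""
    diff = r_chosen - r_rejected
    # Written in a numerically stable form to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

agrees = preference_loss(2.0, 0.0)      # reward model agrees with the human
disagrees = preference_loss(0.0, 2.0)   # reward model disagrees
```

Training the reward model to lower this loss pushes its scores toward the human ranking.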
-

The Critic and the Actor
“I’ll do it this way.” The actor speaks. “That’s not great.” The critic responds. The most successful structure in reinforcement learning separates action and evaluation. Actor-Critic combines value-based efficiency with policy-based flexibility—the foundation of A2C, A3C, PPO, SAC, and ChatGPT’s RLHF.
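
The split between actor and critic can be shown with a one-step tabular update. This is a minimal sketch, assuming a softmax policy over discrete actions and a table of state values. The names and step sizes are illustrative choices, and it does not reproduce A2C, PPO, or SAC.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(prefs, values, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """prefs[s]: action preferences for state s (the actor).
    values[s]: estimated value of state s (the critic)."""
    # Critic's verdict: was the outcome better or worse than expected?
    target = r + (0.0 if done else gamma * values[s_next])
    td_error = target - values[s]
    # Actor: move preferences along the gradient of log pi(a|s), scaled by the verdict.
    probs = softmax(prefs[s])
    for b in range(len(probs)):
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += alpha_actor * td_error * grad
    # Critic: move the value estimate toward the target.
    values[s] += alpha_critic * td_error
    return td_error

prefs = [[0.0, 0.0], [0.0, 0.0]]   # two states, two actions
values = [0.0, 0.0]
delta = actor_critic_step(prefs, values, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive TD error makes the chosen action more likely, and a negative one makes it less likely.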
-

Learning the Policy Directly
Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, Policy Gradient learns actions directly—no value function required. REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.
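
A minimal sketch of REINFORCE for a stateless softmax policy over discrete actions follows. The function name and episode format are assumptions for illustration. Continuous action spaces would need a different policy, such as a Gaussian, and the optional per-step discount factor on the gradient is omitted.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """theta: action preferences of the policy.
    episode: list of (action, reward) pairs in time order.
    Returns updated preferences; no value function is involved."""
    probs = softmax(theta)
    # Compute the return G_t that followed each step, working backwards.
    returns, g = [], 0.0
    for _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    # Accumulate G_t * grad log pi(a_t) over the episode.
    grad = [0.0] * len(theta)
    for (action, _), g_t in zip(episode, returns):
        for b in range(len(theta)):
            grad[b] += g_t * ((1.0 if b == action else 0.0) - probs[b])
    return [t + alpha * gr for t, gr in zip(theta, grad)]

new_theta = reinforce_update([0.0, 0.0], [(0, 1.0)])
```

Actions followed by high returns become more probable. Nothing else is estimated.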
-

The Number Called Value
10,000 won tomorrow or 9,000 won today—which is more valuable? This question sits at the heart of reinforcement learning. Value functions compress uncertain futures into present numbers. The Bellman equation, TD learning, and Q-learning: the mathematics of foresight.
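
The question can be made concrete with a discount factor. The sketch below assumes gamma = 0.9, at which 10,000 won tomorrow is worth exactly as much as 9,000 won today, and then shows a single tabular Q-learning update. The function names, the Q-table layout, and the example numbers are all illustrative.

```python
def discounted_return(rewards, gamma):
    """Present value of a reward sequence: r0 + gamma*r1 + gamma^2*r2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning step on a dict-of-dicts table q[state][action]."""
    next_actions = q.get(next_state, {})
    best_next = max(next_actions.values()) if next_actions else 0.0
    # TD target: reward now plus the discounted best value from the next state.
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

tomorrow = discounted_return([0.0, 10000.0], gamma=0.9)   # nothing today, 10,000 tomorrow
today = discounted_return([9000.0], gamma=0.9)            # 9,000 today

q = {"s": {"go": 0.0}, "t": {"go": 1.0}}
updated = q_learning_update(q, "s", "go", reward=1.0, next_state="t",
                            alpha=0.5, gamma=0.9)
```

With a smaller gamma the money today wins. With a larger one, waiting wins.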
-

Reward Shapes Behavior
2016. A robot arm tries to pick up a cup. First attempt: miss. Hundredth attempt: drops it. Thousandth attempt: success. No one said “grasp like this.” Just a signal: pick up = +1, miss = 0. Reward shaped the behavior. This is reinforcement learning—learning without correct answers.


