Category: AI Works

  • Integration of Senses

    A robot picks up a hot cup. Eyes locate it. Hands feel the heat. Ears hear water pouring. All senses work together—this is Embodied AI. From Tesla Optimus to Boston Dynamics Atlas, humanoid robots are fusing vision, touch, and proprioception. The final chapter: how multimodal AI understands the world like humans do.

  • Text Guides Image

    Noise has no direction. Without text, it stays noise. “A cat flying through space”—this sentence guides the generation. The image asks: what should I become? Text answers through cross-attention. How Stable Diffusion uses Query, Key, Value to turn prompts into pixels.
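
    The question-and-answer exchange between image and text can be sketched as scaled dot-product cross-attention. This is a minimal illustration in plain Python, not Stable Diffusion's actual code: the function names and the toy 2-dimensional vectors are assumptions, and the learned projection matrices that produce Query, Key, and Value in a real model are omitted.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Each image position (query) gathers a weighted mix of text values."""
    dim = len(text_keys[0])
    outputs = []
    for q in image_queries:
        # Similarity of this image position to every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in text_keys]
        weights = softmax(scores)
        # Weighted sum of the text values.
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
    return outputs

# Toy data (hypothetical): two image positions, two text tokens.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0]]
text_values = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attention(image_queries, text_keys, text_values)
```

    In this toy setup the first image position draws mostly on the first text token and the second position on the second, which is the sense in which text "answers" the image.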

  • Language Models That Read Images

    Language models process text. Images are pixels. How can GPT-4V ‘understand’ photos? The answer: three components. A vision encoder converts images to tokens, a projection layer bridges dimensions, and an LLM reasons over both. The architecture behind Vision-Language Models—and why they still hallucinate.
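
    The three components can be sketched as a data flow. Everything below is a toy stand-in under stated assumptions: `encode_image` is a hypothetical encoder that summarizes patches with simple statistics, the projection weights are arbitrary rather than learned, and the final "LLM" step is represented only by the sequence it would receive.

```python
VISION_DIM, LLM_DIM = 3, 4  # assumed toy widths

def encode_image(pixels, patch_size):
    """Stand-in vision encoder: one [mean, min, max] feature per patch."""
    patches = [pixels[i:i + patch_size]
               for i in range(0, len(pixels), patch_size)]
    return [[sum(p) / len(p), min(p), max(p)] for p in patches]

def project(features, weight):
    """Projection layer: map vision features to the LLM's embedding width."""
    return [[sum(f[i] * weight[i][j] for i in range(len(f)))
             for j in range(len(weight[0]))]
            for f in features]

# Arbitrary (not learned) projection weights, shape VISION_DIM x LLM_DIM.
weight = [[0.1 * (i + j) for j in range(LLM_DIM)] for i in range(VISION_DIM)]

# Hypothetical inputs: a flat 8-pixel "image" and two text token embeddings.
pixels = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.5, 0.5]
text_embeddings = [[0.0] * LLM_DIM, [1.0] * LLM_DIM]

image_tokens = project(encode_image(pixels, patch_size=4), weight)
# The LLM would reason over image tokens and text tokens as one sequence.
llm_input = image_tokens + text_embeddings
```

    The point of the projection step is only that image tokens end up with the same width as text embeddings, so both can sit in a single input sequence.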

  • Contrast Creates Meaning

    Hand-made labels aren’t necessary. ImageNet needed 25,000 workers to label 14 million images. But the internet already has the answers: 400 million image-text pairs. CLIP learned from those captions instead of manual class labels, and it classifies categories it was never explicitly trained on. How contrastive learning aligned images and text into one space.
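
    The alignment objective can be sketched as a symmetric contrastive loss over a batch of image-text pairs. This is a minimal plain-Python illustration, not CLIP's implementation: the function names and the toy embeddings are assumptions, and the temperature is fixed here for simplicity rather than learned.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Pair k's image should match pair k's text, and vice versa."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings (hypothetical): matched pairs vs. the same texts swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

    The loss is low when each image sits closest to its own caption and high when the captions are swapped, which is what pulls matching pairs together in the shared space.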

  • Into a Shared Space

    2012. CNNs conquered images. 2017. The Transformer conquered text. But each lived in a separate world, producing vectors that couldn’t be compared. What if a cat photo and the word “cat” existed at the same location? A shared embedding space makes this possible. How CLIP and ImageBind unified different senses into one language.

  • Learning from Human Feedback

    November 2022. ChatGPT launched. An estimated 100 million users in two months. But GPT-3, with its 175 billion parameters, had existed since 2020. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.

  • The Critic and the Actor

    “I’ll do it this way.” The actor speaks. “That’s not great.” The critic responds. The most successful structure in reinforcement learning separates action and evaluation. Actor-Critic combines value-based efficiency with policy-based flexibility—the foundation of A2C, A3C, PPO, SAC, and ChatGPT’s RLHF.

  • Learning the Policy Directly

    Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, Policy Gradient learns actions directly—no value function required. REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.

  • The Number Called Value

    10,000 won tomorrow or 9,000 won today—which is more valuable? This question sits at the heart of reinforcement learning. Value functions compress uncertain futures into present numbers. The Bellman equation, TD learning, and Q-learning: the mathematics of foresight.
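
    The won comparison comes down to a discount factor, and the same idea drives the Q-learning update. The sketch below is illustrative only: the function names, the one-step delay, and the specific values of `gamma` and `alpha` are assumptions chosen for the example.

```python
def discounted_value(reward, gamma, delay_steps):
    """Present value of a reward received after `delay_steps` steps."""
    return reward * gamma ** delay_steps

def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning step toward reward + gamma * max Q(next)."""
    next_values = q[next_state].values()
    best_next = max(next_values) if next_values else 0.0  # terminal state: 0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

# 10,000 won tomorrow vs. 9,000 won today, for two assumed discount factors.
patient = discounted_value(10_000, gamma=0.95, delay_steps=1)
impatient = discounted_value(10_000, gamma=0.80, delay_steps=1)

# Hypothetical two-state table: taking 9,000 won now ends the episode.
q_table = {"start": {"take_now": 0.0, "wait": 0.0}, "end": {}}
updated = q_learning_update(
    q_table, "start", "take_now", reward=9_000,
    next_state="end", alpha=0.5, gamma=0.9,
)
```

    With these numbers the patient agent values tomorrow's 10,000 won above 9,000 today and the impatient one does not; the break-even point is a discount factor of 0.9.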

  • Reward Shapes Behavior

    2016. A robot arm tries to pick up a cup. First attempt: miss. Hundredth attempt: drops it. Thousandth attempt: success. No one said “grasp like this.” Just a signal: pick up = +1, miss = 0. Reward shaped the behavior. This is reinforcement learning—learning without correct answers.