Tag: PPO

  • Learning from Human Feedback

    November 2022. ChatGPT launched. 100 million users in 2 months. But GPT-3, with its 175 billion parameters, had been around since 2020. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.
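    To make “preferences become the reward function” concrete, here is a minimal sketch, assuming a toy linear reward model and synthetic feature vectors (all names and data below are illustrative, not the actual ChatGPT pipeline): responses annotators preferred are trained to score higher than the ones they rejected via a Bradley-Terry style pairwise loss, and the learned score can then serve as the reward in a PPO fine-tuning loop.

    ```python
    # Toy sketch: turn pairwise human preferences into a scalar reward model.
    # Everything here (features, data, sizes) is an illustrative stand-in.
    import numpy as np

    rng = np.random.default_rng(0)

    dim = 8
    chosen = rng.normal(size=(64, dim)) + 0.5   # responses annotators preferred (toy features)
    rejected = rng.normal(size=(64, dim))       # responses annotators rejected

    w = np.zeros(dim)                           # linear reward model: r(x) = w @ x
    lr = 0.1

    for step in range(200):
        # Bradley-Terry style pairwise loss: -log sigmoid(r(chosen) - r(rejected))
        margin = chosen @ w - rejected @ w
        p = 1.0 / (1.0 + np.exp(-margin))       # P(chosen preferred | current model)
        grad = -((1.0 - p)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad                          # gradient descent on the preference loss

    print("mean reward (chosen):  ", float((chosen @ w).mean()))
    print("mean reward (rejected):", float((rejected @ w).mean()))
    # The learned r(x) = w @ x is what a PPO loop would then maximize.
    ```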

  • The Critic and the Actor

    “I’ll do it this way.” The actor speaks. “That’s not great.” The critic responds. The most successful structure in reinforcement learning separates action and evaluation. Actor-Critic combines value-based efficiency with policy-based flexibility—the foundation of A2C, A3C, PPO, SAC, and ChatGPT’s RLHF.
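    As a rough illustration of that split, the sketch below runs a tabular actor and critic on a toy five-state corridor (the environment and hyperparameters are invented for illustration): the actor proposes actions from a softmax policy, the critic judges them through the TD error, and each side learns from the other.

    ```python
    # Minimal one-step actor-critic on a toy corridor: reach the right end for reward 1.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, gamma, alpha, beta = 5, 0.99, 0.1, 0.2
    logits = np.zeros((n_states, 2))   # actor: action preferences {left, right} per state
    values = np.zeros(n_states)        # critic: estimated state values V(s)

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    for episode in range(500):
        s = 0
        while True:
            probs = softmax(logits[s])
            a = rng.choice(2, p=probs)                 # actor: "I'll do it this way"
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == n_states
            r = 1.0 if done else 0.0                   # reward only at the right end
            v_next = 0.0 if done else values[s_next]
            td_error = r + gamma * v_next - values[s]  # critic: "that's (not) great"
            values[s] += beta * td_error               # critic improves its estimate
            grad_logp = -probs
            grad_logp[a] += 1.0                        # d log pi(a|s) / d logits
            logits[s] += alpha * td_error * grad_logp  # actor follows the critic's judgment
            if done:
                break
            s = s_next

    print("learned P(right) per state:", np.round([softmax(l)[1] for l in logits], 2))
    ```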

  • Learning the Policy Directly

    Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, Policy Gradient learns actions directly—no value function required. REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.
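    A minimal REINFORCE sketch of that idea, using a Gaussian policy over a single continuous action on a toy one-step task (the task and parameters are illustrative): the policy mean is updated straight from sampled returns, with no value function anywhere.

    ```python
    # Toy REINFORCE: a Gaussian policy learns a continuous action directly from returns.
    import numpy as np

    rng = np.random.default_rng(0)
    target = 2.0            # unknown to the policy; it only ever sees rewards
    mu, sigma = 0.0, 0.5    # policy: action ~ Normal(mu, sigma); only mu is learned here
    lr = 0.05
    batch = 32

    for step in range(300):
        actions = rng.normal(mu, sigma, size=batch)   # sample actions from the policy
        returns = -(actions - target) ** 2            # observed returns (one-step episodes)
        grad_logp = (actions - mu) / sigma**2         # d log pi(a) / d mu for a Gaussian
        mu += lr * np.mean(returns * grad_logp)       # REINFORCE: ascend E[R * grad log pi]

    print("learned mu:", round(mu, 3), "| target:", target)
    ```

    The same update shape, sample from the policy and weight the log-probability gradient by the return, is what scales up to continuous control and, with a learned preference reward, to RLHF-style fine-tuning.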