Tag: ChatGPT

  • Learning from Human Feedback

    Learning from Human Feedback

    November 2022. ChatGPT launched. 100 million users in 2 months. But GPT-3 existed since 2020—175 billion parameters. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.

  • Learning the Policy Directly

    Learning the Policy Directly

    Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, Policy Gradient learns actions directly—no value function required. REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.