Tag: reward model

  • Learning from Human Feedback

    Learning from Human Feedback

    November 2022. ChatGPT launched. 100 million users in 2 months. But GPT-3 existed since 2020—175 billion parameters. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.