Luca — AI, Coffee & Structural Thinking

Tag: ChatGPT

Learning from Human Feedback

Jan 22, 2026 · by Luca in AI Works

November 2022. ChatGPT launched and reached 100 million users in 2 months. But GPT-3, with 175 billion parameters, had existed since 2020. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. This post covers how human preferences became the reward function.

Learning the Policy Directly

Jan 20, 2026 · by Luca in AI Works

Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, a policy gradient method learns actions directly, with no value function required. This post covers REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.