Tag: PPO

  • Learning from Human Feedback

    November 2022. ChatGPT launched. 100 million users in 2 months. But GPT-3, with its 175 billion parameters, had been around since 2020. Why wasn’t it ChatGPT? The answer: RLHF. Reinforcement Learning from Human Feedback turned a language model into an assistant. How human preferences became the reward function.
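    To make “preferences become the reward function” concrete, here is a minimal sketch, assuming a toy linear reward model and synthetic feature vectors (all names and data below are illustrative, not the actual ChatGPT pipeline): responses annotators preferred are trained to score higher than the ones they rejected via a Bradley-Terry style pairwise loss, and the learned score can then serve as the reward in a PPO fine-tuning loop.

    ```python
    # Toy sketch: turn pairwise human preferences into a scalar reward model.
    # Everything here (features, data, sizes) is an illustrative stand-in.
    import numpy as np

    rng = np.random.default_rng(0)

    dim = 8
    chosen = rng.normal(size=(64, dim)) + 0.5   # responses annotators preferred (toy features)
    rejected = rng.normal(size=(64, dim))       # responses annotators rejected

    w = np.zeros(dim)                           # linear reward model: r(x) = w @ x
    lr = 0.1

    for step in range(200):
        # Bradley-Terry style pairwise loss: -log sigmoid(r(chosen) - r(rejected))
        margin = chosen @ w - rejected @ w
        p = 1.0 / (1.0 + np.exp(-margin))       # P(chosen preferred | current model)
        grad = -((1.0 - p)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad                          # gradient descent on the preference loss

    print("mean reward (chosen):  ", float((chosen @ w).mean()))
    print("mean reward (rejected):", float((rejected @ w).mean()))
    # The learned r(x) = w @ x is what a PPO loop would then maximize.
    ```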

  • The Critic and the Actor

    “I’ll do it this way.” The actor speaks. “That’s not great.” The critic responds. The most successful structure in reinforcement learning separates action and evaluation. Actor-Critic combines value-based efficiency with policy-based flexibility—the foundation of A2C, A3C, PPO, SAC, and ChatGPT’s RLHF.
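    As a rough illustration of that split, the sketch below runs a tabular actor and critic on a toy five-state corridor (the environment and hyperparameters are invented for illustration): the actor proposes actions from a softmax policy, the critic judges them through the TD error, and each side learns from the other.

    ```python
    # Minimal one-step actor-critic on a toy corridor: reach the right end for reward 1.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, gamma, alpha, beta = 5, 0.99, 0.1, 0.2
    logits = np.zeros((n_states, 2))   # actor: action preferences {left, right} per state
    values = np.zeros(n_states)        # critic: estimated state values V(s)

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    for episode in range(500):
        s = 0
        while True:
            probs = softmax(logits[s])
            a = rng.choice(2, p=probs)                 # actor: "I'll do it this way"
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == n_states
            r = 1.0 if done else 0.0                   # reward only at the right end
            v_next = 0.0 if done else values[s_next]
            td_error = r + gamma * v_next - values[s]  # critic: "that's (not) great"
            values[s] += beta * td_error               # critic improves its estimate
            grad_logp = -probs
            grad_logp[a] += 1.0                        # d log pi(a|s) / d logits
            logits[s] += alpha * td_error * grad_logp  # actor follows the critic's judgment
            if done:
                break
            s = s_next

    print("learned P(right) per state:", np.round([softmax(l)[1] for l in logits], 2))
    ```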

  • Learning the Policy Directly

    Don’t calculate value. Just act. Like a basketball player who shoots without computing probabilities, Policy Gradient learns actions directly—no value function required. REINFORCE, continuous action spaces, and why both Physical AI and ChatGPT’s RLHF depend on this approach.
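    A minimal REINFORCE sketch of that idea, using a Gaussian policy over a single continuous action on a toy one-step task (the task and parameters are illustrative): the policy mean is updated straight from sampled returns, with no value function anywhere.

    ```python
    # Toy REINFORCE: a Gaussian policy learns a continuous action directly from returns.
    import numpy as np

    rng = np.random.default_rng(0)
    target = 2.0            # unknown to the policy; it only ever sees rewards
    mu, sigma = 0.0, 0.5    # policy: action ~ Normal(mu, sigma); only mu is learned here
    lr = 0.05
    batch = 32

    for step in range(300):
        actions = rng.normal(mu, sigma, size=batch)   # sample actions from the policy
        returns = -(actions - target) ** 2            # observed returns (one-step episodes)
        grad_logp = (actions - mu) / sigma**2         # d log pi(a) / d mu for a Gaussian
        mu += lr * np.mean(returns * grad_logp)       # REINFORCE: ascend E[R * grad log pi]

    print("learned mu:", round(mu, 3), "| target:", target)
    ```

    The same update shape, sample from the policy and weight the log-probability gradient by the return, is what scales up to continuous control and, with a learned preference reward, to RLHF-style fine-tuning.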