“I’ll do it this way.”
The actor speaks.
“That’s not great.”
The critic responds.
“Then what should I do?”
“You decide. I only evaluate.”
The most successful structure in reinforcement learning
begins with this separation of roles.
One who selects actions.
One who evaluates those actions.
Actor-Critic.
Where Two Paths Converge
We’ve examined two approaches.
Learning a value function.
Calculate Q(s, a) and select the action with the highest value.
High sample efficiency.
But limited in continuous action spaces.
Learning through Policy Gradient.
Represent the policy itself as a neural network and increase the probability of good actions.
Flexible in continuous action spaces.
But high variance and low sample efficiency.
Both methods have strengths and weaknesses.
What if we combined them?
In 1977, Ian Witten first proposed this idea.
In 1983, Barto, Sutton, and Anderson formalized it.
In 2016, DeepMind’s A3C revived this structure.
Today, almost every modern reinforcement learning algorithm
stands on this foundation.
The Actor Acts, the Critic Evaluates
The core of Actor-Critic is role separation.
Actor:
Handles the policy.
Takes state as input and outputs actions.
Decides “what to do.”
Critic:
Handles the value function.
Takes state (or state-action pair) as input and outputs value.
Evaluates “how good that action is.”
Two networks learn simultaneously.
The Actor improves its policy based on the Critic’s feedback.
The Critic refines its value estimates by observing the Actor’s action results.
A common analogy is a child and a mother.
The child (Actor) constantly tries new things.
Putting toys in its mouth, touching the hot oven, banging its head against the wall.
The mother (Critic) watches and either praises or scolds.
The child learns through the mother’s reactions.
The mother adjusts her evaluation criteria by watching the child’s behavior.
TD Error: The Language of Criticism
How does the Critic provide feedback to the Actor?
TD Error (Temporal Difference Error).
δ = r + γV(s') - V(s)
If better than expected, δ > 0.
If worse than expected, δ < 0.
This signal guides the Actor’s learning.
In REINFORCE, we had to wait until the episode ended.
Because it used actual cumulative reward G.
Monte Carlo method.
In Actor-Critic, we can update at every step.
Because the Critic estimates future value.
Bootstrapping.
Learning becomes faster.
Variance decreases.
In exchange, slight bias is introduced.
This is the bias-variance tradeoff.
REINFORCE is unbiased but high-variance; Actor-Critic accepts slight bias for much lower variance.
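The step-by-step update can be sketched in a tiny tabular example. The states, rewards, and learning rates below are made-up toy values; the point is the shape of the two updates: the critic moves V(s) toward the TD target, and the actor nudges its policy in the direction of the TD error.

```python
import math

gamma = 0.99

def td_error(r, v_s, v_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s); bootstrap unless the episode ended."""
    target = r if done else r + gamma * v_next
    return target - v_s

# Critic: a value table. Actor: action preferences turned into a softmax policy.
V = {"s0": 0.0, "s1": 0.0}
prefs = {"left": 0.0, "right": 0.0}  # actor preferences in state s0

def policy(prefs):
    z = {a: math.exp(h) for a, h in prefs.items()}
    total = sum(z.values())
    return {a: v / total for a, v in z.items()}

# One transition: took "right" in s0, got reward 1.0, landed in s1.
alpha_v, alpha_pi = 0.1, 0.1
delta = td_error(r=1.0, v_s=V["s0"], v_next=V["s1"])

V["s0"] += alpha_v * delta            # critic: move toward the TD target
pi = policy(prefs)
for a in prefs:                        # actor: softmax policy-gradient step scaled by delta
    grad = (1.0 if a == "right" else 0.0) - pi[a]
    prefs[a] += alpha_pi * delta * grad

print(round(delta, 3))  # 1.0: better than expected, so "right" is reinforced
```

Note the update happens after a single step, with no waiting for the episode to end.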
The Advantage Function
More refined feedback is possible.
Advantage Function:
A(s, a) = Q(s, a) - V(s)
Q(s, a): Value of taking action a in state s.
V(s): Average value of state s.
If positive, better than average action.
If negative, worse than average action.
It measures relative gain, not absolute value.
In 1993, Leemon Baird introduced the idea in his work on advantage updating.
In 1999, Sutton and colleagues gave it a clear formulation in the policy gradient framework.
Using A(s, a) reduces variance further.
Because it asks not “Is this action good?”
but “Is this action better than alternatives?”
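The definition is easy to verify with made-up numbers. The sketch below uses a hypothetical two-action state: V(s) is the policy-weighted average of the Q-values, so the advantages measure distance from that average and, by construction, average to zero under the policy.

```python
# Toy advantage computation (all numbers are invented for illustration).
Q = {"left": 2.0, "right": 5.0}   # Q(s, a): value of each action in state s
pi = {"left": 0.5, "right": 0.5}  # current policy in state s

V = sum(pi[a] * Q[a] for a in Q)  # V(s) = expected Q under the policy
A = {a: Q[a] - V for a in Q}      # A(s, a) = Q(s, a) - V(s)

print(V)  # 3.5
print(A)  # {'left': -1.5, 'right': 1.5}: right is above average, left below
```

In practice the TD error delta = r + gamma V(s') - V(s) is used as a one-sample estimate of A(s, a), so no separate Q-network is required.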
A2C and A3C
In 2016, DeepMind published A3C (Asynchronous Advantage Actor-Critic).
Multiple agents learn in parallel.
Each explores in a different environment.
Periodically synchronizes with the global network.
Asynchronous updates are the key feature.
They reduce correlation between data.
Stable learning without experience replay.
Achieved state-of-the-art performance
on Atari games and continuous control tasks.
A2C (Advantage Actor-Critic) is the synchronous version.
All agents update simultaneously.
Simpler implementation.
Better GPU utilization.
According to OpenAI’s research,
A2C performed as well as or better than A3C.
The noise from asynchrony
didn’t actually contribute to performance gains.
PPO: The Stability Revolution
Several improvements were built
on the Actor-Critic structure.
TRPO (Trust Region Policy Optimization, 2015):
Limits the magnitude of policy updates.
Prevents drastic changes for stability.
But complex to implement.
PPO (Proximal Policy Optimization, 2017):
Simplified TRPO’s idea.
Limits update magnitude through clipping.
Easy to implement with good performance.
This is why PPO became the most widely used algorithm.
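The clipping idea fits in a few lines. This is a sketch of the per-sample clipped surrogate objective, not any library's actual implementation: the probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps], so a single update cannot move the policy too far.

```python
def ppo_objective(ratio, advantage, eps=0.2):
    """L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s); eps = 0.2 is the common default.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

print(ppo_objective(1.0, 2.0))   # 2.0: unchanged policy, plain advantage
print(ppo_objective(1.5, 2.0))   # 2.4: a large increase is capped at 1.2 * A
print(ppo_objective(0.5, -2.0))  # -1.6: the min keeps the more pessimistic term
```

The outer min makes the objective pessimistic: the policy gains nothing from pushing the ratio beyond the clip range, which is what keeps updates stable.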
OpenAI’s robot hand used PPO to solve the Rubik’s Cube.
ChatGPT’s RLHF also relies on PPO at its core.
Balance of stability and performance.
The philosophy of Actor-Critic reached its peak in PPO.
SAC: Encouraging Exploration
SAC (Soft Actor-Critic, 2018).
Added entropy maximization to traditional Actor-Critic.
Not just maximizing reward,
but also maximizing policy randomness (entropy).
Why?
To encourage exploration.
Converging too quickly to one action
might miss better strategies.
SAC is an off-policy algorithm.
Can use experience replay.
High sample efficiency.
Outperforms PPO in robotics
and continuous control tasks.
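The entropy bonus can be illustrated with a toy discrete policy (SAC proper uses continuous Gaussian policies; the numbers here are invented). With the bonus, a policy that keeps some randomness can score higher than a greedy one, even when its expected reward is slightly lower.

```python
import math

def entropy(pi):
    """Shannon entropy H(pi) of a discrete policy."""
    return -sum(p * math.log(p) for p in pi.values() if p > 0)

def soft_objective(pi, rewards, alpha=0.1):
    """SAC-style objective: E[r] + alpha * H(pi)."""
    expected_r = sum(pi[a] * rewards[a] for a in pi)
    return expected_r + alpha * entropy(pi)

rewards = {"a": 1.0, "b": 0.9}    # two nearly-as-good actions
greedy = {"a": 1.0, "b": 0.0}     # commits fully to the best action
mixed = {"a": 0.7, "b": 0.3}      # keeps exploring

print(soft_objective(greedy, rewards))            # 1.0: no entropy bonus
print(soft_objective(mixed, rewards) > 1.0)       # True: randomness pays
```

The temperature alpha controls the tradeoff: larger alpha rewards randomness more, smaller alpha recovers plain reward maximization.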
SAC is also used in drug design.
When learning molecule generation policies,
it creates active molecules
while maintaining structural diversity.
Why Actor-Critic?
To summarize:
| Aspect | Value-Based | Policy-Based | Actor-Critic |
|---|---|---|---|
| Learning target | Q(s, a) | π(a\|s) | Both |
| Action space | Discrete | Continuous | Both |
| Sample efficiency | High | Low | Medium-High |
| Variance | Low | High | Low |
| Stability | Unstable | Stable | Stable |
Actor-Critic combines the best of both worlds.
The Critic reduces variance.
The Actor handles continuous action spaces.
Both learn together and improve each other.
Summary
Action without criticism is blind.
Criticism without action is empty.
Actor-Critic separates and combines these two roles.
The Actor learns the policy.
The Critic estimates value.
TD error connects them.
A2C, A3C, PPO, SAC—
almost every advancement in modern reinforcement learning
has been built on this structure.
And this structure is what made ChatGPT
a model that generates “good responses.”