“I’ll do it this way.”
The actor speaks.
“That’s not great.”
The critic responds.
“Then what should I do?”
“You decide. I only evaluate.”
The most successful structure in reinforcement learning
begins with this separation of roles.
One who selects actions.
One who evaluates those actions.
Actor-Critic.
Where Two Paths Converge
We’ve examined two approaches.
Learning a value function.
Calculate Q(s, a) and select the action with the highest value.
High sample efficiency.
But limited in continuous action spaces.
Learning through Policy Gradient.
Represent the policy itself as a neural network and increase the probability of good actions.
Flexible in continuous action spaces.
But high variance and low sample efficiency.
Both methods have strengths and weaknesses.
What if we combined them?
In 1977, Ian Witten first proposed this idea.
In 1983, Barto, Sutton, and Anderson formalized it.
In 2016, DeepMind’s A3C revived this structure.
Today, almost every modern reinforcement learning algorithm
stands on this foundation.
The Actor Acts, the Critic Evaluates
The core of Actor-Critic is role separation.
Actor:
Handles the policy.
Takes state as input and outputs actions.
Decides “what to do.”
Critic:
Handles the value function.
Takes state (or state-action pair) as input and outputs value.
Evaluates “how good that action is.”
Two networks learn simultaneously.
The Actor improves its policy based on the Critic’s feedback.
The Critic refines its value estimates by observing the Actor’s action results.
A common analogy is a child and a mother.
The child (Actor) constantly tries new things.
Putting toys in its mouth, touching the hot oven, banging its head against the wall.
The mother (Critic) watches and either praises or scolds.
The child learns through the mother’s reactions.
The mother adjusts her evaluation criteria by watching the child’s behavior.
TD Error: The Language of Criticism
How does the Critic provide feedback to the Actor?
TD Error (Temporal Difference Error).
δ = r + γV(s') - V(s)
If better than expected, δ > 0.
If worse than expected, δ < 0.
This signal guides the Actor’s learning.
In REINFORCE, we had to wait until the episode ended.
Because it used actual cumulative reward G.
Monte Carlo method.
In Actor-Critic, we can update at every step.
Because the Critic estimates future value.
Bootstrapping.
Learning becomes faster.
Variance decreases.
In exchange, slight bias is introduced.
This is the bias-variance tradeoff.
REINFORCE is unbiased but high-variance; Actor-Critic accepts slight bias for much lower variance.
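The step-by-step update can be sketched in a tiny tabular example. The states, rewards, and learning rates below are made-up toy values; the point is the shape of the two updates: the critic moves V(s) toward the TD target, and the actor nudges its policy in the direction of the TD error.

```python
import math

gamma = 0.99

def td_error(r, v_s, v_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s); bootstrap unless the episode ended."""
    target = r if done else r + gamma * v_next
    return target - v_s

# Critic: a value table. Actor: action preferences turned into a softmax policy.
V = {"s0": 0.0, "s1": 0.0}
prefs = {"left": 0.0, "right": 0.0}  # actor preferences in state s0

def policy(prefs):
    z = {a: math.exp(h) for a, h in prefs.items()}
    total = sum(z.values())
    return {a: v / total for a, v in z.items()}

# One transition: took "right" in s0, got reward 1.0, landed in s1.
alpha_v, alpha_pi = 0.1, 0.1
delta = td_error(r=1.0, v_s=V["s0"], v_next=V["s1"])

V["s0"] += alpha_v * delta            # critic: move toward the TD target
pi = policy(prefs)
for a in prefs:                        # actor: softmax policy-gradient step scaled by delta
    grad = (1.0 if a == "right" else 0.0) - pi[a]
    prefs[a] += alpha_pi * delta * grad

print(round(delta, 3))  # 1.0: better than expected, so "right" is reinforced
```

Note the update happens after a single step, with no waiting for the episode to end.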
The Advantage Function
More refined feedback is possible.
Advantage Function:
A(s, a) = Q(s, a) - V(s)
Q(s, a): Value of taking action a in state s.
V(s): Average value of state s.
If positive, better than average action.
If negative, worse than average action.
It measures relative gain, not absolute value.
In 1993, Leemon Baird introduced the idea in his work on advantage updating.
In 1999, Sutton and colleagues gave it a clear formulation in the policy gradient framework.
Using A(s, a) reduces variance further.
Because it asks not “Is this action good?”
but “Is this action better than alternatives?”
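The definition is easy to verify with made-up numbers. The sketch below uses a hypothetical two-action state: V(s) is the policy-weighted average of the Q-values, so the advantages measure distance from that average and, by construction, average to zero under the policy.

```python
# Toy advantage computation (all numbers are invented for illustration).
Q = {"left": 2.0, "right": 5.0}   # Q(s, a): value of each action in state s
pi = {"left": 0.5, "right": 0.5}  # current policy in state s

V = sum(pi[a] * Q[a] for a in Q)  # V(s) = expected Q under the policy
A = {a: Q[a] - V for a in Q}      # A(s, a) = Q(s, a) - V(s)

print(V)  # 3.5
print(A)  # {'left': -1.5, 'right': 1.5}: right is above average, left below
```

In practice the TD error delta = r + gamma V(s') - V(s) is used as a one-sample estimate of A(s, a), so no separate Q-network is required.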
A2C and A3C
In 2016, DeepMind published A3C (Asynchronous Advantage Actor-Critic).
Multiple agents learn in parallel.
Each explores in a different environment.
Periodically synchronizes with the global network.
Asynchronous updates are the key feature.
They reduce correlation between data.
Stable learning without experience replay.
Achieved state-of-the-art performance
on Atari games and continuous control tasks.
A2C (Advantage Actor-Critic) is the synchronous version.
All agents update simultaneously.
Simpler implementation.
Better GPU utilization.
According to OpenAI’s research,
A2C performed as well as or better than A3C.
The noise from asynchrony
didn’t actually contribute to performance gains.
PPO: The Stability Revolution
Several improvements were built
on the Actor-Critic structure.
TRPO (Trust Region Policy Optimization, 2015):
Limits the magnitude of policy updates.
Prevents drastic changes for stability.
But complex to implement.
PPO (Proximal Policy Optimization, 2017):
Simplified TRPO’s idea.
Limits update magnitude through clipping.
Easy to implement with good performance.
This is why PPO became the most widely used algorithm.
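The clipping idea fits in a few lines. This is a sketch of the per-sample clipped surrogate objective, not any library's actual implementation: the probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps], so a single update cannot move the policy too far.

```python
def ppo_objective(ratio, advantage, eps=0.2):
    """L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s); eps = 0.2 is the common default.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

print(ppo_objective(1.0, 2.0))   # 2.0: unchanged policy, plain advantage
print(ppo_objective(1.5, 2.0))   # 2.4: a large increase is capped at 1.2 * A
print(ppo_objective(0.5, -2.0))  # -1.6: the min keeps the more pessimistic term
```

The outer min makes the objective pessimistic: the policy gains nothing from pushing the ratio beyond the clip range, which is what keeps updates stable.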
OpenAI’s robot hand used PPO to solve the Rubik’s Cube.
ChatGPT’s RLHF also relies on PPO at its core.
Balance of stability and performance.
The philosophy of Actor-Critic reached its peak in PPO.
SAC: Encouraging Exploration
SAC (Soft Actor-Critic, 2018).
Added entropy maximization to traditional Actor-Critic.
Not just maximizing reward,
but also maximizing policy randomness (entropy).
Why?
To encourage exploration.
Converging too quickly to one action
might miss better strategies.
SAC is an off-policy algorithm.
Can use experience replay.
High sample efficiency.
Outperforms PPO in robotics
and continuous control tasks.
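The entropy bonus can be illustrated with a toy discrete policy (SAC proper uses continuous Gaussian policies; the numbers here are invented). With the bonus, a policy that keeps some randomness can score higher than a greedy one, even when its expected reward is slightly lower.

```python
import math

def entropy(pi):
    """Shannon entropy H(pi) of a discrete policy."""
    return -sum(p * math.log(p) for p in pi.values() if p > 0)

def soft_objective(pi, rewards, alpha=0.1):
    """SAC-style objective: E[r] + alpha * H(pi)."""
    expected_r = sum(pi[a] * rewards[a] for a in pi)
    return expected_r + alpha * entropy(pi)

rewards = {"a": 1.0, "b": 0.9}    # two nearly-as-good actions
greedy = {"a": 1.0, "b": 0.0}     # commits fully to the best action
mixed = {"a": 0.7, "b": 0.3}      # keeps exploring

print(soft_objective(greedy, rewards))            # 1.0: no entropy bonus
print(soft_objective(mixed, rewards) > 1.0)       # True: randomness pays
```

The temperature alpha controls the tradeoff: larger alpha rewards randomness more, smaller alpha recovers plain reward maximization.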
SAC is also used in drug design.
When learning molecule generation policies,
it creates active molecules
while maintaining structural diversity.
Why Actor-Critic?
To summarize:
| Aspect | Value-Based | Policy-Based | Actor-Critic |
|---|---|---|---|
| Learning target | Q(s, a) | π(a\|s) | Both |
| Action space | Discrete | Continuous | Both |
| Sample efficiency | High | Low | Medium-High |
| Variance | Low | High | Low |
| Stability | Unstable | Stable | Stable |
Actor-Critic combines the best of both worlds.
The Critic reduces variance.
The Actor handles continuous action spaces.
Both learn together and improve each other.
Summary
Action without criticism is blind.
Criticism without action is empty.
Actor-Critic separates and combines these two roles.
The Actor learns the policy.
The Critic estimates value.
TD error connects them.
A2C, A3C, PPO, SAC—
almost every advancement in modern reinforcement learning
has been built on this structure.
And this structure is what made ChatGPT
a model that generates “good responses.”