[Illustration: a split-panel comparison of reinforcement learning approaches. On the left, a retro robot calculates with a spreadsheet thought bubble at a fork in the road (value-based methods). On the right, a neural network-powered robot arm smoothly shoots a basketball into a hoop (policy gradient).]

Learning the Policy Directly

Don’t calculate value.
Just act.

When a basketball player shoots,
they don’t calculate the probability of scoring.
Wrist angle, knee bend, release timing—
they simply execute the motion their body remembers.

Reinforcement learning has the same approach.

A method that learns the policy directly,
without going through value functions.

Policy Gradient.


The Limits of Value-Based Methods

Q-learning is powerful.
But it has structural limitations.

First, it’s optimized for discrete actions.

Cases where choices are clearly separated,
like “go left” or “go right.”

Consider a robot arm.
It has 6 joints.
Each joint can rotate from 0 to 180 degrees.
Infinite combinations are possible.

Can you store Q-values for every joint angle combination?
Impossible.
In continuous action spaces, Q-learning falls into the curse of dimensionality.

Self-driving cars face the same problem.
Turn the wheel 15 degrees, or 17.2 degrees, or 19.4 degrees?
Honk the horn?
Q-learning needs a Q-value for every one of these actions.
And finding the maximum over a continuous output becomes an optimization problem in itself.

Second, it creates deterministic policies.

Q-learning’s policy is simple:
select the action with the highest Q-value.

π(s) = argmax_a Q(s, a)

Always the same action in the same state.
Exploration becomes difficult.
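As a sketch, this argmax policy fits in a few lines of Python. The Q-table, state, and action names here are hypothetical, purely for illustration:

```python
# Hypothetical Q-table: each state maps to a dict of action values.
Q = {
    "fork": {"left": 1.2, "right": 0.8},
}

def greedy_policy(Q, state):
    """Deterministic policy: always pick the action with the highest Q-value."""
    actions = Q[state]
    return max(actions, key=actions.get)

print(greedy_policy(Q, "fork"))  # always "left", every single time
```

The same state always yields the same action. Nothing in this function can explore.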

Consider rock-paper-scissors.
“Always rock” is the worst strategy.
Once your opponent detects the pattern, it’s over.
You need to mix randomly to win.

Sometimes a stochastic policy is necessary.

Third, the perceptual aliasing problem.

Sometimes two states look identical
but require different actions.

Consider a vacuum cleaner robot.
Its goal: suck up dust and avoid hamsters.
Two different spots give its sensor the exact same reading: a wall on the left.
But one spot requires going right, and the other going forward.

A deterministic policy makes the same choice in both spots,
so in one of them it keeps repeating the wrong move.
A stochastic policy mixes its choices.
It won't get stuck in dead ends.


Parameterizing the Policy Directly

Policy Gradient inverts the approach.

Instead of learning a value function and deriving a policy from it,
represent the policy itself as a neural network.

π_θ(a|s) = probability of selecting action a in state s

θ represents the neural network’s parameters.
Adjust these parameters to create a better policy.

Input: state
Output: probability distribution over actions

The agent samples actions according to this probability.
Different actions can emerge from the same state.
Exploration is naturally built in.

No need to store a value for every action.
The policy is estimated directly, with no value function in between.
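A minimal sketch of such a policy in Python, assuming a toy one-layer "network" built with numpy. The state vector and layer sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy "network": one linear layer with parameters theta.
# Input: a 4-dimensional state. Output: probabilities over 3 actions.
theta = rng.normal(size=(4, 3))

def policy(state, theta):
    logits = state @ theta
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

state = np.array([1.0, 0.5, -0.2, 0.3])
probs = policy(state, theta)

# Sample an action: the same state can yield different actions.
action = rng.choice(3, p=probs)
```

Because the output is a distribution, sampling from it gives exploration for free.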


The Objective Function

What do we optimize?

Expected cumulative reward.

J(θ) = E[R | π_θ]

The expected total reward when following policy π_θ.

The goal is to find θ that maximizes this value.

In neural network training, we minimized a loss function.
In Policy Gradient, we maximize an objective function.

Only the direction is reversed. The principle is the same.
Climb along the gradient. Gradient Ascent.
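A toy illustration of gradient ascent, using a made-up objective J(θ) = -(θ - 3)², whose maximum sits at θ = 3:

```python
# Gradient ascent on J(theta) = -(theta - 3)**2.
# Its gradient is -2 * (theta - 3); the maximum is at theta = 3.
theta, lr = 0.0, 0.1
for _ in range(200):
    grad = -2 * (theta - 3)
    theta += lr * grad      # ascend: ADD the gradient (descent would subtract)
print(round(theta, 3))      # approaches 3.0
```

The only change from gradient descent is the sign of the update.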

Convergence properties also differ.
Q-learning can be unstable under function approximation.
Overestimation bias and temporal correlation issues.

Policy Gradient, because it follows the gradient of its own objective,
is guaranteed to converge to at least a local optimum
given a sufficiently small learning rate.


The Policy Gradient Theorem

How do we compute the gradient of J(θ)?

There’s a problem.
J(θ) depends not only on actions determined by the policy,
but also on the state distribution resulting from those actions.
The environment is unknown.
It’s difficult to estimate how policy updates affect the state distribution.

In 1992, Ronald Williams showed the way with REINFORCE,
later formalized as the Policy Gradient Theorem by Sutton and colleagues.

Policy Gradient Theorem:

∇J(θ) = E[∇log π_θ(a|s) · R]

It looks complex, but the meaning is simple:

∇log π_θ(a|s): the direction that increases the probability of action a
R: the reward that action brought

If the reward is high, increase that action’s probability.
If the reward is low, decrease that action’s probability.

Intuitive.
Do more of what worked well.
Do less of what worked poorly.

The gradient can be estimated from experience alone, without an environment model.
This is the key insight.


The REINFORCE Algorithm

The most basic implementation of Policy Gradient
is the REINFORCE algorithm.

1. Run a complete episode with the current policy
2. Record actions and cumulative rewards at each timestep
3. Update parameters using the policy gradient theorem
4. Repeat
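The loop above can be sketched on a toy problem. Here is a minimal REINFORCE, assuming a stateless two-armed bandit where each "episode" is a single pull; the arm payouts and learning rate are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit: arm 1 pays more on average.
true_means = np.array([0.2, 0.8])

theta = np.zeros(2)  # one logit per action (stateless policy)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)   # one-step "episode", noisy reward
    # Gradient of log softmax: one-hot(a) - probs
    grad_log = -probs
    grad_log[a] += 1.0
    theta += lr * r * grad_log           # REINFORCE update: reward-weighted

final_probs = softmax(theta)
print(final_probs)  # probability mass shifts toward the better arm
```

Actions that earned more reward get their log-probability pushed up, exactly as the theorem prescribes.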

The key is that it’s a Monte Carlo method.
Wait until the episode ends.
Use actually received rewards.

Unlike TD learning, it doesn’t use estimates from one step ahead.
No bootstrapping.
No estimation error.
But learning is slow if episodes are long.

It’s also an on-policy method.
Only data collected with the current policy can be used.
Unlike Q-learning, it can't reuse past data through experience replay.
Sample efficiency is low.

But there are advantages.
No large replay buffer needed.
Lower memory usage.


The Variance Problem

REINFORCE is simple,
but has a critical weakness.

High variance.

Even with the same policy, rewards can vary greatly between episodes.
Environmental randomness, action sampling randomness.

Imagine a shooting game.
Even with the same strategy,
results differ based on enemy spawn locations and item placement.

Because gradients are estimated from rollouts,
variance can become extremely high.

If gradient estimates fluctuate wildly,
learning becomes unstable.
Convergence slows down.

One solution is a baseline.

∇J(θ) = E[∇log π_θ(a|s) · (R - b)]

Subtract a baseline value b from the reward.
Positive if better than average, negative if worse.

Judging by relative performance instead of absolute reward reduces variance.
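A quick numerical sketch of why this helps, using a made-up two-action setting where rewards are large but nearly identical. Subtracting the mean reward as a baseline leaves the expected gradient unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)

probs = np.array([0.5, 0.5])
# Rewards are large but similar: action 1 is only slightly better.
means = np.array([100.0, 101.0])

def grad_estimates(baseline, n=5000):
    ests = []
    for _ in range(n):
        a = rng.choice(2, p=probs)
        r = means[a]
        g = -probs                  # gradient of log softmax...
        g = g.copy()
        g[a] += 1.0                 # ...for the sampled action
        ests.append((r - baseline) * g)
    return np.array(ests)

no_base = grad_estimates(0.0)
with_base = grad_estimates(means.mean())   # b = average return

# Same expected gradient, far lower variance with the baseline.
print(no_base.mean(axis=0), with_base.mean(axis=0))
print(no_base.var(axis=0).sum(), with_base.var(axis=0).sum())
```

This works because E[∇log π] = 0, so subtracting a constant b adds no bias.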

And if we use a value function as this baseline?

A structure that learns both policy and value together emerges.
The beginning of Actor-Critic.


Continuous Action Spaces

Policy Gradient’s true strength is
handling continuous actions.

Robot arm joint angles.
Car steering angles.
Drone thrust.

The policy network outputs mean and variance,
and samples actions from a Gaussian distribution.

π_θ(a|s) = N(μ_θ(s), σ_θ(s)²)

An area impossible for Q-learning.
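A minimal sketch of sampling one joint angle from such a Gaussian policy. The "network" here is a made-up function standing in for a trained model, with an invented state vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical policy head for one joint: mean and std computed from the state.
def policy_head(state):
    mu = np.tanh(state.sum()) * 90 + 90   # mean angle squashed into [0, 180]
    sigma = 5.0                           # exploration noise, in degrees
    return mu, sigma

state = np.array([0.1, -0.3, 0.7])
mu, sigma = policy_head(state)
action = rng.normal(mu, sigma)            # sample a continuous joint angle
action = np.clip(action, 0.0, 180.0)      # respect the joint's physical limits
```

No table of Q-values anywhere, and no argmax over an infinite set of angles.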

In 2019, OpenAI successfully solved a Rubik’s Cube with a robot hand.
The Shadow Dexterous Hand with 24 degrees of freedom.
Each joint’s fine movements had to be controlled.

The algorithm used was PPO (Proximal Policy Optimization).
An advanced form of Policy Gradient.
It mitigates REINFORCE's high variance
and greatly improves training stability.

PPO was also used in ChatGPT’s RLHF (Reinforcement Learning from Human Feedback).
Policy Gradient plays a central role
in aligning language model outputs with human preferences.

Most Physical AI systems use this approach.


Balancing Exploration and Exploitation

Stochastic policies provide natural exploration.

Q-learning needed techniques like ε-greedy.
“Sometimes take random actions.”
The balance between exploration and exploitation had to be manually tuned.
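For contrast, ε-greedy might look like this minimal sketch, with exploration bolted on by hand:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Manual exploration: random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))               # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

The epsilon knob has to be chosen, and often decayed, by the practitioner.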

In Policy Gradient, the policy itself is a probability distribution.
Exploration is built in.

Early in training, variance is high.
Various actions are attempted.

As training progresses, variance decreases.
Probability concentrates on good actions.

The transition from exploration to exploitation
is naturally embedded in the algorithm.

Adding entropy regularization can further encourage exploration.
SAC (Soft Actor-Critic) uses this approach.
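A small sketch of the entropy term itself. The bonus coefficient β below is a hypothetical hyperparameter, written only to show where the term enters the objective:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a categorical policy (in nats)."""
    p = np.asarray(probs, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

# A spread-out policy has higher entropy than a peaked one, so adding
# a bonus like  J(θ) + β · H(π_θ)  rewards keeping options open.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 1.386 (= ln 4, maximal)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # much lower: nearly deterministic
```

Maximizing this bonus pushes back against premature collapse onto one action.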


Value or Policy?

Comparing the two approaches structurally:

| Aspect | Value-Based | Policy-Based |
| --- | --- | --- |
| Learning target | Q(s, a) | π(a\|s) |
| Action space | Discrete | Continuous possible |
| Policy type | Deterministic | Stochastic possible |
| Sample efficiency | High (experience reuse) | Low (on-policy) |
| Convergence stability | Can be unstable | More stable (gradient-based) |
| Memory usage | High (replay buffer) | Low |
| Variance | Low | High |

Which is better?

Both are right, and both are insufficient.

Value-based uses samples efficiently,
but struggles with continuous actions and stochastic policies.
It has overestimation bias problems.

Policy-based is flexible,
but requires many samples and has high variance.
It can get stuck in local optima.

So there’s a next step.
A method that combines both.


Summary

You can learn actions without calculating value.

Policy Gradient represents the policy itself as a neural network,
learning in the direction that increases good actions’ probability.

REINFORCE is simple but has high variance.
Baselines mitigate this.

It shines in continuous action spaces.
OpenAI’s Rubik’s Cube robot, autonomous vehicles, drone control.
Even ChatGPT’s RLHF.
The core algorithm for both Physical AI and LLMs.

But a dilemma remains.
The efficiency of value and the flexibility of policy.

The next part resolves this dilemma.

