$10 tomorrow,
or $9 today.
Which is more valuable?
Most would say “$9 today.”
Tomorrow is uncertain. Today is certain.
But the moment you ask “What if tomorrow were certain?”—
the answer wavers.
This is the core question of reinforcement learning.
How do we convert future rewards into present numbers?
Reward Alone Is Not Enough
In reward-driven learning,
agents interact with environments and learn through feedback.
But reward has limitations.
Consider chess.
You only know if you won or lost when the game ends.
The moves in between receive no immediate reward.
So how does the agent know whether a move 30 turns ago was good or bad?
If we could express “how good is this state?” as a number,
judgment would become possible even without immediate reward.
That number is value.
The State Value Function
The value function predicts
the total future rewards starting from a specific state.
Expressed mathematically:
V(s) = E[R_{t+1} + R_{t+2} + R_{t+3} + ... | S_t = s]
Starting from state s,
the expected sum of all future rewards.
But there’s a problem.
If the future extends infinitely, the sum can grow without bound.
So we introduce a discount factor (γ).
V(s) = E[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s]
γ is a value between 0 and 1.
If it’s 0.9, a reward one step later counts for only 90%.
Two steps later: 81%. Three steps later: about 73%.
The further into the future, the less the value.
The same intuition as “$9 today is better than $10 tomorrow.”
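The discounting above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `discounted_return` and the reward sequences are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the k-th future reward by gamma**k."""
    total = 0.0
    # Walk backwards so each step is one multiply-add: G = r + gamma * G_next
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Weights applied to rewards arriving 0, 1, 2, 3 steps after the first one
weights = [0.9 ** k for k in range(4)]
print([round(w, 3) for w in weights])  # [1.0, 0.9, 0.81, 0.729]

# Hypothetical sequence: nothing now, 10 one step later
print(discounted_return([0, 10], gamma=0.9))
```

With γ = 0.9, a reward of 10 one step away is worth about 9 now, which is the "$9 today" intuition expressed as arithmetic.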
The Bellman Equation
In the 1950s, Richard Bellman discovered
a principle for solving complex decision problems.
The Bellman Equation:
V(s) = E[R + γV(s')]
The value of the current state equals
the expected immediate reward + discounted value of the next state.
This simple equation is the backbone of reinforcement learning.
Why is it powerful?
No need to calculate the entire future.
Just look one step ahead.
If you know the value of the next state, you can calculate the current state’s value.
A recursive structure.
Breaking complex problems into small pieces.
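To see the recursion at work, here is a small sketch that applies the one-step backup repeatedly on a toy problem. The three-state chain, its rewards, and the function name `evaluate` are assumptions made for illustration, and the chain is deterministic, so no expectation is needed.

```python
GAMMA = 0.9

# Hypothetical deterministic chain: state -> (reward, next_state).
# "end" has no entry, so it is treated as terminal with value 0.
transitions = {
    "s0": (0.0, "s1"),
    "s1": (0.0, "s2"),
    "s2": (1.0, "end"),
}

def evaluate(transitions, gamma, sweeps=50):
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, (reward, next_state) in transitions.items():
            # Bellman backup: V(s) = R + gamma * V(s')
            values[s] = reward + gamma * values.get(next_state, 0.0)
    return values

values = evaluate(transitions, GAMMA)
print(values)
```

Each state only ever looks at its successor, yet after a few sweeps the reward at the end of the chain has propagated back to `s0`, discounted once per step.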
The Action-Value Function: Q-Values
Sometimes state value alone isn’t enough.
“This state is good”—that’s understood.
But “what action should I take to reach this state?” is a different question.
That’s why we need the action-value function.
Q(s, a) = Expected value of taking action a in state s
Q-values evaluate state-action pairs.
V(s) asks “How good is this position?”
Q(s, a) asks “How good is this action at this position?”
The Bellman optimality equation for Q-values:
Q(s, a) = E[R + γ max Q(s', a')]
The max runs over the actions a' available in the next state.
It uses the Q-value of the best one, assuming an optimal future.
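As a rough sketch, the right-hand side of that equation for a single observed transition might look like this. The function name `bellman_q_target` and the Q-values are hypothetical, chosen only to make the arithmetic visible.

```python
def bellman_q_target(reward, next_q_values, gamma):
    """R + gamma * max over a' of Q(s', a'); a terminal next state contributes 0."""
    best_next = max(next_q_values.values()) if next_q_values else 0.0
    return reward + gamma * best_next

# Hypothetical Q-values for the two actions available in the next state
next_q = {"left": 2.0, "right": 5.0}
print(bellman_q_target(reward=1.0, next_q_values=next_q, gamma=0.9))  # 5.5
```

Only the best next action ("right", worth 5.0) enters the result; the weaker option is ignored.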
TD Learning: The Power of One Step
Supervised learning, via backpropagation, adjusts weights
using the difference between prediction and answer.
Reinforcement learning needs a similar mechanism.
But there is no “answer.”
Temporal Difference Learning (TD) solves this.
The core idea:
Use “better estimates” instead of answers.
V(s) ← V(s) + α[R + γV(s') - V(s)]
R + γV(s') is the TD target.
Actual reward one step later + predicted value of next state.
R + γV(s') - V(s) is the TD error.
The difference between prediction and new information.
Update the value gradually by this difference.
The difference from backpropagation:
Supervised training computes the error once the final answer is known.
TD only waits one step.
No need to wait until the game ends.
Learning is possible at every moment.
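A single TD update can be written as a short function. This is a sketch under assumptions: the value table is a plain dictionary, and the state names, reward, and default step size are invented for the example.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the TD error."""
    # TD target: observed reward plus discounted estimate of the next state
    td_target = reward + gamma * values.get(next_state, 0.0)
    # TD error: how far the current estimate is from that target
    td_error = td_target - values.get(state, 0.0)
    # Move the estimate a fraction alpha of the way toward the target
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error

# Hypothetical two-state value table
values = {"A": 0.0, "B": 1.0}
error = td0_update(values, "A", reward=0.5, next_state="B")
print(error, values["A"])
```

Nothing here waits for the episode to finish: one observed transition is enough to nudge V("A").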
Q-Learning
Q-learning applies TD to Q-values.
Q(s, a) ← Q(s, a) + α[R + γ max Q(s', a') - Q(s, a)]
Introduced by Christopher Watkins in 1989; the convergence proof with Peter Dayan followed in 1992.
And in 2013, DeepMind combined this algorithm with deep learning.
DQN (Deep Q-Network).
A model that, by 2015, played many Atari games at or above human level.
The moment reinforcement learning shifted
from “a technique for solving games”
to “a possibility for general intelligence.”
How Q-Learning Works
Q-learning is an off-policy algorithm.
What does that mean?
The action the agent actually takes
can differ from the action assumed in the learning target.
Even if the agent takes random actions for exploration,
learning assumes “what if I had taken the optimal action?”
max Q(s', a')
This term is the key.
Assuming the “best choice” in the next state to calculate value.
You don’t have to actually take that action.
Learning the optimal path in imagination.
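The off-policy idea can be illustrated with a tabular sketch. Everything about the environment here is an assumption for the example: a five-state corridor, a reward of 1 at the right end, and an agent that behaves purely at random while its updates still use the max over next actions.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)    # move left or right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Hypothetical toy environment: reward 1 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # behavior: purely random
            next_state, reward, done = step(state, action)
            # Target: assume the best action is taken from next_state onward
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            td_error = reward + GAMMA * best_next - q[(state, action)]
            q[(state, action)] += ALPHA * td_error
            state = next_state
    return q

q = train()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)
```

Although the agent never deliberately walks toward the goal, the greedy action read off the learned table should be +1 in every non-terminal state. That is the sense in which the optimal path is learned "in imagination."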
Convergence of Value
As Q-learning iterates,
Q-values gradually approach the optimal Q-value (Q*).
With conditions:
Every state-action pair must keep being visited, infinitely often in the limit.
The learning rate must decay, but not too fast: its sum must diverge while the sum of its squares stays finite.
When these conditions are met,
Q(s, a) converges to the true optimal value.
A mathematically proven fact.
But in reality, the state space is too large.
Possible chess positions: roughly 10^43, by Shannon’s estimate.
You can’t store all Q-values in a table.
That’s why deep learning is needed.
Neural networks approximate Q-values.
When Value Becomes Action
With Q-values, policy becomes simple.
π(s) = argmax Q(s, a), taken over actions a
Select the action with the highest Q-value in each state.
If Q is the optimal Q*, that’s the optimal policy.
If you know value, you know action.
Compress the future into numbers, and present choices become clear.
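Reading a policy off a Q-table is a small amount of code. The sketch below assumes the table is a dictionary keyed by (state, action); the function name `greedy_policy` and the values are hypothetical.

```python
def greedy_policy(q_table):
    """Map each state to the action with the highest Q-value."""
    best = {}
    for (state, action), value in q_table.items():
        # Keep the action only if it beats the best one seen so far for this state
        if state not in best or value > q_table[(state, best[state])]:
            best[state] = action
    return best

# Hypothetical Q-table keyed by (state, action)
q_table = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.9, ("s1", "right"): 0.4,
}
print(greedy_policy(q_table))  # {'s0': 'right', 's1': 'left'}
```

The policy is only as good as the table behind it: with approximate Q-values, the greedy choice is an approximation too.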
Summary
Reward is momentary.
Value includes the future.
The Bellman equation decomposes the future recursively.
TD learning updates value with just one step.
Q-learning opens the path to finding optimal actions.
But one paradox remains.
If we knew the future precisely, we wouldn’t need value functions.
We calculate because we don’t know.
Uncertainty is the reason value exists.

