The second breakthrough: reinforcement learning
Note: The author gives different examples in the book. I'll give a simple one.
Consider a dog that must perform a sequence of 7 actions before receiving a reward. It must first press a lever, then run through a tunnel, and perform several other actions before finally choosing the green ball over red to obtain a piece of chocolate.
Suppose only the final action (the dog picking the green ball over the red one) is reinforced, because it immediately preceded the reward. But the dog would never have reached the final stage if it had not performed the earlier actions correctly.
For example, if the dog had not pressed the lever at the beginning, it might never have been able to enter the tunnel and complete the remaining steps. In that sense, the first action was just as important as the last. If only the final action is reinforced, the dog will not learn the sequence of steps that led to the reward. Essentially, the reward is not the result of a single step executed correctly; it is the result of a sequence. So how should credit be distributed?
This became known as the temporal credit assignment problem: when a reward arrives after a long sequence of actions, how do we determine which actions deserve credit?
Scientists needed a way to assign value not only to the final behavior but also to the intermediary actions that made the reward possible.
Richard Sutton's insight: temporal difference learning
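Sutton's insight was that learning should be driven by changes in the predicted future reward, not only by the reward itself. In other words, an action should be reinforced not because it immediately produces a reward, but because it improves the system's prediction of a reward.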
In the dog example, pressing the lever may not produce the chocolate directly. However, pressing the lever makes it possible for the dog to proceed to the next stage of the sequence. As a result, the probability of eventually obtaining the reward increases.
Any action along the path that creates a positive change in the prediction of future reward should itself be reinforced.
This was a significant departure from the prevailing view: rather than learning only from rewards, intelligent systems should learn from improvements in their expectations of future rewards.
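To make the idea concrete, here is a minimal sketch of temporal difference learning (the TD(0) variant) applied to the dog's 7-step sequence. This is my own toy setup, not something from the book: the learning rate, the discount factor, and the names `run_td0`, `ALPHA`, and `GAMMA` are all assumptions chosen for illustration. Each step keeps a prediction of future reward, and each prediction is nudged by the difference between what was expected and what came next.

```python
# Toy TD(0) sketch on a hypothetical 7-step chain (the dog's sequence).
# Steps are numbered 0..6; finishing step 6 yields a reward of 1.
# ALPHA and GAMMA are assumed values picked for illustration only.

N_STEPS = 7
ALPHA = 0.5   # learning rate (assumption)
GAMMA = 0.9   # discount factor (assumption)


def run_td0(episodes):
    """Return the predicted future reward for each step after training."""
    values = [0.0] * N_STEPS
    for _ in range(episodes):
        for state in range(N_STEPS):
            is_last = state == N_STEPS - 1
            reward = 1.0 if is_last else 0.0
            next_value = 0.0 if is_last else values[state + 1]
            # TD error: how much the prediction of future reward just changed.
            td_error = reward + GAMMA * next_value - values[state]
            values[state] += ALPHA * td_error
    return values


for n in (1, 2, 7, 50):
    print(n, [round(v, 2) for v in run_td0(n)])
```

In this sketch, only the final step has a non-zero value after the first episode. With each further episode the prediction spreads one step further back, until even pressing the lever carries a prediction of the chocolate to come.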
The actor-critic framework
To illustrate the point, Sutton, together with his colleagues Andrew Barto and Charles Anderson, proposed the modern computational actor-critic framework.
In this framework, two components work together:
The actor selects actions.
The critic evaluates those actions and provides feedback.
When an action improves the prediction of future reward, the critic generates positive feedback. The actor then becomes more likely to choose the same actions in the future (positive reinforcement).
Conversely, when an action reduces the likelihood of obtaining a reward, the critic generates negative feedback, causing the actor to avoid the same actions (negative reinforcement).
This framework provided a practical solution to the credit assignment problem and became one of the foundational ideas in reinforcement learning in machines.
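The actor and critic described above can be sketched in a few lines. This is a simplified tabular version of my own, not the original formulation by Sutton and his colleagues. I assume that at each of the 7 steps the dog can take either the correct action (which moves it forward) or a wrong one (which ends the attempt with no reward). The step sizes, the softmax policy, and the names `train`, `softmax`, `ALPHA_ACTOR`, and `ALPHA_CRITIC` are illustrative assumptions.

```python
import math
import random

N_STEPS = 7          # steps in the dog's sequence
ALPHA_CRITIC = 0.2   # critic learning rate (assumption)
ALPHA_ACTOR = 0.5    # actor learning rate (assumption)
GAMMA = 0.9          # discount factor (assumption)


def softmax(prefs):
    """Turn action preferences into probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes, seed=0):
    rng = random.Random(seed)
    values = [0.0] * N_STEPS                       # critic: predicted reward per step
    prefs = [[0.0, 0.0] for _ in range(N_STEPS)]   # actor: [wrong, correct] preferences
    for _ in range(episodes):
        state = 0
        while True:
            probs = softmax(prefs[state])
            action = 0 if rng.random() < probs[0] else 1
            if action == 0:                        # wrong action: attempt ends, no reward
                reward, next_value, done = 0.0, 0.0, True
            elif state == N_STEPS - 1:             # last correct action: chocolate
                reward, next_value, done = 1.0, 0.0, True
            else:                                  # correct action: move to the next step
                reward, next_value, done = 0.0, values[state + 1], False
            # Critic's feedback: did the prediction of future reward go up or down?
            td_error = reward + GAMMA * next_value - values[state]
            values[state] += ALPHA_CRITIC * td_error
            # Actor: positive feedback makes the chosen action more likely,
            # negative feedback makes it less likely.
            for a in (0, 1):
                grad = (1.0 if a == action else 0.0) - probs[a]
                prefs[state][a] += ALPHA_ACTOR * td_error * grad
            if done:
                break
            state += 1
    return values, prefs


values, prefs = train(5000)
print([round(softmax(p)[1], 2) for p in prefs])  # probability of the correct action per step
```

Notice that the actor never sees the chocolate directly. It only ever sees the critic's feedback, which is why the early steps can be learned even though the reward comes at the very end.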
But the question remained: Is that how the human brain actually works? At the time, nobody knew the answer.
Testing the theory: Dopamine and learning
While Sutton had hoped there was a connection between his idea and the brain, it was Peter Dayan, one of his colleagues, who found it. Dayan and his colleague Read Montague were convinced that the brain implemented some form of temporal difference learning.
However, experiments revealed something much more interesting. (See Wolfram Schultz’s experiments on macaque monkeys.)
Note: Again, this is my own example, not the one from the book.
Consider a dog that hears a bell before receiving a piece of chocolate. Initially, the dog has not learned the association between the bell and the reward. When the dog receives the chocolate, dopamine neurons exhibit a strong burst of activity.
But after repeated trials, something unexpected happens.
The dopamine burst gradually shifts from the reward itself to the cue that predicts the reward. Eventually, the strongest dopamine response occurs when the bell rings, not when the chocolate arrives.
This led scientists to an important insight: dopamine is not merely a reward signal. Instead, it functions as a learning signal, encoding what researchers call a reward prediction error, the difference between expected and actual outcomes.
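The shift from reward to cue falls out of temporal difference learning almost for free. Below is a toy simulation of the bell-and-chocolate example. It is my own simplified sketch, not Schultz's data and not the specific model Dayan and Montague built. I assume the bell arrives unpredictably (so the prediction before the bell stays at zero), that a fixed number of time steps separates the bell from the chocolate, and that there is no discounting. The names `simulate`, `N_DELAY`, `ALPHA`, and `GAMMA` are illustrative.

```python
N_DELAY = 5   # time steps between bell and chocolate (assumption)
ALPHA = 0.3   # learning rate (assumption)
GAMMA = 1.0   # no discounting, to keep the numbers easy to read (assumption)


def simulate(trials):
    """Return one (error_at_bell, error_at_chocolate) pair per trial."""
    values = [0.0] * N_DELAY   # predicted future reward at each step after the bell
    history = []
    for _ in range(trials):
        # The bell is a surprise, so the prediction just before it is 0.
        error_at_bell = GAMMA * values[0] - 0.0
        for t in range(N_DELAY):
            last = t == N_DELAY - 1
            reward = 1.0 if last else 0.0
            next_value = 0.0 if last else values[t + 1]
            # Reward prediction error: actual outcome minus expected outcome.
            error = reward + GAMMA * next_value - values[t]
            values[t] += ALPHA * error
        error_at_chocolate = error   # error from the final step, when the chocolate arrives
        history.append((error_at_bell, error_at_chocolate))
    return history


history = simulate(200)
for trial in (0, 5, 20, 199):
    bell, chocolate = history[trial]
    print(trial, round(bell, 2), round(chocolate, 2))
```

On the first trial the prediction error is zero at the bell and large at the chocolate. After many trials the pattern flips: the error sits at the bell, and the fully expected chocolate produces almost none.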
Dopamine as the critic and the basal ganglia as the actor
So in this framework:
Dopamine systems act as the critic, evaluating outcomes and generating learning signals when expectations change.
The basal ganglia act as the actor, selecting and reinforcing actions based on those signals.
Conclusion
There’s no larger purpose behind sharing this article. I am sure many of you, like me, get excited when you read something fascinating, and this post stemmed from that sense of fascination. With AI becoming such a central part of our everyday lives, I found it interesting to read a small part of the history behind its development.
We often think of neuroscience inspiring AI. In this case, however, AI research helped scientists formulate hypotheses about how the brain might solve a fundamental learning problem.
In other words, an idea developed to improve machines ended up revealing something about our brains. How exciting!!
