How experiments in AI influenced our understanding of the human brain

09.06.26 07:14 PM - By Vikram

I am currently reading Max Bennett's "A Brief History of Intelligence" - which is profoundly engaging - in which Bennett charts the major evolutionary breakthroughs in the development of intelligence. With each chapter, he offers fascinating insights into how intelligence, as we understand it today, emerged over time, and in the process, draws parallels between human intelligence and AI. Along the way, he records the progress made so far with AI and highlights the pitfalls we may have fallen into in pursuit of human-like intelligence.

The first breakthrough was steering - the ability to move towards a target (actively hunt), aided by bilateral symmetry. In this post, however, I'd like to discuss the second breakthrough, "reinforcement learning", because it underscores how experiments in AI have also influenced the study and understanding of the brain.

The relationship between AI and neuroscience is often imagined as a one-way street, where understanding the brain leads to the development of AI. But here is an example of how building machines helped scientists better understand the human brain. It’s an interesting story worth sharing.

The second breakthrough: reinforcement learning

After steering, the second breakthrough in our evolutionary history of intelligence was reinforcement learning, a capability developed by the first vertebrates. But first, what is RL?

You can define it in multiple ways, but at a basic level, reinforcement learning refers to an animal's tendency, built up through trial and error, to pursue behaviors that lead to rewards and avoid behaviors that don’t. Scientists like Marvin Minsky and Edward Thorndike believed that this mechanism was fundamental to how humans and animals learn.

The hypothesis was simple - an animal is likely to repeat a behavior that produces a satisfying outcome and unlikely to repeat one that produces a discomforting outcome. Therefore, the behavior that leads to a reward gets reinforced (the animal learns to repeat it), while the behavior that doesn’t gets dropped.


As scientists developed AI systems, they believed they could rely on the same seemingly straightforward logic - an AI should strengthen behaviors that lead to rewards and weaken those that do not.

Minsky built SNARC, one of the world's first neural networks, based on this principle. However, SNARC failed miserably at completing even simple tasks or winning simple games. This left Minsky and other scientists of the day confused. Trial-and-error learning seemed intuitive and straightforward, yet it was barely working in machines. They soon realized the problem.

The temporal credit assignment problem

The challenge was in determining which behavior deserved credit and should therefore be reinforced.

Note: The author gives different examples in the book. I'll give a simple one.

Consider a dog that must perform a sequence of seven actions before receiving a reward. It must first press a lever, then run through a tunnel, and perform several other actions before finally choosing the green ball over the red one to obtain a piece of chocolate.

Assume that only the final action (the dog picking the green ball over the red one) is reinforced, because it immediately preceded the reward. But the dog would never have reached the final stage if it had not performed the earlier actions correctly.

For example, if the dog had not pressed the lever at the beginning, it might never have been able to enter the tunnel and complete the remaining steps. In that sense, the first action was just as important as the last. If only the final action is reinforced, the dog will not learn the sequence of steps that led to the reward. Essentially, the reward is not the result of a single step executed correctly. It's the result of a sequence. So how should credit be distributed?

This became known as the temporal credit assignment problem: when a reward arrives after a long sequence of actions, how do we determine which actions deserve credit?

Scientists needed a way to assign value not only to the final behavior but also to the intermediary actions that made the reward possible.
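To make the problem concrete, here is a minimal Python sketch of a learner that credits only the action right before the reward. The step names (beyond the lever, the tunnel and the green ball), the learning rate and the update rule are my own assumptions for illustration, not something taken from the book or from SNARC.

```python
# Hypothetical seven-step task; the middle step names are placeholders.
STEPS = ["press_lever", "run_tunnel", "step_3", "step_4",
         "step_5", "step_6", "pick_green_ball"]

def naive_reinforce(num_trials, learning_rate=0.5):
    """Strengthen only the action that immediately preceded the reward."""
    strength = {step: 0.0 for step in STEPS}
    for _ in range(num_trials):
        # The dog completes the whole sequence and gets the chocolate.
        reward = 1.0
        last_action = STEPS[-1]
        # Only the final action moves toward the reward; nothing else changes.
        strength[last_action] += learning_rate * (reward - strength[last_action])
    return strength

strengths = naive_reinforce(num_trials=20)
for step, value in strengths.items():
    print(f"{step:16s} {value:.3f}")
```

No matter how many trials you run, only the last step gains strength. Pressing the lever, which made everything else possible, never gets any credit.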

Richard Sutton's insight: temporal difference learning 

To address this problem, Richard Sutton proposed a powerful idea that later became the foundation of reinforcement learning in AI.

Instead of reinforcing behaviors with actual rewards, what if you reinforced behaviors with predicted rewards?

In other words, an action should be rewarded not because it immediately produces a reward, but because it improves the system's prediction of a reward.

In the dog example, pressing the lever may not produce the chocolate directly. However, pressing the lever makes it possible for the dog to proceed to the next stage of the sequence. As a result, the probability of eventually obtaining the reward increases.

Any action along the path that creates a positive change in the prediction of future reward should itself be reinforced.

This was a significant departure from the prevailing view: rather than learning only from rewards, intelligent systems should learn from improvements in their expectations of future rewards.
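A rough sketch of this idea in Python, assuming the simplest tabular form of temporal difference learning (often called TD(0)) applied to the seven-step dog task. The learning signal is the change in prediction: reward + gamma * (prediction at the next step) - (prediction at the current step). The function name, alpha, gamma and the number of episodes are illustrative assumptions of mine.

```python
def td_chain(num_steps=7, episodes=200, alpha=0.5, gamma=0.9):
    """Learn a reward prediction for every step of a fixed action sequence."""
    # One prediction per step, plus a terminal entry that always stays at 0.
    values = [0.0] * (num_steps + 1)
    for _ in range(episodes):
        for state in range(num_steps):
            next_state = state + 1
            # The chocolate arrives only after the final action.
            reward = 1.0 if next_state == num_steps else 0.0
            # TD error: how much better or worse things look after this step.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values[:num_steps]

values = td_chain()
for step, value in enumerate(values, start=1):
    print(f"step {step}: predicted future reward = {value:.3f}")
```

With these settings the prediction at the last step settles near 1, and each earlier step settles at a slightly discounted version of the step after it. So even pressing the lever ends up with a positive predicted reward, which is exactly what the naive approach could not give it.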

The actor-critic framework 

To illustrate the point, Sutton, alongside his colleagues Andrew Barto and Charles Anderson, proposed the modern computational actor-critic framework.

In this framework, two components work together:

  • The actor selects actions.

  • The critic evaluates those actions and provides feedback.

When an action improves the prediction of future reward, the critic generates positive feedback. The actor then becomes more likely to choose the same actions in the future (positive reinforcement).

Conversely, when an action reduces the likelihood of obtaining a reward, the critic generates negative feedback, causing the actor to avoid the same actions (negative reinforcement).

This framework provided a practical solution to the credit assignment problem and became one of the foundational ideas in reinforcement learning in machines.
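Here is a toy actor-critic loop in Python, written as a sketch under my own assumptions rather than as the original 1983 design. At each of seven steps the "dog" can take the correct action (move on) or a wrong one (the trial ends with no chocolate). The critic keeps a reward prediction per step, and the actor keeps a preference per action that is nudged by the critic's feedback. All names, learning rates and the seed are hypothetical.

```python
import math
import random

def train_actor_critic(num_steps=7, episodes=20000, actor_lr=0.5,
                       critic_lr=0.1, gamma=0.9, seed=0):
    """Toy actor-critic on a chain: action 0 advances, action 1 ends the trial."""
    rng = random.Random(seed)                       # seeded so runs are repeatable
    values = [0.0] * num_steps                      # critic: predicted future reward
    prefs = [[0.0, 0.0] for _ in range(num_steps)]  # actor: action preferences

    def prob_correct(state):
        # Two-action softmax over the actor's preferences.
        gap = prefs[state][0] - prefs[state][1]
        return 1.0 / (1.0 + math.exp(-gap))

    for _ in range(episodes):
        state = 0
        while True:
            action = 0 if rng.random() < prob_correct(state) else 1
            if action == 0:
                next_state = state + 1
                done = next_state == num_steps
                reward = 1.0 if done else 0.0
                next_value = 0.0 if done else values[next_state]
            else:
                # Wrong action: the trial ends with no chocolate.
                done, reward, next_value = True, 0.0, 0.0
            # Critic's feedback: did the prediction get better or worse?
            td_error = reward + gamma * next_value - values[state]
            values[state] += critic_lr * td_error        # critic refines its prediction
            prefs[state][action] += actor_lr * td_error  # actor shifts its preference
            if done:
                break
            state = next_state

    return [prob_correct(s) for s in range(num_steps)], values

probabilities, values = train_actor_critic()
for step, p in enumerate(probabilities, start=1):
    print(f"step {step}: P(correct action) = {p:.3f}")
```

The thing to notice is that the actor never sees the chocolate directly. It only sees the critic's feedback, and that is enough for the correct action to become the preferred one at every step, including the very first.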

But the question remained: Is that how the human brain actually works? At the time, nobody knew the answer.

Testing the theory: Dopamine and learning 

While Sutton had hoped there was a connection between his idea and the brain, it was Peter Dayan, one of his colleagues, who found it. Dayan and his colleague Read Montague were convinced that the brain implemented some form of temporal difference learning mechanism.

To investigate this, they turned their attention to dopamine.

At the time, and even now in popular conversation, dopamine was generally understood as a reward or pleasure molecule. Researchers knew that dopamine activity increased when animals received rewards, so it seemed natural to associate dopamine with pleasure.

However, experiments revealed something much more interesting. (Wolfram Schultz's experiments on macaque monkeys are worth reading up on.)

Note: Again, this is my own example, not the one from the book.

Consider a dog that hears a bell before receiving a piece of chocolate. Initially, the dog has not learned the association between the bell and the reward. When the dog receives the chocolate, dopamine neurons exhibit a strong burst of activity.

But after repeated trials, something unexpected happens.

The dopamine burst gradually shifts from the reward itself to the cue that predicts the reward. Eventually, the strongest dopamine response occurs when the bell rings, not when the chocolate arrives.

This led scientists to an important insight: dopamine is not merely a reward signal. Instead, it functions as a learning signal, encoding what researchers call a reward prediction error, the difference between expected and actual outcomes.

The dopamine responses align closely with Sutton’s temporal difference learning signal. In other words, dopamine helps the brain learn whether the world is better or worse than expected.
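The shift from the chocolate to the bell falls out of the temporal difference signal almost for free. Below is a minimal Python sketch of the bell-and-chocolate example, assuming the bell arrives unpredictably (so the prediction before the bell is held at zero) and treating the TD error as a stand-in for the dopamine burst. The function name and parameters are my own illustrative choices, not a model taken from the book or from the original experiments.

```python
def simulate_bell_and_chocolate(num_trials=100, alpha=0.2, gamma=1.0):
    """Return (bell_error, chocolate_error) for each trial."""
    value_of_bell = 0.0      # learned prediction of reward once the bell rings
    value_of_baseline = 0.0  # assumption: the bell cannot be predicted in advance
    history = []
    for _ in range(num_trials):
        # Bell rings: no reward yet, but the prediction may jump.
        bell_error = 0.0 + gamma * value_of_bell - value_of_baseline
        # Chocolate arrives and the trial ends (nothing further to predict).
        chocolate_error = 1.0 + gamma * 0.0 - value_of_bell
        value_of_bell += alpha * chocolate_error
        history.append((bell_error, chocolate_error))
    return history

history = simulate_bell_and_chocolate()
first_bell, first_chocolate = history[0]
last_bell, last_chocolate = history[-1]
print(f"first trial: bell={first_bell:.2f}, chocolate={first_chocolate:.2f}")
print(f"last trial:  bell={last_bell:.2f}, chocolate={last_chocolate:.2f}")
```

On the first trial the whole signal sits at the chocolate. By the last trial it sits at the bell, and the chocolate itself, now fully expected, produces almost none.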

OK, the critic has been found. But who is the actor?

Dopamine as the Critic and the Basal Ganglia as the Actor

The basal ganglia is the seat of habits, which are automated motor responses. Through repeated feedback from dopamine-based learning signals, it gradually strengthens behaviors that increase the likelihood of future rewards and weakens behaviors that do not.

So in this framework:

  • Dopamine systems act as the critic, evaluating outcomes and generating learning signals when expectations change.

  • The basal ganglia acts as the actor, selecting and reinforcing actions based on those signals.

Conclusion

There’s no larger purpose behind sharing this article. I am sure many of you, like me, get excited when you read something fascinating, and the idea for this post stemmed from that sense of fascination. With AI becoming such a central part of our everyday lives, I found it interesting to read a small part of the history behind its development.

We often think of neuroscience inspiring AI. In this case, however, AI research helped scientists formulate hypotheses about how the brain might solve a fundamental learning problem.

In other words, an idea developed to improve machines ended up revealing something about our brains. How exciting!!

Vikram
