Newton’s third law states that “every action has an equal and opposite reaction.” But that’s not the end of the story. Every action and its subsequent reaction (and the reaction to the reaction, and so on…) creates a chain of cause and effect that, if we can see it, allows us to understand how one slight change at the top of the chain causes all sorts of different outcomes further along.
In data science, this is what the principle of reinforcement learning is all about – and it’s how we train our AI to do its job better.
At MiQ, we do this to get better results for advertising campaigns. But reinforcement learning is actually something we all do naturally from the day we are born. Think of it like this. When an infant does something (fill in your choice of cute baby action), they see that it leads to a result (fill in your action/reaction which may or may not be quite as cute). When they do this action over and over again, and see that the result is almost always the same, they’ve embarked on their journey of reinforcement learning. And of course, this continues day in and day out throughout our lives. We become who we are based on the evolving neural pathways that are formed as a result of all this continuous learning.
This process of learning through interaction is broadly applicable to all intelligent systems – biological and organizational, as well as computational. In some sense, the ability to learn from interaction and experience, and to adapt to an evolving environment, is what defines intelligence.
Reinforcement learning is always based on three major principles: 1) an agent interacts with its environment by taking actions, 2) each action generates feedback from that environment in the form of a reward (or penalty), and 3) the agent aims to learn which actions maximize its cumulative reward over time.
Together, these three aspects differentiate reinforcement learning from other forms of machine learning, such as supervised and unsupervised learning. The latter are focused on uncovering the underlying patterns in data generated from a given scenario, without trying to achieve any goal by taking relevant actions.
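To make these principles concrete, here’s a minimal sketch of the reinforcement learning loop in Python. The click probabilities, class names and action labels are all made up for illustration; this is not a real ad-serving API, just the bare act-observe-update cycle.

```python
import random


class Environment:
    """Toy environment: each action ('creative') has a hidden click probability."""

    def step(self, action):
        click_probability = {"A": 0.03, "B": 0.01, "C": 0.02}[action]
        return 1.0 if random.random() < click_probability else 0.0  # reward


class Agent:
    def __init__(self, actions):
        self.actions = actions
        self.value_estimates = {a: 0.0 for a in actions}  # learned value of each action
        self.counts = {a: 0 for a in actions}

    def choose(self):
        # Try anything untried once, then pick the action with the highest
        # estimated value (exploration is covered later in the post).
        untried = [a for a in self.actions if self.counts[a] == 0]
        if untried:
            return random.choice(untried)
        return max(self.actions, key=lambda a: self.value_estimates[a])

    def learn(self, action, reward):
        # Incremental average of the rewards observed for this action.
        self.counts[action] += 1
        self.value_estimates[action] += (reward - self.value_estimates[action]) / self.counts[action]


env, agent = Environment(), Agent(["A", "B", "C"])
for _ in range(10_000):
    action = agent.choose()      # 1) the agent takes an action
    reward = env.step(action)    # 2) the environment returns a reward
    agent.learn(action, reward)  # 3) the agent adapts to earn more reward over time
```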
For example, in programmatic advertising, if we wanted to learn which features (user attributes, device/browser type, geolocation, time of day, etc.) lead to ad clicks, we would use a supervised or unsupervised learning solution to get the answer. On the other hand, if we wanted to maximize the click-through rate by placing slightly different versions of the same ad at different times of the day (for example), we would approach it as a reinforcement learning problem.
Here’s how it might work. Let’s say we’re running a campaign for a luxury car brand. We may start with three different creatives (ad images) highlighting three different aspects of the car, for instance 1) driving pleasure and performance, 2) safety and 3) the design aesthetics. We start off by showing all of them equally throughout the day. But, based on the clicks that we see in the initial ‘learning’ phase of the campaign, the reinforcement learning algorithm may find that the ‘driving pleasure and performance’ ad gets more clicks during the daytime, while the ‘safety’ one gets more during the evening hours. Based on these learnings, we can then start to rebalance the proportion of each ad served during different hours of the day to create better engagement with the target audience.
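In code, that learning phase could look something like the simplified sketch below: count impressions and clicks for each creative in each part of the day, then reweight how often each creative is served in proportion to its observed click-through rate. The creative labels and the ‘daypart’ grouping are illustrative assumptions, not how our platform actually segments traffic.

```python
from collections import defaultdict

CREATIVES = ["performance", "safety", "design"]  # illustrative labels

impressions = defaultdict(int)  # (creative, daypart) -> impressions served
clicks = defaultdict(int)       # (creative, daypart) -> clicks observed


def record(creative, daypart, clicked):
    """Log one served impression and whether it was clicked."""
    impressions[(creative, daypart)] += 1
    if clicked:
        clicks[(creative, daypart)] += 1


def ctr(creative, daypart):
    shown = impressions[(creative, daypart)]
    return clicks[(creative, daypart)] / shown if shown else 0.0


def serving_weights(daypart):
    """Rebalance: serve each creative in proportion to its CTR for this daypart."""
    rates = {c: ctr(c, daypart) for c in CREATIVES}
    total = sum(rates.values())
    if total == 0:
        # Initial learning phase: no clicks yet, so show all creatives equally.
        return {c: 1 / len(CREATIVES) for c in CREATIVES}
    return {c: r / total for c, r in rates.items()}
```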
One of the classic challenges that arises in reinforcement learning (like in life) is the trade-off between exploration and exploitation. In order to find actions that can increase the end reward, the reinforcement learning algorithm must first explore different actions to figure out what works best.
Once a set of such actions is found, it can start exploiting these actions to generate higher and higher reward. However, given uncertainties and variations in the environment, there are no guarantees that today’s optimal actions will remain the best approach in the future. So, the algorithm must simultaneously continue exploring and evaluating other sets of actions that may be needed later. By exploring some ‘sub-optimal’ actions, the system forgoes some present reward (that it would have received had it purely exploited the current best actions) in the expectation of receiving higher rewards in the future.
To go back to the luxury car campaign, this is why the algorithm will never just display one ad 100% of the time and ignore the other two. Instead, based on clicks, the reinforcement learning algorithm might serve the ‘driving pleasure’ creative 90% of the time during the day, while serving the other two creatives 5% each. The first part is exploitation, doing what’s working best right now, while the second is exploration, to make sure we keep learning.
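One simple way to get exactly that 90/5/5 behaviour is an epsilon-greedy rule: with probability 0.9, serve the creative that currently looks best; otherwise, serve one of the others at random. Here’s a sketch, where the 10% exploration budget and the click-through-rate numbers are purely illustrative:

```python
import random

EPSILON = 0.10  # share of impressions reserved for exploration (illustrative)


def pick_creative(ctr_by_creative):
    """Exploit the best-performing creative ~90% of the time;
    explore the remaining creatives ~5% each (with three creatives)."""
    best = max(ctr_by_creative, key=ctr_by_creative.get)
    if random.random() < EPSILON:
        return random.choice([c for c in ctr_by_creative if c != best])  # explore
    return best  # exploit


# Daytime CTR estimates from the learning phase (made-up numbers).
daytime_ctrs = {"performance": 0.031, "safety": 0.012, "design": 0.018}
print(pick_creative(daytime_ctrs))
```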
Due to external events and/or campaigns run by competitors (the uncertain external environment), at some point during the campaign we may suddenly see the third creative, ‘design aesthetics’, start picking up more clicks. At that point, the learning element kicks in again and the creative proportions realign to something more like 70-5-25. In that way, we strike the right balance between making the most of what’s working now and staying flexible enough to meet a changing environment.
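One way to stay responsive to that kind of shift is to weight recent feedback more heavily than old feedback, for example with an exponentially weighted click-through-rate estimate. In the sketch below, the step size is chosen purely for illustration: old clicks gradually fade out, so if ‘design aesthetics’ starts outperforming, its estimate (and therefore its share of impressions) climbs within a few thousand impressions.

```python
STEP_SIZE = 0.01  # weight on each new observation; higher = faster adaptation (illustrative)

# Recency-weighted CTR estimates: each new impression nudges the estimate
# toward the latest outcome, so the numbers track a changing environment.
ctr_estimates = {"performance": 0.031, "safety": 0.012, "design": 0.018}


def update(creative, clicked):
    reward = 1.0 if clicked else 0.0
    ctr_estimates[creative] += STEP_SIZE * (reward - ctr_estimates[creative])
```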
This holistic approach of maximizing current plus expected future reward in an uncertain and dynamic environment makes reinforcement learning more useful than static, point-in-time pattern recognition approaches. And finally, it also ties in with the broader “sense-plan-act” paradigm of artificial intelligence. But that’s a topic for another blog!