Reinforcement learning has gained considerable popularity with the relatively recent success of DeepMind’s AlphaGo in beating the world champion Go player. AlphaGo was trained in part by reinforcement learning on deep neural networks.
This style of learning is a distinct branch of machine learning, separate from the classical supervised and unsupervised paradigms. In reinforcement learning using deep neural networks, the network responds to environmental data (called the state) and controls the actions of an agent to try to optimise a reward.
This technique can teach a network to play games, such as Atari or other video games, or any other problem that can be recast as some form of game. In this tutorial, I will introduce the broad principles of Q learning, a common model of reinforcement learning, and I will demonstrate how to implement deep Q learning in TensorFlow.
Introduction to reinforcement learning
As mentioned above, reinforcement learning involves a few basic entities or concepts. They are: an environment that produces a state and a reward, and an agent that performs actions in the given environment. The diagram below shows this interaction:
The task of the agent in such a setting is to examine the state and the reward information it receives and pick an action that maximises the reward it receives. The agent learns by repeated interaction with the environment or, in other words, by repeatedly playing the game.
In order to succeed, it is necessary for the agent to:
1. Learn the link between states, actions and resulting rewards
2. Determine which is the best action to pick, based on (1)
Implementing (1) requires defining a set of values that can be used to inform (2), and (2) is referred to as the action policy. One of the most common ways of implementing (1) and (2) with deep learning is the Deep Q network combined with the epsilon-greedy policy.
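The epsilon-greedy policy is not spelled out in this article, so the following is only a minimal sketch of its usual definition: with probability epsilon the agent explores a random action, otherwise it takes the action with the largest known value. The function name `epsilon_greedy` and its parameters are illustrative choices, not taken from the original.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q values.

    With probability epsilon a random action is explored; otherwise the
    action with the largest Q value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the agent never explores, so the greedy action is returned.
greedy_action = epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.0)
```

In practice epsilon is often started high and reduced over time, so that the agent explores early on and relies on what it has learned later.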
Q learning is a value-based method of supplying information that tells an agent which action to take. An initially intuitive idea for creating values on which to base actions is to build a table that sums up the rewards of taking each action in each state over many plays of the game. This keeps track of which moves are the most beneficial. To start, let’s consider a simple game with 3 states and two possible actions in each state. The rewards for this game can be represented in a table:
| |Action 1|Action 2|
|---|---|---|
|State 1|0|10|
|State 2|10|0|
|State 3|0|10|
You can see in the table above that for this simple game, when the agent is in State 1 and takes Action 2, it receives a reward of 10, but it receives zero reward if it takes Action 1. In State 2 the situation is reversed, and State 3 again resembles State 1. If an agent explored this game at random and tallied up which actions obtained the most reward in each of the three states (storing this knowledge in an array, say), it would effectively learn the functional form of the table above.
In other words, if the agent simply chose the action that it had learned in the past gave the highest reward (effectively learning some form of the table above), it would have learned how to play the game successfully. If it is possible to simply build tables by summation, why do we need fancy ideas like Q learning and then neural networks?
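The tallying idea can be sketched in a few lines. This is only an illustration under stated assumptions: the rewards are those described for the three-state game (10 for Action 2 in States 1 and 3, 10 for Action 1 in State 2, zero otherwise), and the agent is assumed to land in a uniformly random state on each play, which the article does not specify. Indices are zero-based, so index 1 means Action 2.

```python
import random

# Rewards for the simple game: REWARDS[state][action].
REWARDS = [
    [0, 10],   # State 1
    [10, 0],   # State 2
    [0, 10],   # State 3
]

def learn_by_tallying(num_plays=1000, seed=0):
    """Explore at random and sum the reward seen for each (state, action)."""
    rng = random.Random(seed)
    totals = [[0, 0] for _ in REWARDS]
    for _ in range(num_plays):
        state = rng.randrange(len(REWARDS))   # assumed: random starting state
        action = rng.randrange(2)             # pure random exploration
        totals[state][action] += REWARDS[state][action]
    return totals

totals = learn_by_tallying()
# The learned policy: in each state, pick the action with the largest tally.
best_actions = [row.index(max(row)) for row in totals]
```

After enough random plays, `best_actions` picks Action 2 in States 1 and 3 and Action 1 in State 2, which is exactly the behaviour the reward table rewards.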
Well, the first obvious answer is that the game above is very simple, with just 3 states and 2 actions per state. Real games are significantly more complex. The other significant concept missing from the example above is delayed reward. To play most realistic games properly, an agent has to learn to take actions that may not lead to an immediate reward, but may result in a large reward later down the road.
|Action 1||Action 2|
If Action 2 is taken in any state of the game above, the agent moves back to State 1, i.e. it goes back to the beginning. In States 1 to 3, it also receives a reward of 5 when it does so. If, however, Action 1 is taken in States 1 to 3, the agent moves to the next state, but it does not receive a reward until it reaches State 4, at which point it receives a reward of 20.
In other words, an agent is better off if it does not take Action 2 to get an immediate reward of 5, but instead chooses Action 1 repeatedly to progress through the states and collect the reward of 20. The agent needs to be able to pick actions that result in a delayed reward when the delayed reward is sufficiently large.
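To make the comparison concrete, here is a small simulation of the four-state game. It rests on two assumptions that the text leaves open: the reward of 20 is paid at the moment the agent arrives in State 4, and the comparison runs over three moves starting from State 1. The function names are illustrative.

```python
def step(state, action):
    """Return (next_state, reward) for the four-state game.

    States are numbered 1 to 4 and actions 1 and 2.
    Assumption: the reward of 20 is paid on arriving in State 4.
    """
    if action == 2:
        # Action 2 sends the agent back to the start, paying 5 from States 1-3.
        return 1, (5 if state in (1, 2, 3) else 0)
    next_state = state + 1
    return next_state, (20 if next_state == 4 else 0)

def total_reward(policy_action, num_steps=3):
    """Total reward from State 1 when the same action is taken every step."""
    state, total = 1, 0
    for _ in range(num_steps):
        state, reward = step(state, policy_action)
        total += reward
    return total

always_action_2 = total_reward(2)  # 5 + 5 + 5 = 15
always_action_1 = total_reward(1)  # 0 + 0 + 20 = 20
```

Over the same three moves, the patient strategy earns 20 while the greedy one earns 15, even though the greedy one looks better after every single step.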
The Q learning rule
This leads us to define the Q learning rule. In deep Q learning, the neural network needs to take the current state, s, as a vector and return a Q value for each possible action, a, in that state, i.e. it needs to return Q(s, a) for all s and a. This Q(s, a) is updated during training through the following rule:
$$Q(s,a) = Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
This update rule needs a bit of unpacking. First, you can see that the new value of Q(s, a) is obtained by adding some extra terms, on the right-hand side of the equation above, to its existing value. Moving left to right, ignore the α for a moment. Inside the square brackets, the first term is r, which stands for the reward received for taking action a in state s.
This is the immediate reward; no delayed gratification is involved yet. The next term is the delayed reward estimate. First of all, we have the γ value, which discounts the effect of the delayed reward and is always between 0 and 1. More on that in a second. The next term, max_a' Q(s', a'), is the maximum Q value available in the next state.
Let’s make that a little clearer: the agent starts in state s, takes action a, ends up in state s', and then the code determines the maximum Q value in state s', i.e. max_a' Q(s', a'). Why is the max_a' Q(s', a') value taken into account? Because it represents the maximum future reward available to the agent if it takes action a in state s.
However, γ discounts this value to account for the fact that waiting forever for a future reward is not desirable for the agent; it is better for the agent to aim for the largest reward in the least amount of time. Notice that the Q(s', a') value also implicitly contains the maximum discounted reward for the state after that, i.e. Q(s'', a''), which in turn contains the discounted reward for the state after that, Q(s''', a'''), and so on.
In this way, the agent selects its action not only on the basis of the immediate reward r, but also on the basis of possible future discounted rewards. As for α, it is the learning rate, which scales how large each update to Q(s, a) is.
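As a rough sketch, a single application of the Q learning rule to a small table of Q values could look like this. The table layout (`q_table[state][action]`), the function name and the default values of `alpha` and `gamma` are assumptions made for illustration only.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q learning update in place and return the new Q(s, a)."""
    best_next = max(q_table[next_state])       # max over a' of Q(s', a')
    target = reward + gamma * best_next        # r + gamma * max Q(s', a')
    # Move Q(s, a) a fraction alpha of the way towards the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

# Two states, two actions. The best value available in state 1 is 2.0.
q_table = [[0.0, 0.0], [0.0, 2.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.5)
# target = 1.0 + 0.5 * 2.0 = 2.0, so Q(0, 1) moves halfway from 0.0 to 2.0.
```

Here the new Q(0, 1) is 1.0: the immediate reward of 1.0 plus the discounted best value of the next state gives a target of 2.0, and the learning rate of 0.5 moves the old value halfway there.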
Deep Q learning
Deep Q learning applies the Q learning update rule during the training phase. In other words, a neural network is created that takes state s as its input, and the network is then trained to output appropriate Q(s, a) values for each action in state s. The agent’s action is then chosen by taking the action with the largest Q(s, a) value (by taking an argmax of the output of the neural network). This can be seen in the first step of the diagram below:
Action selecting and training steps – Deep Q learning
Once this step has been completed and an action has been selected, the agent carries out the action. The agent then receives feedback on what reward is given for taking that action from that state. In keeping with the Q learning rule, the next step is to train the network. This can be seen in the second part of the diagram above.
The state vector s is the x input array for network training, and the y output training sample is the Q(s, a) vector retrieved during the action selection step. However, one of the Q(s, a) values, the one corresponding to the chosen action a, is set to a target of r + γ max_a' Q(s', a'), in line with the Q learning rule. By training the network in this way, the Q(s, a) output vector from the network will improve over time at telling the agent which action is best to select for its long-term benefit.
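The construction of one training sample can be sketched without committing to any particular network library. In the snippet below, `predict` is a hypothetical stand-in for the neural network (any callable that maps a state to a list of Q values, one per action), and the toy lookup table used as the "network" is invented purely for the example.

```python
def build_training_pair(predict, state, action, reward, next_state, gamma=0.95):
    """Return (x, y) for one training step of a deep Q network.

    `predict` is a hypothetical stand-in for the network: it maps a state
    vector to a list of Q values, one per action.
    """
    target = list(predict(state))              # start from the current Q(s, .)
    # Only the entry for the action actually taken receives a new target.
    target[action] = reward + gamma * max(predict(next_state))
    return state, target

# Toy stand-in "network": a fixed lookup keyed by the state tuple.
fake_q = {(1.0, 0.0): [0.5, 0.25], (0.0, 1.0): [2.0, 4.0]}
x, y = build_training_pair(fake_q.get, (1.0, 0.0), action=0, reward=1.0,
                           next_state=(0.0, 1.0), gamma=0.5)
# y[0] = 1.0 + 0.5 * 4.0 = 3.0; y[1] keeps the network's own estimate, 0.25.
```

Leaving the other entries of y equal to the network’s current output means those actions contribute no error, so training only adjusts the estimate for the action that was actually tried.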
Pros of Reinforcement Learning
- Reinforcement learning can be used to solve very complex problems that cannot be solved by conventional approaches.
- This technique is preferred when the goal is long-term results, which are very difficult to achieve otherwise.
- This learning pattern is similar to the way human beings learn, through trial, error and feedback.
- The model can correct the mistakes that occurred during the training phase.
- Once the model has corrected an error, the chances of the same mistake occurring again are lower.
- It can produce a model that is well suited to the particular problem being solved.
Cons of Reinforcement Learning
- Reinforcement learning as a framework is wrong in many different respects, but it is precisely this quality that makes it useful.
- Too much reinforcement learning can lead to an overload of states, which can diminish the results.
- Reinforcement learning is not the preferred choice for solving simple problems.
- Reinforcement learning requires a great deal of data and a great deal of computation. It is data-hungry. That is why it works so well in video games: the game can be played over and over again, so it is feasible to gather a lot of data.
- Reinforcement learning assumes that the world is Markovian, which it is not. A Markovian model describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
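The Markov assumption in the last point can be written compactly. As a sketch, with $s_t$ denoting the state at step $t$ (notation introduced here for illustration, not taken from the article), it says that the full history adds nothing beyond the most recent state:

$$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)$$

When the real environment has hidden variables or long-range dependencies, this equality does not hold, which is the limitation the point above refers to.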