
Reinforcement Learning With Tensorflow Agents [2024]

Last updated:
30th Sep, 2022
Reinforcement learning has gained considerable popularity since the success of DeepMind’s AlphaGo system in beating the world champion Go player. AlphaGo was trained in part by reinforcement learning on deep neural networks.

This style of learning is distinct from the classical supervised and unsupervised paradigms of machine learning. In reinforcement learning, a deep neural network responds to environmental data (called the state) and influences the actions of an agent in an attempt to maximise a reward.

This technique lets a network learn how to play games, such as Atari or other video games, or any other problem that can be recast as a form of game. In this tutorial, I will introduce the broad principles of Q learning, a common model of reinforcement learning, and demonstrate how to implement deep Q learning in TensorFlow.

Introduction to reinforcement learning

As mentioned above, reinforcement learning involves a few basic entities: an environment, which produces a state and a reward, and an agent, which performs actions in that environment. In the diagram below, you can see this interaction:

The task of the agent in such a setting is to analyse the state and reward information it receives and pick an action that maximises the reward it obtains. The agent learns by repeated interaction with the environment, or, in other words, by repeated playing of the game.

In order to succeed, it is necessary for the agent to:

1. Learn the link between states, actions and the resulting rewards

2. Determine the best action to pick using (1)

Implementing (1) involves defining a set of values that can be used to inform (2), and (2) is referred to as the action policy. One of the most common ways of implementing (1) and (2) is the combination of a deep Q network and an epsilon-greedy policy.
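As a rough sketch of how an epsilon-greedy policy chooses between exploration and exploitation (the Q values below are illustrative, not from a trained network):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([1.0, 5.0, 2.0])  # illustrative Q values for one state
greedy = epsilon_greedy_action(q_values, 0.0, rng)  # epsilon = 0 always exploits
```

In practice, epsilon usually starts high and is decayed over training, so the agent explores early on and exploits what it has learned later.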

Learn: Most Popular 5 TensorFlow Projects for Beginners

Q learning

Q learning is a value-based method of supplying information that tells an agent which action to take. An intuitively appealing starting point for generating values on which to base actions is to build a table that sums up the rewards of taking each action, in each state, over many plays of the game. This table keeps track of which moves are the most beneficial. For starters, let's consider a simple game with 3 states and two possible actions in each state. A table of the rewards for this game might look like this:

          Action 1    Action 2
State 1   0           10
State 2   10          0
State 3   0           10
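The tallying idea can be sketched as accumulating observed rewards per (state, action) pair over random play (the reward matrix below mirrors the table above):

```python
import numpy as np

# Reward table from the text: rows are States 1-3, columns are Actions 1-2.
rewards = np.array([[0, 10],
                    [10, 0],
                    [0, 10]])

# Tally the total reward seen for each (state, action) over random play.
rng = np.random.default_rng(42)
tally = np.zeros_like(rewards, dtype=float)
for _ in range(1000):
    s = int(rng.integers(3))  # random state
    a = int(rng.integers(2))  # random action
    tally[s, a] += rewards[s, a]

# The learned best action per state (0-indexed: 1 means Action 2, etc.).
best_actions = tally.argmax(axis=1)
```

After enough random plays, `best_actions` recovers the practical form of the table: Action 2 in States 1 and 3, Action 1 in State 2.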

You can see in the table above that, for this simple game, when the agent is in State 1 and takes Action 2, it receives a reward of 10, but if it takes Action 1 it receives zero reward. In State 2 the situation is reversed, and State 3 resembles State 1. If an agent arbitrarily explored this game and tallied up which actions earned the most reward in each of the three states (storing this knowledge in an array, say), it would effectively learn the practical form of the table above.

In other words, if the agent simply selected the action that had yielded the highest reward in the past (having effectively learned some form of the table above), it would be playing the game effectively. If it is enough to simply build tables by summation, why do we need fancier ideas like Q learning and neural networks?

Deferred reward

Well, the first obvious answer is that the game above is very simple, with just 3 states and 2 actions per state. Real games are significantly more complex. The other significant concept absent from the example above is the principle of delayed reward. To play most realistic games well, an agent has to learn to take actions which may not lead to an immediate reward, but which lead to a large reward later down the road. Consider the following game:

          Action 1    Action 2
State 1   0           5
State 2   0           5
State 3   0           5
State 4   20          0

If Action 2 is taken in any state of the game above, the agent moves back to State 1, i.e. it returns to the beginning. In States 1 to 3 it also receives a reward of 5 for doing so. If, instead, Action 1 is taken in States 1-3, the agent moves on to the next state, but receives no reward until it reaches State 4, at which point it receives a reward of 20.

In other words, the agent is better off if it forgoes the instantaneous reward of 5 from Action 2 and instead chooses Action 1 to progress through the states towards the reward of 20. The agent needs to be able to pick actions that lead to a delayed reward whenever that delayed reward is large enough.
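To make the trade-off concrete, here is a quick sketch comparing the cumulative reward of the two pure policies over a three-step horizon, starting from State 1 (the dynamics are assumed from the description above):

```python
def always_action_2(steps):
    # Action 2 returns the agent to State 1 with a reward of 5 each step.
    return 5 * steps

def always_action_1(steps):
    # Action 1 earns nothing in States 1-3; the reward of 20 arrives
    # only on the third step, when the agent reaches State 4.
    return 20 if steps >= 3 else 0

short_term = always_action_2(3)  # 5 + 5 + 5 = 15
patient = always_action_1(3)     # 0 + 0 + 20 = 20
```

Over three steps the patient policy earns 20 versus 15, which is exactly the kind of comparison a greedy, immediate-reward agent would get wrong.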

Also Read: Tensorflow Image Classification

The Q learning rule

This brings us to the Q learning rule. In deep Q learning, the neural network needs to take the current state, s, as a vector and return a Q value for each possible action, a, in that state, i.e. it needs to return Q(s, a) for every a. During training, Q(s, a) is updated through the following rule:

Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)]

This update rule needs a bit of unpacking. First, you can see that the new value of Q(s, a) is its existing value plus some extra terms, reading the right-hand side of the equation from left to right. Ignore the α for a moment. Inside the square brackets, the first term is r, which stands for the reward earned for taking action a in state s.

This is the immediate reward; no delayed gratification is involved yet. The next term is the delayed reward estimate. It is multiplied by the value γ, which discounts the effect of the delayed reward and is always between 0 and 1. More on that in a second. The term max_a′ Q(s′, a′) is the maximum Q value available in the next state.

Let’s make things a little easier: the agent starts in state s, takes action a, and finishes in state s′; the code then determines the maximum Q value in state s′, i.e. max_a′ Q(s′, a′). Why is max_a′ Q(s′, a′) taken into consideration? Because it represents the maximum possible future reward available to the agent from state s′.

However, γ discounts this value to reflect the fact that it is not desirable for the agent to wait forever for a future reward; it is better for the agent to aim for the largest reward in the least amount of time. Notice that Q(s′, a′) itself implicitly holds the maximum discounted reward for the state after that, because it in turn was updated with the discounted maximum Q value of its successor state, and so on recursively.

This means the agent selects actions not only on the basis of the immediate reward r, but also on the basis of possible future discounted rewards.
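The update rule above can be sketched in tabular form on the 4-state deferred-reward game. The environment dynamics here are an assumption based on the description: Action 2 (index 1) returns the agent to State 1 with a reward of 5 in States 1-3, Action 1 (index 0) advances one state, and taking Action 1 in State 4 pays 20 while the agent remains in State 4 (an assumption that keeps the game running):

```python
import numpy as np

def step(state, action):
    """Assumed dynamics for the 4-state game (states are 0-indexed)."""
    if action == 1:                  # Action 2: back to the start
        return 0, (5 if state <= 2 else 0)
    if state == 3:                   # Action 1 in State 4: the big prize
        return 3, 20
    return state + 1, 0              # Action 1 elsewhere: advance, no reward

rng = np.random.default_rng(0)
alpha, gamma = 0.1, 0.95
Q = np.zeros((4, 2))                 # the Q(s, a) table, initialised to zero
state = 0
for _ in range(50000):
    action = int(rng.integers(2))    # act randomly: Q learning is off-policy
    next_state, r = step(state, action)
    # The Q learning update rule from the text:
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

greedy_policy = Q.argmax(axis=1)     # best action per state after training
```

With γ = 0.95 the discounted value of reaching and staying in State 4 outweighs the immediate reward of 5, so the learned greedy policy takes Action 1 (index 0) in every state.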

Deep Q learning

Deep Q learning applies the Q learning update rule during the training phase. In other words, a neural network is created that takes state s as its input, and the network is trained to output appropriate Q(s, a) values for each action in state s. The agent's action can then be selected by taking the action with the largest Q(s, a) value, i.e. by taking an argmax of the output of the neural network. This can be seen in the first step of the diagram below:

Action selection and training steps – Deep Q learning

Once this pass through the network has been made and an action has been selected, the agent carries out the action. The agent then receives feedback on what reward is given for taking that action from that state. The next step, in keeping with the Q learning rule, is to train the network. This can be seen in the second part of the diagram above.

The state vector s is the input (x) for network training, and the output training sample (y) is the Q(s, a) vector gathered during the action selection step. However, the Q(s, a) value corresponding to action a is set to a target of r + γ·max_a′ Q(s′, a′), as can be seen in the figure above. By training the network in this way, its Q(s, a) output vector will, over time, get better at telling the agent which action is best for its long-term benefit.
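The training step described above can be sketched with Keras as follows. The network size and the tiny 4-state, one-hot-encoded example are illustrative assumptions, and this is a single training step rather than a full agent loop:

```python
import numpy as np
import tensorflow as tf

num_states, num_actions, gamma = 4, 2, 0.95

# A small network mapping a state vector to one Q value per action.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_states,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_actions),  # linear output: Q(s, a) per action
])
model.compile(optimizer="adam", loss="mse")

def one_hot(s):
    return np.eye(num_states, dtype=np.float32)[[s]]

def train_step(s, a, r, s_next):
    # The network's own predictions form the training target...
    target = model.predict(one_hot(s), verbose=0)
    # ...except the entry for the taken action, set to r + gamma * max_a' Q(s', a').
    target[0, a] = r + gamma * np.max(model.predict(one_hot(s_next), verbose=0))
    model.fit(one_hot(s), target, epochs=1, verbose=0)

# Action selection: argmax over the network's output, then one training step.
s = 0
action = int(np.argmax(model.predict(one_hot(s), verbose=0)))
train_step(s, action, 0.0, s + 1)
```

In practice, deep Q learning implementations usually add an experience replay buffer and a separate, slowly updated target network to stabilise training; both are omitted here for brevity.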

Pros of Reinforcement Learning:

  • Reinforcement learning can be used to solve very challenging problems that cannot be solved by conventional approaches.
  • This approach is geared towards long-term results, which are otherwise very difficult to achieve.
  • This learning pattern is somewhat similar to how human beings learn; hence, it comes close to that ideal.
  • The model can correct mistakes that occur during training.
  • Once the model has corrected a mistake, the chances of the same mistake occurring again are much lower.
  • It can arrive at the best model for the particular problem being solved.

Cons of Reinforcement Learning

  • Reinforcement learning as a framework is wrong in many different respects, but it is precisely this quality that makes it useful.
  • Too much reinforcement learning can lead to an overload of states, which can diminish the results.
  • Reinforcement learning is not preferable for solving simple problems.
  • Reinforcement learning requires a great deal of computation. It is data-hungry. That is why it fits so well with video games: you can play the game over and over again, so getting a lot of data seems feasible.
  • Reinforcement learning assumes the world is Markovian, which it is not. The Markovian model describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

What Next?

If you want to master machine learning and learn how to train an agent to play tic tac toe, train a chatbot, and more, check out upGrad’s Machine Learning & Artificial Intelligence PG Diploma course.


Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.
Frequently Asked Questions (FAQs)

1. What is TensorFlow?

Python, the programming language popularly used in machine learning, comes with a vast library of functions. TensorFlow is one such Python library launched by Google, which supports quick and efficient numerical calculations. It is an open-source library created and maintained by Google that is extensively used to develop Deep Learning models. TensorFlow is also used along with other wrapper libraries for simplifying the process. Unlike some other numerical libraries that are also used in Deep Learning, TensorFlow was developed for both research and development of applications and the production environment functions. It can execute on machines with single CPUs, mobile devices, and distributed computer systems.

2. What are some other libraries like TensorFlow in machine learning?

During earlier days, machine learning engineers used to write all the code for different machine learning algorithms manually. Now writing the same lines of code every time for similar algorithms, statistical and mathematical models was not just time-consuming but also inefficient and tedious. As a workaround, Python libraries were introduced to reuse functions and save time. Python's collection of libraries is vast and versatile. Some of Python's most commonly used libraries are Theano, Numpy, Scipy, Pandas, Matplotlib, PyTorch, Keras, and Scikit-learn, apart from TensorFlow. Python libraries are also easily compatible with C/C++ libraries.

3. What are the advantages of using TensorFlow?

The many advantages of TensorFlow make it a hugely popular option to develop computational models in deep learning and machine learning. Firstly, it is an open-source platform that supports enhanced data visualisation formats with its graphical presentation. Programmers can also easily use it to debug nodes which saves time and eliminates the need to examine the entire length of neural network code. TensorFlow supports all kinds of operations, and developers can build any type of model or system on this platform. It is easily compatible with other programming languages like Ruby, C++ and Swift.
