Reinforcement learning is when a learning agent learns to behave optimally according to its environment through constant interactions. The agent goes through various situations, which are also known as states. As you would’ve guessed, reinforcement learning has many applications in our world. Learn more if you are interested to learn more about data science algorithms.
Also, it has many algorithms, among the most popular ones is Q learning. In this article, we’ll be discussing what this algorithm is and how it works.
So, without further ado, let’s get started.
What is Q Learning?
Q learning is a reinforcement learning algorithm, and it focuses on finding the best course of action for a particular situation. It’s off policy because the actions the Q learning function learns from are outside the existing policy, so it doesn’t require one. It focuses on learning a policy that increases its total reward. It’s a simple form of reinforcement learning that uses action values (or Q-values) to enhance the learning agent’s behaviour.
Q learning is one of the most popular algorithms in reinforcement learning, as it’s effortless to understand and implement. The ‘Q’ in Q learning represents quality. As we mentioned earlier, Q learning focuses on finding the best action for a particular situation. And the quality shows how useful a specific action is and what reward it can help you in reaching.
Before we begin discussing how it works, we should first take a look at some essential concepts of q learning. Let’s get started.
Q-values are also known as Action-values. They are represented by Q(S, A), and they give you an estimate of how good the action A is to take at the state S. The model will compute this estimation iteratively by using the Temporal Difference Update rule we’ve discussed later in this section.
Episodes and Rewards
An agent begins from a start state, goes through several transitions, and then moves from its current state to the next one according to its actions and its environment. Whenever the agent takes action, it gets some reward. And when there are no transitions possible, it’s the completion of the episode.
TD-Update (Temporal Difference)
Here’s the TD-Update or Temporal Difference rule:
Q(S,A) Q(S,A) + (R +Q(S’,A’)-Q(S,A))
Here, S represents the agent’s current state, whereas S’ represents the next state. A represents the current action, A’ represents the following best action according to the Q-value estimation, R shows the current reward according to the present action, stands for the discounting factor, and shows the step length.
Example of Q Learning Python
The best way to understand Q learning Python is to see an example. In this example, we are using the gym environment of OpenAI and train our model with it. First off, you’ll have to install the environment. You can do so with the following command:
pip install gym
Now, we’ll import the libraries we’ll need for this example:
import numpy as np
import pandas as pd
from collections import defaultdict
from windy_gridworld import WindyGridworldEnv
Without the necessary libraries, you wouldn’t be able to perform these operations successfully. After we’ve imported the libraries, we will create the environment:
env = WindyGridworldEnv()
Now we’ll create the -greedy policy:
def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
Creates an epsilon-greedy policy based
on a given Q-function and epsilon.
Returns a function that takes the state
as an input and returns the probabilities
for each action in the form of a numpy array
of the length of the action space(set of possible responses).
Action_probabilities = np.ones(num_actions,
dtype = float) * epsilon / num_actions
best_action = np.argmax(Q[state])
Action_probabilities[best_action] += (1.0 – epsilon)
Here’s the code for building a q-learning model:
def qLearning(env, num_episodes, discount_factor = 1.0,
alpha = 0.6, epsilon = 0.1):
Q-Learning algorithm: Off-policy TD control.
Finds the optimal greedy policy while improving
following an epsilon-greedy policy”””
# Action value function
# A nested dictionary that maps
# state -> (action -> action-value).
Q = defaultdict(lambda: np.zeros(env.action_space.n))
# Keeps track of useful statistics
stats = plotting.EpisodeStats(
episode_lengths = np.zeros(num_episodes),
episode_rewards = np.zeros(num_episodes))
# Create an epsilon greedy policy function
# appropriately for environment action space
policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)
# For every episode
for ith_episode in range(num_episodes):
# Reset the environment and pick the first action
state = env.reset()
for t in itertools.count():
# get probabilities of all actions from current state
action_probabilities = policy(state)
# choose action according to
# the probability distribution
action = np.random.choice(np.arange(
p = action_probabilities)
# take action and get reward, transit to next state
next_state, reward, done, _ = env.step(action)
# Update statistics
stats.episode_rewards[i_episode] += reward
stats.episode_lengths[i_episode] = t
# TD Update
best_next_action = np.argmax(Q[next_state])
td_target = reward + discount_factor * Q[next_state][best_next_action]
td_delta = td_target – Q[state][action]
Q[state][action] += alpha * td_delta
# done is True if episode terminated
state = next_state
return Q, stats
Let’s train the model now:
Q, stats = qLearning(env, 1000)
After we’ve created and trained the model, we can plot the essential stats of the same:
Use this code to run the model and plot the graph. What kind of results do you see? Share your results with us, and if you face any confusion or doubts, let us know.
Also read: Machine Learning Algorithms for Data Science
When you plot the graph, you’ll see that the reward per episode increases progressively over time. And after certain episodes, the plot also reflects that it levels out the high reward limit per episode. What does this indicate?
It means your model has learned to increase the total reward it can earn in an episode by ensuring that it behaves optimally. You must’ve also seen why q learning Python sees applications in so many industries and areas.
If you are curious to learn about data science algorithms, want to learn more about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.