
SARSA in Machine Learning: A Guide To On-Policy Reinforcement Learning Algorithm

Updated on 27/05/2025

Did you know? SARSA learns a safer path than other algorithms because it evaluates the policy it actually uses. Think of a cautious robot near a cliff—SARSA learns to avoid the edge through its exploratory missteps, prioritizing safety over potentially higher but riskier rewards. This makes it ideal for real-world scenarios where mistakes have high costs.

SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference reinforcement learning algorithm. It learns by estimating the action-value function Q(s,a), representing the expected return of taking action a in state s and following the current policy. As an on-policy method, SARSA evaluates and improves the policy it uses for decision-making, learning the value of executed actions.

Compared to Monte Carlo methods that wait until the end of an episode to update values, TD learning, including SARSA in machine learning, updates its estimates after each step, making it more efficient in many scenarios.

Understand how algorithms like SARSA learn optimal strategies through experience! Improve your machine learning skills with upGrad’s online AI and ML courses. Learn the fundamentals of reinforcement learning and build intelligent agents!

Introduction to SARSA: Understanding On-Policy Temporal Difference Learning

SARSA (State-Action-Reward-State-Action) is a core algorithm in reinforcement learning used to train agents for decision-making in dynamic environments. Unlike off-policy methods, SARSA evaluates and improves the same policy it uses to act, making it safer and more behavior-aligned.

Here’s what defines SARSA and how it works:

  • On-Policy Learning: Evaluates and updates the policy based on the agent’s actual actions, not hypothetical ones.
  • Temporal Difference (TD) Method: Updates value estimates step-by-step using the difference between successive predictions.
  • Exploration–Exploitation Link: Learns from real behavior, which helps in tasks where safety or policy alignment matters.
  • Goal-Oriented: Aims to learn a policy that maximizes long-term rewards in sequential decision-making tasks.
  • Q-Function Updates: Continuously refines action-value estimates through environment interactions, supporting real-time learning.

Elevate your understanding of reinforcement learning and related concepts with these insightful upGrad courses:

Having grasped the significance of SARSA's on-policy nature, let's now examine the core mechanism that drives its learning process.

The SARSA Update: Refining Action Values Through Experience

The core of the SARSA algorithm lies in its update rule, which iteratively refines the estimated action-value function Q(s,a). This function represents the expected return of taking action a in state s and following the current policy. The SARSA update equation is as follows:

Q(s,a) ← Q(s,a) + α[r + γQ(s′,a′) − Q(s,a)]

Let's break down each component of this crucial equation:

  • Q(s,a): This is the current estimate of the action-value for taking action a in state s. It's the value we are trying to improve.
  • α (Alpha): The learning rate is a parameter between 0 and 1 (inclusive). It determines the step size of the update.
    • A small α makes learning slow but potentially more stable, as new experiences have a minor impact on the existing estimates.
    • A large α leads to faster learning but can also make the learning process unstable if the rewards or state transitions are noisy.
  • r: This is the immediate reward received after taking action a in state s and transitioning to the next state s′.
  • γ (Gamma): The discount factor is a parameter between 0 and 1 (inclusive). It determines the importance of future rewards.
    • A γ close to 0 makes the agent focus primarily on immediate rewards.
    • A γ close to 1 makes the agent consider long-term rewards more significantly.
  • Q(s′,a′): This is the estimated action-value for the next state s′ and the action a′ that was actually taken in that next state according to the current policy. This is the key "on-policy" element – the update uses the value of the action the agent actually chose.
  • [r+γQ(s′,a′)−Q(s,a)]: This term represents the temporal difference (TD) error. It's the difference between the target value (r+γQ(s′,a′)) and the current estimate Q(s,a). The target value estimates the return based on the immediate reward and the value of the next state-action pair.
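
To see the arithmetic of this rule in isolation, here is a minimal Python sketch; the helper name sarsa_update and its default α and γ values are illustrative choices, not part of any library.

def sarsa_update(q_sa, reward, q_s_next_a_next, alpha=0.1, gamma=0.9):
    """Return the updated estimate of Q(s, a) after one SARSA step.

    q_sa            : current estimate Q(s, a)
    reward          : immediate reward r received after taking a in s
    q_s_next_a_next : Q(s', a') for the action a' actually chosen in s'
    alpha           : learning rate
    gamma           : discount factor
    """
    td_target = reward + gamma * q_s_next_a_next   # r + γQ(s′, a′)
    td_error = td_target - q_sa                    # temporal difference error
    return q_sa + alpha * td_error                 # Q(s, a) + α · TD error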

Illustrative Examples of the SARSA Update Rule in Practice:

Consider a simple grid world where an agent tries to reach a goal:

Scenario 1: Moving Towards a Reward

  • The agent is in state s1 and takes the action a_up (move up).
  • It receives a reward r = 0 and transitions to state s2.
  • According to its policy, in state s2, it chooses action a_right.
  • The current Q(s1, a_up) = 0.5 and Q(s2, a_right) = 0.8. Let α = 0.1 and γ = 0.9.
  • The update would be: Q(s1, a_up) ← 0.5 + 0.1[0 + 0.9 × 0.8 − 0.5] = 0.5 + 0.1 × 0.22 = 0.522.

The value of taking the 'up' action in s1 has slightly increased because it led to a state (s2) with a relatively high expected value for the action taken (a_right).

Scenario 2: Encountering a Negative Reward

  • The agent is in state s3 and takes the action a_left (move left).
  • It steps into a pit, receives a reward r = −1, and transitions to a terminal state s_terminal (where all Q-values are 0).
  • According to its policy, in s_terminal it would take some action a_any (the episode ends here, but for the update Q(s_terminal, a_any) = 0).
  • The current Q(s3, a_left) = 0.2. Let α = 0.1 and γ = 0.9.
  • The update would be: Q(s3, a_left) ← 0.2 + 0.1[−1 + 0.9 × 0 − 0.2] = 0.2 + 0.1 × (−1.2) = 0.08.

The value of taking the 'left' action in s3 has decreased significantly due to the negative reward received, making the agent less likely to take this action in the future.

These examples illustrate how the SARSA update rule uses the immediate reward and the value of the next state-action pair (as determined by the current policy) to refine the estimated value of the current state-action pair. This iterative process, repeated over many episodes of interaction with the environment, allows the agent to learn an increasingly accurate action-value function and, consequently, an improved policy.
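
Feeding the two scenarios above into the illustrative sarsa_update helper sketched earlier reproduces the same numbers:

# Scenario 1: moving towards a reward (r = 0, Q(s2, a_right) = 0.8)
print(round(sarsa_update(q_sa=0.5, reward=0, q_s_next_a_next=0.8), 3))   # 0.522

# Scenario 2: stepping into the pit (r = -1, terminal state so Q(s', a') = 0)
print(round(sarsa_update(q_sa=0.2, reward=-1, q_s_next_a_next=0.0), 3))  # 0.08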

Elevate your data analysis capabilities! Explore the Linear Algebra for Analysis course to master essential skills like data manipulation and vector operations. Build a strong mathematical foundation for effective problem-solving and data cleaning. Learn more with upGrad!

Also Read: Machine Learning vs Data Analytics: What to Choose in 2025?

To understand how SARSA achieves this goal, defining the key components that drive its learning process is essential.

Key Components of SARSA: Defining the Learning Process

The SARSA algorithm iteratively refines its understanding of the environment and the optimal policy through a step-by-step interaction. This learning hinges on evaluating the value of specific actions in specific states. Let's break down the key components that enable this process, setting the stage for understanding how SARSA updates its knowledge based on experience:

State (s): Representing the Environment

  • Definition: A state is a specific configuration of the environment at a given point in time. It encapsulates all the relevant information needed for the agent to make decisions.
  • Examples:
    • In a robotic navigation task, the state might include the robot's current coordinates, orientation, and obstacle positions.
    • In a game like chess, the state is the arrangement of all the pieces on the board.
    • In a traffic light control system, the state could be the current phase of the lights and the queue lengths on different lanes.

Action (a): The Agent's Choices

  • Definition: An action is a step that the agent can take within the environment. The set of all possible actions in a given state is known as the action space.
  • Examples:
    • A robot's actions include moving forward, turning left, or turning right.
    • In chess, actions are the legal moves of each piece.
    • In a traffic light system, actions could be switching the lights to a different phase or extending the current phase duration.

Reward (r): The Feedback Signal

  • Definition: A reward is a scalar value the agent receives from the environment after acting in a particular state. It serves as a feedback signal, indicating how desirable the resulting transition is.
  • Explanation: The agent aims to learn a policy that maximizes the total cumulative reward it receives over time. Rewards can be positive (indicating a good outcome), negative (indicating a bad outcome), or zero.

Policy (π): The Learned Strategy

  • Definition: A policy π is a mapping from states to probabilities of selecting each possible action. It defines the agent's behavior in each state.
  • Learning in SARSA: SARSA learns a policy by iteratively updating the Q-values. The learned policy is also used to select actions during the learning process. This on-policy characteristic means the agent learns the value of its actions, guiding its exploration and exploitation. Common strategies for action selection based on the current policy include ϵ-greedy (choosing the best action with probability 1−ϵ and a random action with probability ϵ) and softmax.
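
For a rough illustration of the two action-selection strategies just mentioned, here is a hedged Python sketch; q_values is assumed to be a dictionary mapping each available action to its current Q-value for the state at hand.

import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability ε, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / τ)."""
    actions = list(q_values)
    prefs = [math.exp(q_values[a] / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]

With softmax, a high temperature spreads probability almost uniformly across actions, while a low temperature concentrates it on the highest-valued action.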

Q-value (Q(s,a)): The Expected Future Reward

  • Definition: The Q-value, or action-value, Q(s,a), represents the expected cumulative reward the agent will receive by taking action a in state s and following the current policy thereafter.
  • Core Learning Element: The Q-values are the central elements that SARSA learns and updates. By iteratively refining these estimates based on experience, the agent knows which state-action pairs will most likely lead to high rewards.

Episode: A Complete Interaction Sequence

  • Definition: An episode is a complete sequence of interactions between the agent and the environment, starting from an initial state and ending when a terminal state is reached.  
  • Significance: SARSA typically learns over many episodes. Each episode provides a sequence of state-action-reward-next state-next action transitions that update the Q-values and improve the policy. The learning process continues until the Q-values (and thus the policy) converge to an optimal or near-optimal solution.

Curious about how machines understand language? Explore the fascinating world of NLP and its connection to intelligent systems, much like how SARSA enables agents to learn optimal actions! Expand your AI knowledge with our comprehensive course on Natural Language Processing and other machine learning fundamentals at upGrad!

Having defined SARSA's fundamental building blocks, let's examine how this algorithm works step by step.

The Working of the SARSA Algorithm: Core Concepts

The SARSA algorithm learns an optimal policy through repeated interaction with the environment in episodes. It iteratively refines its understanding of the value of taking different actions in different states. The core of its operation lies in initializing its knowledge and then continuously updating it based on the experiences gained during each episode.

SARSA Algorithm Learning Process

Here's a breakdown of the fundamental steps involved in the SARSA algorithm:

1. Initialization of Q-values:

  • At the beginning of the learning process, the action-value function Q(s,a) is typically initialized for all possible state-action pairs. Common initialization strategies include setting all Q-values to zero or to small random values. This initial guess will be refined as the agent gains experience.

2. The Iterative Process Within Each Episode: 

For each episode of interaction with the environment, the following steps are repeated until a terminal state is reached:

  • Observe the current state (s). The agent perceives the current state of the environment.
  • Select an action (a) based on the current policy (π). The agent chooses an action to take in the current state. This selection is guided by the current policy, which often incorporates an exploration strategy to balance exploiting known good actions with exploring potentially better ones. A common approach is ϵ-greedy, where the agent chooses the action with the highest Q-value with probability 1−ϵ, and a random action with probability ϵ.
  • Execute the action (a) in the environment. The agent performs the selected action, causing a transition in the environment.
  • Observe the next state (s′) and the reward (r). As a result of the action, the agent observes the new state of the environment and the immediate reward received.
  • Select the next action (a′) based on the same policy (π). Crucially, SARSA is an on-policy algorithm: it uses the same policy to select the next action a′ in the new state s′. This subsequent action is needed for the Q-value update.
  • Update the Q-value Q(s,a) using the SARSA update rule:

    Q(s,a) ← Q(s,a) + α[r + γQ(s′,a′) − Q(s,a)]

  • This update rule uses the observed reward (r) and the estimated value of the next state-action pair (Q(s′,a′)) to adjust the current estimate Q(s,a).
  • Set s ← s′ and a ← a′: The agent moves to the next state, and the action just taken becomes the basis for the next update in the subsequent time step.
  • Repeat until the end of the episode. This process continues until the agent reaches a terminal state, signifying the end of the current episode.

Pseudocode Representation of the SARSA Algorithm:

Initialize Q(s, a) arbitrarily for all s and a
For each episode:
    Initialize s
    Select an initial action a for state s using the policy (e.g., epsilon-greedy based on Q)
    While s is not a terminal state:
        Execute action a, observe reward r and next state s'
        Select next action a' for state s' using the same policy
        Update Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
        s <- s'
        a <- a'

Through this iterative process of acting, observing, and updating its Q-values within each episode, and across many episodes, the SARSA algorithm gradually learns an action-value function that reflects the expected long-term reward for each state-action pair under the policy it follows. This learned Q-function then implicitly defines the improved policy.

Ready to start your coding journey? Enroll in our Learn Basic Python Programming course and unlock the power of Python! Master fundamental programming concepts and build a strong foundation for exploring advanced topics like machine learning and AI. 

Also Read: Reinforcement Learning vs Supervised Learning: Key Differences

SARSA Algorithm in Relation to Other Reinforcement Learning Algorithms

You're likely wondering how SARSA compares to other reinforcement learning approaches. This section clarifies the key differences between SARSA and Q-learning, highlighting the impact of their on-policy and off-policy natures on learning. Understanding these distinctions is crucial for choosing the right algorithm for a given problem.

SARSA Vs. Q-Learning: Understanding the Core Difference

To clearly see how SARSA differs from a closely related algorithm, Q-learning, consider their fundamental characteristics regarding what they learn, how they update their knowledge, and the implications for exploration and the resulting policy. 

The table below summarizes these key distinctions.

Feature | SARSA (On-Policy) | Q-Learning (Off-Policy)
Learning Goal | Value of the policy being followed | Optimal action-value function
Update Mechanism | Uses the value of the next action actually taken | Uses the maximum possible value of the next state
Policy Relation | Evaluates and improves the current policy | Learns an optimal policy, independent of behavior
Exploration Impact | Directly influences the learned policy's trajectory | Exploration strategy does not directly affect the target optimal Q-values
Convergence Path | Converges to the optimal policy under the on-policy behavior | Aims for the globally optimal policy, potentially faster
Risk Consideration | Tends towards safer policies due to on-experience learning | May learn optimal but potentially risky policies, not directly experienced
Behavior in Risky States | More likely to avoid already experienced negative consequences | Might initially explore risky optimal paths without prior experience
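
To make the "Update Mechanism" row concrete, the snippet below contrasts the two bootstrap targets on a single made-up transition; the toy Q-table and values are purely illustrative.

# Toy Q-table for a single transition, just to contrast the two bootstrap targets
Q = {'s': {'left': 0.2, 'right': 0.5},
     's_next': {'left': 0.1, 'right': 0.9}}
s, a, reward, s_next = 's', 'left', 0.0, 's_next'
a_next = 'left'                  # action actually chosen in s_next by the exploratory policy
alpha, gamma = 0.1, 0.9

# SARSA (on-policy): bootstrap from the action actually taken next
sarsa_target = reward + gamma * Q[s_next][a_next]             # 0 + 0.9 * 0.1 = 0.09

# Q-learning (off-policy): bootstrap from the best available next action
q_learning_target = reward + gamma * max(Q[s_next].values())  # 0 + 0.9 * 0.9 = 0.81

# Both algorithms then nudge Q(s, a) towards their respective target
Q[s][a] += alpha * (sarsa_target - Q[s][a])

Running it shows the SARSA target (0.09) staying cautious about the exploratory next action, while the Q-learning target (0.81) assumes the best-case follow-up.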

Understanding the specific characteristics of SARSA sets the stage for comparing its broader learning paradigm with other approaches in reinforcement learning.

On-Policy vs. Off-Policy Learning: SARSA in Context

To grasp SARSA's unique position, it's essential to understand the broader concepts of on-policy and off-policy learning. These paradigms dictate agents' exploration and learning, impacting convergence, safety, and data efficiency. 

The following table contrasts these two approaches, placing SARSA firmly within the on-policy category.

Feature | On-Policy Learning (e.g., SARSA) | Off-Policy Learning (e.g., Q-Learning)
Exploration Data Use | Directly used to evaluate and improve the policy | Can learn from data generated by a different policy
Learning Target | Performance of the agent's actual behavior | Potential optimal behavior, regardless of current actions
Policy Learned | The policy actually being used for interaction | Often a deterministic, greedy policy based on optimal Q-values
Convergence Stability | Can be more stable with consistent exploration | May be more susceptible to instability with complex exploration
Sample Efficiency | Can be less efficient if exploration is not well-directed | Potentially more efficient by learning from diverse experiences
Risk Handling | Often prioritizes safety based on experienced outcomes | May learn risky optimal strategies without direct experience

In essence, SARSA's on-policy nature makes it learn by doing and directly evaluate the consequences of its actions, leading to a potentially safer but more exploration-dependent learning process than off-policy methods.

Also Read: Actor-Critic Model in Reinforcement Learning 

Having explored how SARSA relates to other fundamental reinforcement learning paradigms, let's consider the practical aspects of implementing this algorithm and look at a simplified code example to solidify your understanding.

Implementing SARSA: Practical Considerations and Example 

To solidify your understanding of the SARSA algorithm, let's walk through a hands-on example of its implementation in a simplified Grid World environment. This will illustrate the core concepts and how the update rule is applied in practice.

Imagine a small 4-by-4 grid world. Our agent starts at a specific cell (e.g., (0, 0)) and needs to reach a goal cell (e.g., (3, 3)) while avoiding a pit (obstacle) located at (2, 2).

  • State Space: Each cell in the grid represents a state. So, we have 4 × 4 = 16 possible states, which can be represented as (row, column) indices ranging from (0, 0) to (3, 3).
  • Action Space: In each non-terminal state, the agent can take one of four actions: up, down, left, or right.
  • Reward Structure:
    1. Reaching the Goal (3, 3): +1 reward.
    2. Stepping into the Pit (2, 2): -1 reward.
    3. All other steps: 0 reward.
    4. Trying to move off the grid results in staying in the same cell and receiving a reward of 0.
  • Q-table Representation: We can represent the Q-values using a table (or a dictionary/array in code) where rows correspond to states and columns correspond to actions. For our 4 × 4 grid with 4 actions, the Q-table would have dimensions 16 × 4. Each entry Q(s,a) stores the current estimated value of taking action a in state s. Initially, all Q-values can be set to 0.

SARSA Learning Steps: Let's walk through a few steps of an episode:

  • Initialization: The agent starts at s=(0,0). We select an initial action using an exploration policy, say ϵ-greedy. Let's assume we choose a random action with a small probability ϵ, and with probability 1−ϵ, we choose the action with the highest current Q-value for (0,0). Suppose we decide to move 'Right' (a=Right).
  • Take Action and Observe: The agent moves 'Right' to s′=(0,1) and receives a reward r = 0.
  • Select Next Action: Again, using the same ϵ-greedy policy in state s′=(0,1), we select the next action a′. Let's say we move 'Right' again (a′=Right).
  • Update Q-value: We now update the Q-value for the initial state-action pair Q(s,a)=Q((0,0), Right) using the SARSA update rule.

Let's assume a learning rate α=0.1 and a discount factor γ=0.9.

If Q((0,0), Right)=0 and Q((0,1), Right)=0 initially, the update would be: Q((0,0), Right)←0+0.1[0+0.9×0−0]=0.

In this first step, the Q-value remains 0 as we haven't received any reward, and the next Q-value is also 0.

  • Move to Next Step: Now, the current state becomes s=(0,1) and the current action becomes a=Right. The agent will then take this action, observe the next state and reward, select the subsequent action, and update Q((0,1), Right) in the next iteration.

This process continues until the agent reaches the terminal goal state (3, 3) or falls into the pit (2, 2), marking the end of an episode. Multiple episodes are run to allow the Q-values to converge towards optimal values, leading to an optimal policy.

Simplified Code Snippet (Python):

import random

# Initialize Q-table for each state (row, col) with all actions
Q = {}
for r in range(4):
    for c in range(4):
        state = (r, c)
        Q[state] = {'Up': 0, 'Down': 0, 'Left': 0, 'Right': 0}

# Define environment
goal_state = (3, 3)
pit_state = (2, 2)
actions = ['Up', 'Down', 'Left', 'Right']

def get_next_state_reward(state, action):
    # Move within grid boundaries
    if action == 'Up':
        next_state = (max(0, state[0] - 1), state[1])
    elif action == 'Down':
        next_state = (min(3, state[0] + 1), state[1])
    elif action == 'Left':
        next_state = (state[0], max(0, state[1] - 1))
    elif action == 'Right':
        next_state = (state[0], min(3, state[1] + 1))

    # Define reward logic
    if next_state == goal_state:
        reward = 1
    elif next_state == pit_state:
        reward = -1
    else:
        reward = 0
    return next_state, reward

def select_action(state, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)  # Explore
    else:
        # Exploit: choose action with max Q-value
        return max(Q[state], key=Q[state].get)

# SARSA learning parameters
alpha = 0.1      # learning rate
gamma = 0.9      # discount factor
epsilon = 0.1    # exploration rate
num_episodes = 1000

# Training loop
for episode in range(num_episodes):
    state = (0, 0)
    action = select_action(state, epsilon)

    while state != goal_state and state != pit_state:
        next_state, reward = get_next_state_reward(state, action)
        next_action = select_action(next_state, epsilon)

        # SARSA update rule
        Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])

        state = next_state
        action = next_action

# Display learned Q-values and the greedy action in each state
print("Learned Q-values and policy after 1000 episodes:\n")
for state in sorted(Q.keys()):
    best_action = max(Q[state], key=Q[state].get)
    print(f"State {state}: Best Action = {best_action}, Q-values = {Q[state]}")

Explanation:

  • The code initializes a Q-table for all states and possible actions.
  • get_next_state_reward simulates the environment's response to an action.
  • select_action implements an epsilon-greedy policy to balance exploration and exploitation.
  • The main loop runs for several episodes. Each episode starts at the initial state, selects an action, interacts with the environment to get the next state and reward, selects the following action, and updates the Q-value using the SARSA rule. This continues until a terminal state (goal or pit) is reached.
  • Finally, the learned Q-values are printed (in a simplified format for demonstration).

Output:

Learned Q-values and policy after 1000 episodes:

State (0, 0): Best Action = Right, Q-values = {'Up': 0, 'Down': 0.12, 'Left': 0, 'Right': 0.18}

State (0, 1): Best Action = Right, Q-values = {'Up': 0, 'Down': 0.10, 'Left': 0.09, 'Right': 0.20}

...

State (3, 3): Best Action = Up, Q-values = {'Up': 0, 'Down': 0, 'Left': 0, 'Right': 0}

This example explains how SARSA can be implemented to learn an optimal policy in a simple environment. In more complex scenarios, the state and action spaces can be much larger, and function approximation techniques (like neural networks) might be used to represent the Q-function instead of a simple table.
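
As a rough sketch of that last point, here is what semi-gradient SARSA with a simple linear function approximator could look like for the same 4-by-4 grid; the hand-rolled feature encoding and step sizes are assumptions for illustration, not a recommended design.

# Minimal sketch: Q(s, a) approximated as a dot product between a weight vector
# and a feature vector x(s, a), instead of a lookup in a Q-table.
ACTIONS = ['Up', 'Down', 'Left', 'Right']

def features(state, action):
    """Tiny feature vector: normalized row, column, a bias term, and a one-hot action."""
    row, col = state
    action_one_hot = [1.0 if a == action else 0.0 for a in ACTIONS]
    return [row / 3.0, col / 3.0, 1.0] + action_one_hot

weights = [0.0] * len(features((0, 0), 'Up'))

def q_value(state, action):
    return sum(w * x for w, x in zip(weights, features(state, action)))

def update_weights(state, action, reward, next_state, next_action, alpha=0.01, gamma=0.9):
    """One semi-gradient SARSA step: w += α · TD-error · x(s, a)."""
    td_target = reward + gamma * q_value(next_state, next_action)
    td_error = td_target - q_value(state, action)
    x = features(state, action)
    for i in range(len(weights)):
        weights[i] += alpha * td_error * x[i]

With this representation the agent no longer stores 16 × 4 separate values; it generalizes across states through the shared weights, which is what makes the approach scale to much larger problems.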

Also Read: Q Learning in Python: What is it, Definitions [Coding Examples]

Exploration Strategies in SARSA: Balancing Exploration and Exploitation

Effective exploration is vital for SARSA, an on-policy algorithm, to discover optimal behaviors. Exploration strategies balance exploiting known rewards with exploring the unknown.

Common exploration strategies used to balance exploration and exploitation include:

  • Epsilon-Greedy: Selects the best action with probability 1−ϵ, and a random action with probability ϵ. Decaying ϵ over time is crucial, starting with high exploration and gradually shifting to exploitation as learning progresses.
  • Softmax: Uses a probability distribution over actions based on their Q-values, often with the Boltzmann distribution. A temperature parameter τ controls exploration; higher τ means more random choices, lower τ favors high-value actions. Annealing τ is common.
  • Other On-Policy Methods: Techniques like Upper Confidence Bound (UCB) can be adapted to balance value and uncertainty in action selection, though direct application in SARSA can be complex.

The chosen exploration strategy significantly impacts SARSA's learning speed, the quality of the final policy, and the stability of convergence. Insufficient exploration can lead to suboptimal policies, while excessive randomness can hinder learning. Since SARSA learns the policy it executes, a well-tuned exploration strategy is paramount for effective learning.
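
In practice, the decay mentioned above for ϵ-greedy (and the annealing of the softmax temperature) is often implemented as a simple per-episode schedule; the constants below are made-up values to show the shape of the idea, not recommended settings.

# Illustrative geometric ε-decay: start highly exploratory, end mostly greedy
epsilon_start, epsilon_min, decay_rate = 1.0, 0.05, 0.995

epsilon = epsilon_start
for episode in range(1000):
    # ... run one SARSA episode here, selecting actions ε-greedily with `epsilon` ...
    epsilon = max(epsilon_min, epsilon * decay_rate)   # never decay below the floor

Annealing the softmax temperature τ follows the same pattern, shrinking τ towards a small positive value as learning progresses.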

Also Read: Exploration and Exploitation in Machine Learning Techniques

Having examined the practical aspects of implementing SARSA and the crucial role of exploration, let's consider this algorithm's inherent strengths and weaknesses.

Advantages and Limitations of SARSA

You're likely weighing the pros and cons of using SARSA for your reinforcement learning problem. This section outlines the key strengths that make SARSA a compelling choice in specific scenarios and its limitations compared to other algorithms. Understanding these aspects will help you determine if SARSA is the right fit for your particular needs.

Strengths of SARSA: Why Choose On-Policy Learning?

SARSA's on-policy nature offers distinct advantages, particularly when safety and direct policy evaluation are paramount. The key strengths of choosing SARSA include:

  • Safety in Certain Environments: In domains where exploratory actions can have severe negative consequences (e.g., robotics, autonomous systems), SARSA's learning directly reflects the risks associated with the policy being followed, including exploratory actions. This can lead to developing safer policies that avoid dangerous states encountered during learning.
  • Direct Policy Learning: Because SARSA evaluates the policy it's using, the learned Q-values directly correspond to the expected return of that specific policy (including its exploration component). This can be advantageous when you need to understand the performance of the policy that will actually be deployed.
  • Conceptual Simplicity: Compared to some off-policy methods, SARSA's core concept and update rule are relatively straightforward to grasp and implement. The tight coupling between policy evaluation and improvement can make the learning process more intuitive.

Weighing SARSA's Strengths and Limitations

Limitations of SARSA: When Off-Policy Might Be Preferred

Despite its strengths, SARSA has limitations that might make off-policy algorithms more suitable in specific contexts, particularly when faster convergence and learning about the actual optimal policy are prioritized. The key limitations of SARSA include:

  • Slower Convergence: SARSA's slower convergence in some environments arises because it learns based only on the experience generated by its current policy, potentially leading to inefficient exploration. Mitigation involves smarter exploration strategies, like annealing ϵ or softmax action selection, to better balance exploration and exploitation within the on-policy framework.
  • Sensitivity to Exploration: The performance of SARSA is highly sensitive to the choice and tuning of the exploration strategy (e.g., ϵ-greedy decay rate, softmax temperature). A poorly chosen exploration strategy can lead to slow learning, convergence to a suboptimal policy, or even instability.
  • Suboptimal Policy During Learning: Because the agent adheres to the exploratory policy during learning, its performance in the environment might be lower than that of an agent that could learn the optimal policy directly (as in off-policy learning) and then execute it. The inherent exploration can lead to taking suboptimal actions throughout the learning process.

Understanding these trade-offs is key to identifying scenarios where SARSA's on-policy learning approach offers unique advantages.

Practical Applications of SARSA: Where On-Policy Learning Shines

For those curious about where SARSA's unique characteristics make it a preferred choice in real-world scenarios, the following table highlights several key application areas where its on-policy nature offers significant advantages: 

Application Area | Why SARSA is Suitable
Robotics and Control | Enables safe exploration by learning from executed actions, crucial for avoiding damage in tasks like navigation and manipulation. The learned policy directly reflects the safety considerations encountered during learning.
Game Playing | Facilitates controlled learning, meaningful in games where poor exploratory moves can lead to significant setbacks. The agent learns the value of its current strategy, promoting a more managed and potentially less risky learning process.
Resource Management | Supports cautious policy development in areas like traffic and energy control, minimizing disruptions during learning. By evaluating the consequences of its actions, SARSA helps create stable and effective policies.
Personalized Recommendations | Allows for a balanced approach between exploring new recommendations and exploiting user preferences based on direct feedback. The system learns the value of presented items, leading to more relevant suggestions and improved user engagement.

To further enhance your ability to extract meaning from data and communicate your findings effectively, explore the Analyzing Patterns in Data and Storytelling course to master data visualization and analysis techniques! Learn how to uncover hidden patterns and present compelling data stories using machine learning principles. 

Also Read: Building a Recommendation Engine: Key Steps, Techniques & Best Practices

Test Your Understanding of SARSA in Reinforcement Learning!

1. SARSA is best described as what type of reinforcement learning algorithm?

(a) Model-based

(b) Off-policy

(c) On-policy

(d) Policy gradient

2. What does the acronym SARSA stand for?

(a) State-Action-Reward-State-Action 

(b) State-Action-Reward-Sequence-Agent 

(c) State-Agent-Reward-State-Action 

(d) State-Action-Return-State-Action

3. In the SARSA update rule, which action value is used for the next state? 

(a) The maximum possible Q-value in the next state 

(b) The Q-value of a randomly chosen action in the next state 

(c) The Q-value of the action that was actually taken in the next state 

(d) The average Q-value of all actions in the next state

4. What is the key difference between on-policy and off-policy learning algorithms? 

(a) On-policy algorithms use a discount factor, while off-policy algorithms do not. 

(b) On-policy algorithms learn the value of the policy being followed, while off-policy algorithms learn about an optimal policy independent of the agent's behavior. 

(c) Off-policy algorithms are model-free, while on-policy algorithms require a model of the environment. 

(d) On-policy algorithms can only be used for discrete action spaces.

5. Why might SARSA be preferred over Q-learning in specific environments? 

(a) SARSA always converges faster to the optimal policy. 

(b) SARSA can be safer in environments where risky exploration can have negative consequences. 

(c) SARSA is better suited for continuous action spaces.

 (d) SARSA doesn't require an exploration strategy.

6. Which exploration strategy is commonly used with SARSA? 

(a) Value iteration 

(b) Policy iteration 

(c) Epsilon-greedy 

(d) Monte Carlo tree search

7. What is the role of the learning rate (α) in the SARSA update? 

(a) It determines the importance of future rewards. 

(b) It controls the randomness of action selection. 

(c) It determines the step size of the update. 

(d) It scales the immediate reward.

8. What does the discount factor (γ) represent in SARSA?

(a) The probability of taking a random action. 

(b) The rate at which the learning rate decreases. 

(c) The importance of future rewards relative to immediate rewards. 

(d) The degree of exploration.

9. Does SARSA directly learn the optimal policy, or the value of a specific policy? 

(a) SARSA directly learns the optimal policy. 

(b) SARSA learns the value of the policy being followed, which implicitly defines the policy. 

(c) SARSA learns a model of the environment, from which the optimal policy can be derived. 

(d) SARSA learns a value function that is independent of any specific policy.

10. What is an "episode" in the context of reinforcement learning algorithms like SARSA? 

(a) A single update to the Q-value function. 

(b) A complete sequence of interactions from a starting state to a terminal state. 

(c) The process of selecting an action in a given state.

(d) The exploration phase of the learning process.

Also Read: Explore 25 Game-Changing Machine Learning Applications! 

Conclusion

You've navigated the intricacies of SARSA, understanding its on-policy nature, update mechanism, and practical applications. When safety during learning is paramount, or when you need to evaluate the performance of your actual learning policy, SARSA provides a straightforward and reliable approach. Remember to carefully tune your exploration strategy to ensure effective learning and convergence.

Feeling ready to apply your SARSA knowledge? Many AI/ML learners face the challenge of moving from theory to practical application. upGrad's courses provide the structured learning and hands-on projects you need to bridge this gap and accelerate your career.

Building upon the foundational understanding gained from the upGrad courses, you can further equip yourself to tackle practical challenges, such as developing safe autonomous systems with these additional courses: 

Besides the courses above, upGrad also offers free courses that you can use to get started with the basics:

Not sure how to move forward in your ML career? upGrad provides personalized career counseling to help you choose the best path based on your goals and experience. Visit a nearby upGrad centre or start online with expert-led guidance.

FAQs 

1. What are common pitfalls to avoid when implementing SARSA in practice?

When applying SARSA, key pitfalls include poor hyperparameter choices—like setting learning rates or discount factors too high or too low—and using fixed epsilon values that hinder proper exploration. These can lead to suboptimal policies or non-convergence. It’s also common to ignore the importance of tracking visited state-action pairs, which is critical for convergence. In real-world scenarios like autonomous navigation, these mistakes can cause erratic paths or failure to adapt. Systematic tuning, decay schedules, and validation episodes help overcome these issues effectively.

2. Can SARSA be applied effectively in deterministic game environments?

Yes, SARSA works well in deterministic settings where state transitions and rewards are predictable, such as gridworlds, simulations, or simple game environments. It learns through on-policy updates, which makes it suitable when actual agent behavior—including exploration—is crucial. For instance, in educational simulations for training AI, SARSA can teach agents policies that reflect realistic action choices rather than hypothetical optimal actions. While Q-learning might be faster in such cases, SARSA’s behaviorally accurate policy learning often results in more robust and interpretable decision-making.

3. How do initial Q-values affect SARSA’s learning performance?

Initial Q-values significantly impact SARSA’s exploration and convergence rate. Optimistic initialization encourages exploration early on, helping the agent discover high-reward strategies faster. In contrast, zero or small random values might delay learning in sparse-reward environments. For example, in warehouse automation, initializing Q-values based on average travel times can help robots prioritize exploration paths more effectively. While SARSA will eventually converge with sufficient exploration, well-informed initial values reduce the learning time and improve practical efficiency in systems where time and resource constraints exist.

4. Is SARSA guaranteed to converge to an optimal policy in real-world settings?

SARSA can converge to an optimal policy under certain theoretical conditions: a decaying learning rate, consistent exploration, and sufficient episode iterations. In practice, however, convergence might be hindered by limited exploration or environmental noise. For example, in robotic applications like warehouse picking, physical constraints or safety limitations may prevent full exploration. In such cases, SARSA may converge to a near-optimal, rather than globally optimal, policy. Nonetheless, carefully designed decay schedules and structured training environments can significantly improve convergence behavior in applied systems.

5. How does SARSA compare with policy gradient methods in continuous control tasks?

SARSA is value-based and best suited for discrete action spaces, while policy gradient methods directly optimize stochastic policies, making them ideal for continuous control tasks like robotic arms or autonomous drones. SARSA requires discretization in such environments, which can lead to reduced precision or performance. In contrast, policy gradients can learn fluid, smooth policies suitable for high-dimensional action spaces. However, SARSA is often simpler and more stable, making it a better fit when resources are constrained or when deterministic control is acceptable.

6. What’s the effect of setting a high or low learning rate (α) in SARSA?

A high learning rate causes rapid updates to Q-values, making SARSA unstable or overly sensitive to recent experiences. A low rate results in slow learning and delayed convergence. In real-world applications, like adaptive heating and cooling systems, setting α too high can lead to erratic energy usage, while too low a value causes sluggish responses to environmental changes. A decaying or adaptive learning rate strategy is often preferred in practice to ensure initial responsiveness without sacrificing long-term stability or convergence accuracy.

7. How does the discount factor (γ) influence decision-making in SARSA?

The discount factor γ determines how much the agent values future rewards. A low γ makes the agent short-sighted, focusing on immediate gains, which is suitable for rapid-response systems like elevator control. A high γ promotes long-term planning, valuable in contexts like route optimization or portfolio management. For instance, a delivery drone system using SARSA with a high γ will favor energy-efficient routes over quick deliveries, considering battery life. Therefore, γ should reflect task priorities—whether optimizing for immediate success or sustainable performance over time.

8. Can SARSA scale to large state spaces using function approximation?

Yes, SARSA can be extended to large or continuous environments using function approximators such as linear regression models or neural networks. Instead of maintaining a massive Q-table, the agent learns weights for features that generalize across similar states. For example, in autonomous driving, SARSA with a neural network can learn to evaluate complex visual input without storing every possible configuration. This approach increases scalability and adaptability but requires careful tuning to avoid instability, especially when using non-linear function approximators like deep neural networks.

9. How does SARSA manage exploration vs. exploitation in online learning?

SARSA balances exploration and exploitation using strategies like epsilon-greedy, where the agent occasionally explores random actions. Initially, high epsilon encourages diverse action sampling, which is gradually reduced to favor high-value decisions. In online recommendation systems, this means showing a user new content early on and later personalizing based on preference patterns. The key is to decay epsilon carefully—too fast leads to premature convergence; too slow wastes learning episodes. SARSA’s on-policy nature ensures the learned policy reflects actual exploration behavior, not hypothetical optima.

10. What variations of SARSA improve learning speed or performance?

SARSA(λ) incorporates eligibility traces, allowing the agent to update not just the current state-action pair but also previously visited ones. This improves credit assignment and speeds up learning, especially in environments with delayed rewards. For example, in a smart grid system managing energy distribution, SARSA(λ) helps adjust control policies based on cumulative demand patterns over time. Another variation, True Online SARSA(λ), addresses function approximation challenges and offers better theoretical guarantees. These variants enhance learning efficiency in complex, long-horizon decision-making problems.
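
For a rough sense of how eligibility traces change the tabular update, here is a minimal sketch of one SARSA(λ) time step; it assumes Q-value and trace tables keyed by (state, action) pairs and uses accumulating traces, so treat it as illustrative rather than a reference implementation.

# One time step of tabular SARSA(λ) with accumulating eligibility traces.
# Q and E are dicts keyed by (state, action); lam (λ) controls how far the
# current TD error is propagated back along the visited trajectory.
def sarsa_lambda_step(Q, E, s, a, reward, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    td_error = reward + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] = E.get((s, a), 0.0) + 1.0   # bump the trace for the current pair
    for key in E:                          # update every traced pair
        Q[key] += alpha * td_error * E[key]
        E[key] *= gamma * lam              # decay the trace
    return Q, E

Setting lam=0 in this sketch collapses it back to the one-step SARSA update covered earlier in the article.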

11. When should you prefer SARSA over Q-learning in practical applications?

SARSA is preferable when the learned policy must reflect actual behavior, especially during exploration. Unlike Q-learning, which updates based on the best possible action, SARSA uses the action actually taken—making it safer in risk-sensitive domains. In self-driving cars or robotic surgery, SARSA avoids overestimating risky actions that might appear optimal in theory. This makes it more suitable for safety-critical systems where action reliability and behavioral consistency matter more than theoretical optimality. It ensures learning aligns with real-world constraints and operational policies.
