IPL Match Winner Prediction using Logistic Regression
By Rohit Sharma
Updated on Aug 06, 2025 | 13 min read | 1.35K+ views
The Indian Premier League (IPL) is one of the most popular and competitive cricket tournaments in the world. With millions of fans and high-stakes matches, predicting the outcome of a game is both exciting and challenging.
This project focuses on IPL Match Winner Prediction using machine learning. It analyses past match data, like teams, venue, toss, and decisions, to predict which team is likely to win. The model is built using Python and Logistic Regression, offering a practical application of data science in sports analytics.
To work smoothly on the IPL Match Winner Prediction project, make sure you're comfortable with the following:
To build and evaluate the IPL match winner prediction model, you’ll use widely adopted Python libraries and tools for data preprocessing, classification, and evaluation. Here’s what you’ll need:
Tool / Library | Purpose |
Python | Core programming language for writing and running the code |
Google Colab | Free online platform to execute Python code with pre-installed libraries |
Pandas | Reads the IPL dataset and helps clean and manipulate tabular data |
NumPy | Supports array operations and numerical computations |
LabelEncoder (from sklearn) | Encodes categorical team and venue names into a numerical format |
LogisticRegression | Used to build the classification model that predicts the match winner |
To predict the winner of an IPL match, we built a classification model using historical match data. The model learns patterns from previous games, such as the season, city, and toss decision, to estimate the likely winner of a match.
Note: This project takes about 2 to 3 hours to complete, depending on your familiarity with preprocessing and model training in scikit-learn.
Here’s how you can build this project from scratch using Python and machine learning:
1. Load the IPL Match Dataset
Import historical IPL match data containing details such as the two teams, the venue, the toss winner, and the toss decision.
2. Clean and Preprocess the Data
Fill in missing values and standardise inconsistent team names.
3. Explore and Visualise the Data
Use visual tools such as: Countplots, Bar plots and Venue analysis
4. Train a Classification Model
Apply Logistic Regression to classify which team is more likely to win
5. Evaluate Model Performance
Use the accuracy score and the confusion matrix to measure prediction quality
Without any delay, let's get started!
Before starting, download the IPL dataset from Kaggle (the code below expects matches.csv and deliveries.csv) and make the files available in your working directory. Then import the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
# Load the datasets
try:
    matches_df = pd.read_csv('matches.csv')
    deliveries_df = pd.read_csv('deliveries.csv')
except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("Please make sure 'matches.csv' and 'deliveries.csv' are in the correct directory.")
    # Stop here if the files are not found
    raise SystemExit(1)
Alright, after getting all the necessary libraries imported and the data uploaded to Google Colab, we're good to go and ready to kick off this project!
Before modelling, we'll examine matches.csv to understand its structure, columns, data types, and missing values.
print("--- Initial Exploration of matches.csv ---")
print("First 5 rows of the matches dataset:")
print(matches_df.head())
print("\nInformation about the matches dataset:")
matches_df.info()
Output:
--- Initial Exploration of matches.csv ---
First 5 rows of the matches dataset:
id season city date match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
venue team1 \
0 M Chinnaswamy Stadium Royal Challengers Bangalore
1 Punjab Cricket Association Stadium, Mohali Kings XI Punjab
2 Feroz Shah Kotla Delhi Daredevils
3 Wankhede Stadium Mumbai Indians
4 Eden Gardens Kolkata Knight Riders
team2 toss_winner toss_decision \
0 Kolkata Knight Riders Royal Challengers Bangalore field
1 Chennai Super Kings Chennai Super Kings bat
2 Rajasthan Royals Rajasthan Royals bat
3 Royal Challengers Bangalore Mumbai Indians bat
4 Deccan Chargers Deccan Chargers bat
winner result result_margin target_runs \
0 Kolkata Knight Riders runs 140.0 223.0
1 Chennai Super Kings runs 33.0 241.0
2 Delhi Daredevils wickets 9.0 130.0
3 Royal Challengers Bangalore wickets 5.0 166.0
4 Kolkata Knight Riders wickets 5.0 111.0
target_overs super_over method umpire1 umpire2
0 20.0 N NaN Asad Rauf RE Koertzen
1 20.0 N NaN MR Benson SL Shastri
2 20.0 N NaN Aleem Dar GA Pratapkumar
3 20.0 N NaN SJ Davis DJ Harper
4 20.0 N NaN BF Bowden K Hariharan
Information about the matches dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1095 non-null int64
1 season 1095 non-null object
2 city 1044 non-null object
3 date 1095 non-null object
4 match_type 1095 non-null object
5 player_of_match 1090 non-null object
6 venue 1095 non-null object
7 team1 1095 non-null object
8 team2 1095 non-null object
9 toss_winner 1095 non-null object
10 toss_decision 1095 non-null object
11 winner 1090 non-null object
12 result 1095 non-null object
13 result_margin 1076 non-null float64
14 target_runs 1092 non-null float64
15 target_overs 1092 non-null float64
16 super_over 1095 non-null object
17 method 21 non-null object
18 umpire1 1095 non-null object
19 umpire2 1095 non-null object
dtypes: float64(3), int64(1), object(16)
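The non-null counts above imply the number of missing values per column (for example, 1095 - 1044 = 51 missing cities). As a minimal sketch, the same summary can be computed directly. The small `sample_df` below is a made-up stand-in for `matches_df`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Made-up stand-in for matches_df; values are for illustration only
sample_df = pd.DataFrame({
    "city": ["Bangalore", None, "Delhi", "Mumbai"],
    "winner": ["Team A", "Team B", None, "Team A"],
    "method": [None, None, "D/L", None],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": sample_df.isna().sum(),
    "missing_pct": (sample_df.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Running the same two expressions on `matches_df` would show which columns need handling before modelling.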
To ensure the dataset is clean and ready for analysis, we handle missing values and standardise inconsistent team names. This step improves data quality and avoids issues during modelling.
print("\n--- Data Cleaning and Preprocessing ---")
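# Handling missing values in 'city'
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
matches_df['city'] = matches_df['city'].fillna('Unknown')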
# Display original team names
print("\nOriginal Team Names:")
print(sorted(matches_df['team1'].unique()))
# Standardizing inconsistent team names
team_name_map = {
    'Rising Pune Supergiant': 'Rising Pune Supergiants',
    'Delhi Daredevils': 'Delhi Capitals',
    'Kings XI Punjab': 'Punjab Kings'
}
# Apply the mapping to every column that holds a team name
# (the two Royal Challengers spellings are left unchanged here)
for col in ['team1', 'team2', 'toss_winner', 'winner']:
    matches_df[col] = matches_df[col].replace(team_name_map)
# Display cleaned team names
print("\nStandardized Team Names:")
print(sorted(matches_df['team1'].unique()))
Output:
Original Team Names:
['Chennai Super Kings', 'Deccan Chargers', 'Delhi Capitals', 'Delhi Daredevils', 'Gujarat Lions', 'Gujarat Titans', 'Kings XI Punjab', 'Kochi Tuskers Kerala', 'Kolkata Knight Riders', 'Lucknow Super Giants', 'Mumbai Indians', 'Pune Warriors', 'Punjab Kings', 'Rajasthan Royals', 'Rising Pune Supergiant', 'Rising Pune Supergiants', 'Royal Challengers Bangalore', 'Royal Challengers Bengaluru', 'Sunrisers Hyderabad']
Standardized Team Names:
['Chennai Super Kings', 'Deccan Chargers', 'Delhi Capitals', 'Gujarat Lions', 'Gujarat Titans', 'Kochi Tuskers Kerala', 'Kolkata Knight Riders', 'Lucknow Super Giants', 'Mumbai Indians', 'Pune Warriors', 'Punjab Kings', 'Rajasthan Royals', 'Rising Pune Supergiants', 'Royal Challengers Bangalore', 'Royal Challengers Bengaluru', 'Sunrisers Hyderabad']
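The standardised list still contains both 'Royal Challengers Bangalore' and 'Royal Challengers Bengaluru'. If you prefer to treat them as one franchise (an assumption, not something the original code does), a sketch like the following merges them. `sample_df` is a made-up stand-in for `matches_df`, and `extra_name_map` is a hypothetical name.

```python
import pandas as pd

# Made-up stand-in for matches_df containing both spellings
sample_df = pd.DataFrame({
    "team1": ["Royal Challengers Bangalore", "Royal Challengers Bengaluru", "Mumbai Indians"],
    "team2": ["Mumbai Indians", "Chennai Super Kings", "Royal Challengers Bangalore"],
    "toss_winner": ["Mumbai Indians", "Royal Challengers Bengaluru", "Royal Challengers Bangalore"],
    "winner": ["Royal Challengers Bangalore", "Chennai Super Kings", "Mumbai Indians"],
})

# Assumption: both spellings refer to the same team and should share one label
extra_name_map = {"Royal Challengers Bangalore": "Royal Challengers Bengaluru"}
team_columns = ["team1", "team2", "toss_winner", "winner"]
sample_df[team_columns] = sample_df[team_columns].replace(extra_name_map)

print(sorted(sample_df["team1"].unique()))
```

Applying this to the real data would reduce the number of distinct team names, so the counts reported in later steps would change accordingly.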
In this step, we explore patterns and trends in the dataset using visualisations. These insights help us understand team performances, seasonal trends, toss decisions, and match outcomes.
print("\n--- Starting Exploratory Data Analysis ---")
# Set plot style
sns.set_style("whitegrid")
# Plot 1: Number of matches played each season
plt.figure(figsize=(12, 6))
matches_df['season_str'] = matches_df['season'].astype(str)
sns.countplot(x='season_str', data=matches_df, order=sorted(matches_df['season_str'].unique()), palette='magma')
plt.title('Number of Matches Played Each Season', fontsize=16)
plt.xlabel('Season', fontsize=12)
plt.ylabel('Number of Matches', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('matches_per_season.png')
print("\nGenerated plot: 'matches_per_season.png'")
# Plot 2: Number of matches won by each team
plt.figure(figsize=(12, 8))
winner_counts = matches_df['winner'].value_counts()
winner_counts = winner_counts[winner_counts > 0]
sns.barplot(y=winner_counts.index, x=winner_counts.values, palette='viridis')
plt.title('Total Matches Won by Each Team', fontsize=16)
plt.xlabel('Number of Matches Won', fontsize=12)
plt.ylabel('Team', fontsize=12)
plt.tight_layout()
plt.savefig('matches_won_by_team.png')
print("Generated plot: 'matches_won_by_team.png'")
# Plot 3: Impact of Toss Decision
plt.figure(figsize=(7, 7))
toss_decision_counts = matches_df['toss_decision'].value_counts()
plt.pie(toss_decision_counts, labels=toss_decision_counts.index, autopct='%1.1f%%', startangle=140, colors=['#FF9999','#66B2FF'], textprops={'fontsize': 14})
plt.title('Toss Decision Percentage', fontsize=16)
plt.ylabel('')
plt.savefig('toss_decision_pie_chart.png')
print("Generated plot: 'toss_decision_pie_chart.png'")
# Feature Engineering: Toss Winner vs Match Winner
matches_df['toss_winner_is_match_winner'] = np.where(matches_df['toss_winner'] == matches_df['winner'], 'Yes', 'No')
# Plot 4: Toss Winner vs. Match Winner
plt.figure(figsize=(8, 6))
sns.countplot(x='toss_winner_is_match_winner', data=matches_df, palette='coolwarm')
plt.title('Does the Toss Winner Become the Match Winner?', fontsize=16)
plt.xlabel('Toss Winner is Match Winner', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(fontsize=12)
plt.savefig('toss_winner_vs_match_winner.png')
print("Generated plot: 'toss_winner_vs_match_winner.png'")
# Analysis of Wins by Batting/Bowling First
matches_won_by_batting_first = matches_df[matches_df['result'] == 'runs'].shape[0]
matches_won_by_bowling_first = matches_df[matches_df['result'] == 'wickets'].shape[0]
print(f"\nAnalysis of Match Outcomes:")
print(f"Number of matches won by batting first: {matches_won_by_batting_first}")
print(f"Number of matches won by bowling first: {matches_won_by_bowling_first}")
Output:
Analysis of Match Outcomes:
Number of matches won by batting first: 498
Number of matches won by bowling first: 578
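To put the two counts on a common scale, they can be converted to shares of the matches decided by runs or wickets. The sketch below only reuses the numbers printed above; the variable names are illustrative.

```python
# Counts copied from the output above
won_batting_first = 498
won_bowling_first = 578

# Matches decided by runs or by wickets
decided = won_batting_first + won_bowling_first
share_batting_first = won_batting_first / decided
share_bowling_first = won_bowling_first / decided

print(f"Decided matches: {decided}")
print(f"Won batting first: {share_batting_first:.1%}")
print(f"Won bowling first: {share_bowling_first:.1%}")
```

This works out to roughly 46.3% of decided matches won by the side batting first and 53.7% by the side bowling first.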
In this step, we prepare the IPL match data for machine learning. We aim to predict whether Team 1 will win a match using historical match features such as season, city, toss winner, and toss decision. This setup allows us to frame a binary classification problem.
# Create a copy of the dataframe for ML processing
ml_df = matches_df.copy()
# Remove rows where winner is NaN (tie/no result matches)
ml_df = ml_df.dropna(subset=['winner'])
# Create target variable: 1 if team1 wins, 0 if team2 wins
ml_df['team1_wins'] = (ml_df['team1'] == ml_df['winner']).astype(int)
print(f"Total matches for ML training: {len(ml_df)}")
print(f"Team1 wins: {ml_df['team1_wins'].sum()}")
print(f"Team2 wins: {len(ml_df) - ml_df['team1_wins'].sum()}")
# Select features for our model
features_to_use = ['season', 'city', 'toss_winner', 'toss_decision']
# Initialize label encoders
label_encoders = {}
# Encode categorical variables
for feature in features_to_use:
    le = LabelEncoder()
    ml_df[feature + '_encoded'] = le.fit_transform(ml_df[feature])
    label_encoders[feature] = le
    print(f"Encoded {feature}: {len(le.classes_)} unique values")
# Create additional features
ml_df['toss_winner_is_team1'] = (ml_df['team1'] == ml_df['toss_winner']).astype(int)
# Final feature set
feature_columns = ['season_encoded', 'city_encoded', 'toss_decision_encoded', 'toss_winner_is_team1']
X = ml_df[feature_columns]
y = ml_df['team1_wins']
print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print("\nFeatures used in the model:")
for i, col in enumerate(feature_columns, 1):
    print(f"{i}. {col}")
Output:
Total matches for ML training: 1090
Team1 wins: 555
Team2 wins: 535
Encoded season: 17 unique values
Encoded city: 37 unique values
Encoded toss_winner: 16 unique values
Encoded toss_decision: 2 unique values
Feature matrix shape: (1090, 4)
Target vector shape: (1090,)
Features used in the model:
1. season_encoded
2. city_encoded
3. toss_decision_encoded
4. toss_winner_is_team1
After preparing the feature matrix (X) and target variable (y), we split the data into 80% for training and 20% for testing. This stratified split maintains the proportion of Team1 wins, allowing us to train the model and evaluate it on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing, 80% for training
    random_state=42,  # For reproducible results
    stratify=y        # Maintain the same proportion of wins/losses in both sets
)
print(f"Training set size: {X_train.shape[0]} matches")
print(f"Testing set size: {X_test.shape[0]} matches")
print(f"Training set - Team1 wins: {y_train.sum()} ({y_train.mean():.2%})")
print(f"Testing set - Team1 wins: {y_test.sum()} ({y_test.mean():.2%})")
Output:
Training set size: 872 matches
Testing set size: 218 matches
Training set - Team1 wins: 444 (50.92%)
Testing set - Team1 wins: 111 (50.92%)
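A single 80/20 split gives one accuracy estimate from 218 matches, which can vary noticeably with the random seed. As a hedged sketch, stratified k-fold cross-validation averages over several splits instead. The data below is synthetic (generated with `make_classification` to match the shape of `X` and `y`), so the scores say nothing about the IPL data itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same number of rows and features as X and y
X_demo, y_demo = make_classification(
    n_samples=1090, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Replacing `X_demo` and `y_demo` with the project's `X` and `y` would give a mean accuracy and a spread across folds rather than a single number.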
With the data split and preprocessed, we now train a Logistic Regression model. This algorithm is widely used for binary classification problems like predicting match outcomes (Team1 wins or not).
We use the LogisticRegression class from scikit-learn, with max_iter=1000 to give the solver enough iterations to converge during training.
# Initialize and train the model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)
print("Model training completed!")
# Display feature importance (coefficients)
print("\nFeature Importance (Coefficients):")
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': model.coef_[0],
    'Abs_Coefficient': np.abs(model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)
for _, row in feature_importance.iterrows():
    print(f"{row['Feature']}: {row['Coefficient']:.4f}")
Output:
Feature Importance (Coefficients):
toss_decision_encoded: -0.1439
season_encoded: -0.0090
toss_winner_is_team1: -0.0037
city_encoded: 0.0001
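These coefficients are in log-odds per one-unit change of each encoded feature, and the features have very different ranges (2 toss decisions, 17 seasons, 37 cities), so their raw magnitudes are not directly comparable. The sketch below, using only the numbers printed above, converts each coefficient to an odds ratio and also shows the log-odds change across each feature's full encoded range. `coef_table` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above; n_values from the encoding step
coef_table = pd.DataFrame({
    "feature": ["toss_decision_encoded", "season_encoded", "toss_winner_is_team1", "city_encoded"],
    "coefficient": [-0.1439, -0.0090, -0.0037, 0.0001],
    "n_values": [2, 17, 2, 37],
})

# Odds ratio for a one-unit increase in the encoded feature
coef_table["odds_ratio"] = np.exp(coef_table["coefficient"])

# Change in log-odds from the lowest code (0) to the highest (n_values - 1)
coef_table["log_odds_over_range"] = coef_table["coefficient"] * (coef_table["n_values"] - 1)

print(coef_table.round(4))
```

Because LabelEncoder assigns codes in sorted order ('bat' is 0, 'field' is 1), the toss decision coefficient corresponds to an odds ratio of about 0.87 for Team1 winning when the toss decision is to field. Across its 17 codes, season spans about -0.144 in log-odds, which is comparable to the toss decision effect despite the much smaller per-unit coefficient.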
After training the logistic regression model, the next step is to evaluate how well it performs on both the training and testing datasets.
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy:.2%})")
print(f"Testing Accuracy: {test_accuracy:.4f} ({test_accuracy:.2%})")
# Check for overfitting
# (with a test set of about 200 matches, a gap of a few points in either direction can be sampling noise)
if train_accuracy - test_accuracy > 0.1:
    print("Warning: Potential overfitting detected (training accuracy much higher than testing)")
elif test_accuracy > train_accuracy:
    print("Good sign: Model generalizes well to unseen data")
else:
    print("Model performance looks reasonable")
# Detailed classification report
print("\n--- Detailed Classification Report ---")
print("Testing Set Performance:")
print(classification_report(y_test, y_test_pred, target_names=['Team2 Wins', 'Team1 Wins']))
# --- Confusion Matrix ---
print("\n--- Confusion Matrix Analysis ---")
# Calculate confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Team2 Wins', 'Team1 Wins'],
            yticklabels=['Team2 Wins', 'Team1 Wins'])
plt.title('Confusion Matrix - IPL Match Prediction', fontsize=16)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
print("Generated plot: 'confusion_matrix.png'")
# Interpret confusion matrix
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"True Negatives (Team2 wins, predicted Team2): {tn}")
print(f"False Positives (Team2 wins, predicted Team1): {fp}")
print(f"False Negatives (Team1 wins, predicted Team2): {fn}")
print(f"True Positives (Team1 wins, predicted Team1): {tp}")
# Calculate additional metrics
precision_team1 = tp / (tp + fp) if (tp + fp) > 0 else 0
recall_team1 = tp / (tp + fn) if (tp + fn) > 0 else 0
precision_team2 = tn / (tn + fn) if (tn + fn) > 0 else 0
recall_team2 = tn / (tn + fp) if (tn + fp) > 0 else 0
print(f"\nAdditional Metrics:")
print(f"Precision for Team1 wins: {precision_team1:.4f}")
print(f"Recall for Team1 wins: {recall_team1:.4f}")
print(f"Precision for Team2 wins: {precision_team2:.4f}")
print(f"Recall for Team2 wins: {recall_team2:.4f}")
Output:
Training Accuracy: 0.5080 (50.80%)
Testing Accuracy: 0.5734 (57.34%)
Good sign: Model generalizes well to unseen data
--- Detailed Classification Report ---
Testing Set Performance:
precision recall f1-score support
Team2 Wins 0.58 0.46 0.51 107
Team1 Wins 0.57 0.68 0.62 111
accuracy 0.57 218
macro avg 0.58 0.57 0.57 218
weighted avg 0.58 0.57 0.57 218
--- Confusion Matrix Analysis ---
Generated plot: 'confusion_matrix.png'
Confusion Matrix Breakdown:
True Negatives (Team2 wins, predicted Team2): 49
False Positives (Team2 wins, predicted Team1): 58
False Negatives (Team1 wins, predicted Team2): 35
True Positives (Team1 wins, predicted Team1): 76
Additional Metrics:
Precision for Team1 wins: 0.5672
Recall for Team1 wins: 0.6847
Precision for Team2 wins: 0.5833
Recall for Team2 wins: 0.4579
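To judge whether 57.34% is meaningfully better than guessing, it helps to compare it with the accuracy of always predicting the majority class and to attach a rough uncertainty interval. The sketch below uses only the confusion matrix counts printed above and a normal-approximation interval, which is one simple choice among several.

```python
import math

# Counts copied from the confusion matrix breakdown above
tn, fp, fn, tp = 49, 58, 35, 76
n = tn + fp + fn + tp

accuracy = (tp + tn) / n

# Accuracy of always predicting the more frequent class in the test set
baseline = max(tp + fn, tn + fp) / n

# Normal-approximation 95% interval for the test accuracy
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
ci_low = accuracy - 1.96 * std_err
ci_high = accuracy + 1.96 * std_err

print(f"Test accuracy: {accuracy:.4f}")
print(f"Majority-class baseline: {baseline:.4f}")
print(f"Approximate 95% interval: {ci_low:.4f} to {ci_high:.4f}")
```

The interval runs from roughly 0.51 to 0.64, and the majority-class baseline of about 0.509 sits just inside its lower end, so the test result alone does not clearly separate the model from always predicting a Team1 win.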
After evaluating the model, it’s helpful to understand which features had the most influence on the prediction. This step visualises the coefficients of the logistic regression model to interpret their impact.
plt.figure(figsize=(10, 6))
# Sort features by their coefficients
feature_importance_sorted = feature_importance.sort_values('Coefficient')
# Color based on sign of coefficient
colors = ['red' if coef < 0 else 'blue' for coef in feature_importance_sorted['Coefficient']]
# Horizontal bar plot
plt.barh(feature_importance_sorted['Feature'],
         feature_importance_sorted['Coefficient'],
         color=colors, alpha=0.7)
plt.title('Feature Importance in Logistic Regression Model', fontsize=16)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
plt.tight_layout()
# Save the plot
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
print("Generated plot: 'feature_importance.png'")
Output: a horizontal bar chart of the four model coefficients, saved as 'feature_importance.png'.
This project focused on IPL Match Winner Prediction using a Logistic Regression model. After exploring and preprocessing the dataset, we trained the model on four features: season, city, toss decision, and whether Team 1 won the toss. The model was trained on 872 matches and tested on 218, reaching a test accuracy of 57.34%, compared with a training accuracy of 50.80% and a Team 1 win rate of 50.92% in both sets. The toss decision had the largest coefficient (-0.1439) and the city the smallest (0.0001), but because the features are label-encoded and unscaled, these magnitudes are not directly comparable. The accuracy is only modestly above what always predicting a Team 1 win would achieve, so the model is best read as a baseline for further improvements with more advanced models or richer feature sets.
Colab Link:
https://colab.research.google.com/drive/1k3iHLso9gcVPvy15KeJBIrN4MRGS7_TA?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...