Air Quality Analysis and Prediction Using Random Forest

By Rohit Sharma

Updated on Aug 11, 2025

Air pollution has become a critical concern in many Indian cities, affecting health and quality of life.

In this project, you'll perform Air Quality Analysis using real-world data collected from monitoring stations across India. By cleaning the data, exploring pollutant trends, and applying machine learning techniques, you'll gain insights into air quality patterns and build a predictive model that estimates AQI from pollutant levels.

What Should You Know First

Before you begin working on the Air Quality Analysis and Prediction project, make sure you're comfortable with the following tools and concepts:

  • Python programming (You'll use Python to load the dataset, clean and analyse the data, and develop the machine learning model)
  • Pandas and Numpy (These libraries are essential for handling structured data, performing aggregations, and transforming pollutant values)
  • Matplotlib or Seaborn (You'll use these tools to create visualisations like line plots, correlation heatmaps, and city-wise AQI trends)
  • Scikit‑learn basics (Required for building preprocessing pipelines and training models such as Random Forest Regressor for AQI prediction)
  • Data preprocessing (You should understand how to handle missing values, normalise numerical features, and encode categorical data)
  • Model evaluation (Familiarity with metrics like R² score, Root Mean Squared Error (RMSE), and Mean Squared Error (MSE) is necessary to evaluate prediction performance effectively)
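
The evaluation metrics named in the last bullet can be worked through by hand on a toy example. The sketch below uses made-up AQI values (they are not taken from the dataset) and computes MSE, RMSE, and R² directly from their definitions, so the manual results can be compared with scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted AQI values, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: mean of squared errors; RMSE: its square root (same units as AQI)
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}")
# Prints: MSE: 100.00, RMSE: 10.00, R²: 0.9680
```

Because RMSE is expressed in AQI units, it is usually the easier of the two error metrics to interpret next to the AQI scale itself.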

Techniques You'll Use for this Project

To maximise the effectiveness of your Air Quality Analysis and Prediction project, you will utilise the following data science techniques:

  • Exploratory Data Analysis (EDA):
    Analyse national and city-level air quality trends. Identify the most affected cities, common pollutant patterns, and how AQI changes over time.
  • Feature Engineering:
    Clean and transform pollutant readings (e.g., PM2.5, PM10, NO2, SO2, CO, O3) to create features suitable for machine learning models.
  • Data Visualisation:
    Use line charts, bar plots, correlation heatmaps, and interactive Plotly graphs to explore pollutant behaviour, AQI patterns, and city comparisons.
  • Regression Modelling:
    Train a Random Forest Regressor to predict AQI levels based on pollutant values. Evaluate its performance using R² and RMSE.

Project Completion Time: The Air Quality Analysis and Prediction project is estimated to take 3 to 4 hours to complete. This duration may vary based on your proficiency with Python in the areas of data preprocessing, visualisation, and model building.

Let’s build this Air Quality Analysis and Prediction project from scratch with clear, step-by-step guidance:

Step No. | Step Title | Purpose in the Project
1 | Load the Dataset | Read the air quality dataset (station_day.csv) into a DataFrame for analysis.
2 | Clean the Data | Handle missing or inconsistent values in pollutant readings and AQI columns.
3 | Analyse National Air Quality | Explore overall air quality trends across India using descriptive statistics and plots.
4 | Explore City-Level Pollution | Break down pollution levels by city to identify the most and least polluted areas.
5 | Visualise Pollutant Trends | Use line plots and heatmaps to visualise PM2.5, PM10, NO2, etc., across cities and months.
6 | Preprocess the Data for Modelling | Select relevant features, handle missing values, and scale numerical data for modelling.
7 | Train a Regression Model | Build a Random Forest Regressor to predict AQI based on pollutant levels.
8 | Evaluate the Model | Measure model performance using R² score and RMSE to assess prediction quality.
9 | Predict AQI for New Input | Use the trained model to predict AQI for hypothetical pollutant values.

Let's get started!

Step 1: Download the Dataset

To start your Air Quality Analysis and Prediction project, download the "Air Quality Data India station_day.csv" dataset. This dataset, which contains daily pollutant levels and AQI readings for various Indian cities, is available for free online and on Kaggle.

Step 2: Import Libraries and Dependencies

To get started with your Air Quality Analysis and Prediction project, the first step is to import all the necessary libraries. These libraries help with data manipulation, visualisation, model building, and evaluation.

Here are the imports used throughout the project:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

Step 3: Loading the Dataset

Now that the necessary libraries are imported, the next step is to load your air quality dataset. This dataset contains daily air quality readings from various monitoring stations across Indian cities.

If the file is missing, the script prints an error message and stops so that the later steps don't fail.

try:
    df = pd.read_csv('station_day.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'station_day.csv' not found. Please ensure the file is in the correct directory.")
    # Stop here with a non-zero exit status; later steps need df
    raise SystemExit(1)
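
Before moving on, it can help to confirm that the loaded file contains the columns the later steps rely on. The following is a minimal sketch: `missing_columns` is a hypothetical helper (not part of pandas), and the small DataFrame is a made-up stand-in for the real file.

```python
import pandas as pd

# Columns used in the later steps of this project
REQUIRED = ['StationId', 'Date', 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']

def missing_columns(frame, required=REQUIRED):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Made-up stand-in for the real dataset, deliberately missing two columns
sample = pd.DataFrame({'StationId': ['AP001'], 'Date': ['2020-01-01'],
                       'PM2.5': [80.0], 'PM10': [120.0], 'NO2': [30.0],
                       'SO2': [10.0], 'CO': [0.9]})
print(missing_columns(sample))
# Prints: ['O3', 'AQI']
```

On the real data, calling `missing_columns(df)` should return an empty list if the file matches what the rest of the project expects.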

Step 4: Clean and Prepare the Data

Once the dataset is loaded, the next step is to clean and structure it for analysis. This includes parsing dates, handling missing values, and extracting useful features like year, month, and city.

Here's how it's done: 

# Convert 'Date' to datetime; unparseable values become NaT and are dropped
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dropna(subset=['Date'], inplace=True)

# Define key pollutants and AQI
pollutants = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']

# Convert pollutant columns to numeric and fill missing values with the column mean
# (note: this also fills missing AQI, the prediction target, with its mean)
for col in pollutants:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    # Assign the result back; an in-place fillna on a selected column
    # triggers a chained-assignment warning in recent pandas versions
    df[col] = df[col].fillna(df[col].mean())

# Create Year and Month columns for trend analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Derive 'City' from StationId: the part before the first underscore,
# or the whole ID when there is no underscore
df['City'] = df['StationId'].astype(str).str.split('_').str[0]
print("Data preparation complete. Shape of cleaned data:", df.shape)

Output:

Data preparation complete. Shape of cleaned data: (108035, 19)
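
Filling every gap with the column mean is simple, but it also assigns an average AQI to rows where the target was never measured. As an alternative, and not the method used in this article, the sketch below drops rows with a missing target and fills feature gaps with the median of the same station. The toy values are made up, and whether per-station medians suit the real data is an assumption to check.

```python
import numpy as np
import pandas as pd

# Toy data standing in for station_day.csv (values are made up)
toy = pd.DataFrame({
    'StationId': ['A', 'A', 'A', 'B', 'B', 'B'],
    'PM2.5':     [50.0, np.nan, 70.0, 200.0, 220.0, np.nan],
    'AQI':       [90.0, 100.0, np.nan, 310.0, 330.0, 320.0],
})

# 1) Keep only rows where the target (AQI) was actually observed
toy = toy.dropna(subset=['AQI']).copy()

# 2) Fill feature gaps with the median of the same station
toy['PM2.5'] = toy['PM2.5'].fillna(
    toy.groupby('StationId')['PM2.5'].transform('median')
)
print(toy)
```

In this toy frame, the gap at station A is filled with 50.0 and the gap at station B with 210.0, so each station's missing readings stay close to its own typical level.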

Step 5: Exploratory Data Analysis (EDA)

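In this step, we analyse air quality trends across India, first at the national level and then by city and by pollutant. We'll look at how the average Air Quality Index (AQI) has changed over the years and months to understand long-term patterns and seasonal variations, identify the most and least polluted cities, and compare average pollutant concentrations.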

# --- 2. National Level Air Quality Analysis ---
print("\n--- 2. National Level Air Quality Analysis ---")
sns.set_style("darkgrid")

# a. National Average AQI Trend (Yearly)
plt.figure(figsize=(12, 6))
national_yearly_aqi = df.groupby('Year')['AQI'].mean()
sns.lineplot(x=national_yearly_aqi.index, y=national_yearly_aqi.values, marker='o', color='royalblue')
plt.title('National Average AQI Trend in India (2015-2020)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.xticks(national_yearly_aqi.index.astype(int))
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.savefig('national_yearly_aqi_trend.png')
print("Generated 'national_yearly_aqi_trend.png'")

# b. National Average AQI Trend (Monthly/Seasonal)
plt.figure(figsize=(12, 6))
national_monthly_aqi = df.groupby('Month')['AQI'].mean()
sns.lineplot(x=national_monthly_aqi.index, y=national_monthly_aqi.values, marker='o', color='crimson')
plt.title('National Average Monthly AQI (Seasonal Pattern)', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.savefig('national_monthly_aqi_pattern.png')
print("Generated 'national_monthly_aqi_pattern.png'")

# --- 3. City Level Analysis ---
print("\n--- 3. City Level Analysis ---")
city_avg_aqi = df.groupby('City')['AQI'].mean().sort_values()

# a. Top 10 Most Polluted Cities
top_10_polluted = city_avg_aqi.tail(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_polluted.values, y=top_10_polluted.index, palette='Reds_r')
plt.title('Top 10 Most Polluted Cities in India (by Average AQI)', fontsize=16)
plt.xlabel('Average AQI', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.savefig('top_10_most_polluted_cities.png')
print("Generated 'top_10_most_polluted_cities.png'")

# b. Top 10 Least Polluted Cities
top_10_cleanest = city_avg_aqi.head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_cleanest.values, y=top_10_cleanest.index, palette='Greens_r')
plt.title('Top 10 Least Polluted Cities in India (by Average AQI)', fontsize=16)
plt.xlabel('Average AQI', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.savefig('top_10_least_polluted_cities.png')
print("Generated 'top_10_least_polluted_cities.png'")

# --- 4. Pollutant Analysis ---
print("\n--- 4. Pollutant Analysis ---")
avg_pollutant_levels = df[pollutants[:-1]].mean()
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_pollutant_levels.index, y=avg_pollutant_levels.values, palette='viridis')
plt.title('National Average Concentration of Major Pollutants', fontsize=16)
plt.xlabel('Pollutant', fontsize=12)
plt.ylabel('Average Concentration (µg/m³; CO in mg/m³)', fontsize=12)
plt.savefig('average_pollutant_levels.png')
print("Generated 'average_pollutant_levels.png'")

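Output: the code saves five charts as PNG files (national_yearly_aqi_trend.png, national_monthly_aqi_pattern.png, top_10_most_polluted_cities.png, top_10_least_polluted_cities.png, and average_pollutant_levels.png).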
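
The project outline also mentions heatmaps. A heatmap of pollutants is typically drawn from a correlation matrix, so here is a minimal sketch of computing one with pandas. The data is synthetic (generated so that PM10 tracks PM2.5 and O3 does not), and the column names simply mirror the dataset's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic pollutant readings (not real data): PM10 is built to track PM2.5
pm25 = rng.uniform(20, 200, n)
pm10 = 1.5 * pm25 + rng.normal(0, 10, n)
o3 = rng.uniform(10, 80, n)  # generated independently of PM2.5
aqi = 1.2 * pm25 + rng.normal(0, 15, n)

demo = pd.DataFrame({'PM2.5': pm25, 'PM10': pm10, 'O3': o3, 'AQI': aqi})

# Pearson correlation between every pair of columns
corr = demo.corr()
print(corr.round(2))
```

On the real data, the equivalent would be `df[pollutants].corr()`, and passing the result to `sns.heatmap(corr, annot=True)` would draw it as a heatmap.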

Step 6: Build the AQI Prediction Model

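In this step, we'll prepare the data for a machine learning model that predicts the Air Quality Index (AQI) from pollutant concentrations: PM2.5, PM10, NO2, SO2, CO, and O3. We'll use a Random Forest Regressor because it can capture non-linear relationships between pollutant levels and AQI and does not require feature scaling. The code below defines the features and target and splits the data; training follows in Step 7.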

# Define features (X) and target (y)
features = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
target = 'AQI'
X = df[features]
y = df[target]

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

Output: 

Training data shape: (86428, 6)
Testing data shape: (21607, 6) 
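
`train_test_split` shuffles rows before splitting, so readings from the same period end up in both sets. If the aim is to estimate AQI for dates the model has not seen, a date-ordered split is a stricter check. The sketch below shows the idea on made-up daily data. It is an alternative to consider, not the split used in this article, and the 80% cut-off is an assumption chosen to mirror the split above.

```python
import numpy as np
import pandas as pd

# Toy daily data (made-up values) covering 10 consecutive days
toy = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=10, freq='D'),
    'PM2.5': np.linspace(40, 130, 10),
    'AQI': np.linspace(80, 260, 10),
})

# Sort by date, then use the earliest 80% for training and the rest for testing
toy = toy.sort_values('Date').reset_index(drop=True)
cut = int(len(toy) * 0.8)
train, test = toy.iloc[:cut], toy.iloc[cut:]

print(len(train), len(test))
# Every training date precedes every test date
print(train['Date'].max() < test['Date'].min())
```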

Step 7: Training the AQI Prediction Model

Now that the data is split into training and testing sets, it's time to train a machine learning model. We'll use a Random Forest Regressor to learn the relationship between pollutant levels and AQI.

Here is the code for this step:

# Initialize and train the Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)
print("Model training complete.")

Output: 

Model training complete.

The model is now trained and ready to make predictions on AQI using pollutant data.
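
A fitted Random Forest also reports how much each feature contributed to its splits through the `feature_importances_` attribute. The sketch below illustrates this on synthetic data in which the target is built to depend mostly on PM2.5. The names `X_demo`, `y_demo`, and `demo_model` are illustrative and separate from the project's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 400

# Synthetic features (not real data); the target depends mostly on PM2.5
X_demo = pd.DataFrame({
    'PM2.5': rng.uniform(20, 200, n),
    'NO2': rng.uniform(5, 80, n),
    'O3': rng.uniform(10, 80, n),
})
y_demo = 2.0 * X_demo['PM2.5'] + 0.1 * X_demo['NO2'] + rng.normal(0, 5, n)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(demo_model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).round(3))
```

For the project's model, the same idea would be `pd.Series(model.feature_importances_, index=features)`, which shows which pollutants the forest leans on most.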

Step 8: Model Evaluation

After training the model, it's important to evaluate how well it performs on unseen data. We'll use R-squared (R²) and Root Mean Squared Error (RMSE) to assess prediction accuracy.

Here is the code for this step:

# Make predictions using the test set
y_pred = model.predict(X_test)
# Evaluate performance with R² and RMSE
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the results
print(f"Model R-squared (R²): {r2:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print("An R² score close to 1.0 indicates that the model can explain a large portion of the variance in AQI.")

Output:

Model R-squared (R²): 0.8824
Root Mean Squared Error (RMSE): 40.7108
An R² score close to 1.0 indicates that the model can explain a large portion of the variance in AQI.
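
A single train/test split gives one estimate of performance. K-fold cross-validation repeats the evaluation on several held-out folds, which shows how stable that estimate is. The sketch below uses scikit-learn's `cross_val_score` on synthetic data; the data and the names `X_demo`, `y_demo`, and `cv_model` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for pollutant features and AQI (not real data)
X_demo = rng.uniform(0, 200, size=(300, 3))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 10, 300)

cv_model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validation; each score is the R² on one held-out fold
scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring='r2')
print("R² per fold:", scores.round(3))
print("Mean R²:", scores.mean().round(3))
```

On the project data, the call would take `X` and `y` from Step 6. A wide spread between folds would suggest that the single-split R² depends on which rows happened to land in the test set.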

Step 9: Example Prediction Using the Model

Now that the model is trained and evaluated, let's use it to predict AQI for a new set of pollutant values. This demonstrates how the model works in real-world use cases.

print("\n--- Example Prediction ---")
# Create a sample input with pollutant levels
hypothetical_data = {
    'PM2.5': [150.5],
    'PM10': [250.0],
    'NO2': [50.2],
    'SO2': [25.8],
    'CO': [2.1],
    'O3': [45.5]
}
example_df = pd.DataFrame(hypothetical_data)

# Display the input data
print("Predicting AQI for the following pollutant levels:")
print(example_df)

# Predict AQI using the trained Random Forest model
predicted_aqi = model.predict(example_df)
# Show the predicted AQI value
print(f"\nPredicted AQI: {predicted_aqi[0]:.2f}")

Output: 

--- Example Prediction ---
Predicting AQI for the following pollutant levels:
   PM2.5   PM10   NO2   SO2   CO    O3
0  150.5  250.0  50.2  25.8  2.1  45.5

Predicted AQI: 317.48
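
A raw AQI number is easier to act on once it is mapped to a category. The sketch below uses the bands of India's National Air Quality Index (Good, Satisfactory, Moderate, Poor, Very Poor, Severe). `aqi_bucket` is a hypothetical helper, and assigning fractional values to the band whose upper limit they do not exceed is an assumption.

```python
def aqi_bucket(aqi):
    """Map an AQI value to its category on India's National AQI scale."""
    if aqi <= 50:
        return "Good"
    if aqi <= 100:
        return "Satisfactory"
    if aqi <= 200:
        return "Moderate"
    if aqi <= 300:
        return "Poor"
    if aqi <= 400:
        return "Very Poor"
    return "Severe"

# The predicted value from the example above
print(aqi_bucket(317.48))
# Prints: Very Poor
```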

Final Conclusion

This Air Quality Analysis and Prediction project set out to analyse air quality and predict AQI from key pollutants. After exploring the data and training a Random Forest model, we evaluated it on the held-out test set, where it reached an R² of 0.8824 and an RMSE of 40.71. For the sample pollutant values, the model predicted an AQI of around 317. Within that margin of error, the model is a helpful tool for estimating AQI from pollutant readings and for understanding air pollution patterns.

Colab Link:
https://colab.research.google.com/drive/1Dz2YNoP8Sg1jTskXHk7yAdLAf38nn7RM
