Air Quality Analysis and Prediction Using Random Forest
By Rohit Sharma
Updated on Aug 11, 2025 | 6 min read | 1.38K+ views
Air pollution has become a critical concern in many Indian cities, affecting health and quality of life.
In this project, you'll perform Air Quality Analysis using real-world data collected from monitoring stations across India. By cleaning the data, exploring pollutant trends, and applying machine learning techniques, you'll gain insights into air quality patterns and build a predictive model that estimates AQI from pollutant levels.
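Before you begin working on the Air Quality Analysis and Prediction project, make sure you're comfortable with the tools it uses: Python, pandas and NumPy for data handling, Matplotlib and seaborn for plotting, and scikit-learn for model building and evaluation.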
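To maximise the effectiveness of your Air Quality Analysis and Prediction project, you will utilise the following data science techniques: data cleaning and mean imputation, exploratory analysis with grouped averages, data visualisation, Random Forest regression, and model evaluation with R² and RMSE.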
Project Completion Time: The Air Quality Analysis and Prediction project is estimated to take 3 to 4 hours to complete. This duration may vary based on your proficiency with Python in the areas of data preprocessing, visualisation, and model building.
Let’s build this Air Quality Analysis and Prediction project from scratch with clear, step-by-step guidance:
Step No. | Step Title | Purpose in the Project |
1 | Load the Dataset | Read the air quality dataset (station_day.csv) into a DataFrame for analysis. |
2 | Clean the Data | Handle missing or inconsistent values in pollutant readings and AQI columns. |
3 | Analyse National Air Quality | Explore overall air quality trends across India using descriptive statistics and plots. |
4 | Explore City-Level Pollution | Break down pollution levels by city to identify the most and least polluted areas. |
5 | Visualise Pollutant Trends | Use line plots and heatmaps to visualise PM2.5, PM10, NO2, etc., across cities and months. |
6 | Preprocess the Data for Modelling | Select the pollutant features and split the data into training and test sets. |
7 | Train a Regression Model | Build a Random Forest Regressor to predict AQI based on pollutant levels. |
8 | Evaluate the Model | Measure model performance using R² score and RMSE to assess prediction quality. |
9 | Predict AQI for New Input | Use the trained model to predict AQI for hypothetical pollutant values. |
Let's get started!
To start your Air Quality Analysis and Prediction project, download the "Air Quality Data India station_day.csv" dataset. This dataset, which contains daily pollutant levels and AQI readings for various Indian cities, is available for free online and on Kaggle.
To get started with your Air Quality Analysis and Prediction project, the first step is to import all the necessary libraries. These libraries help with data manipulation, visualisation, model building, and evaluation.
Here’s the breakdown of what each import does:
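import pandas as pd                      # load and manipulate tabular data
import numpy as np                       # numerical helpers (used later for the RMSE square root)
import matplotlib.pyplot as plt          # base plotting library
import seaborn as sns                    # statistical plots built on matplotlib
from sklearn.model_selection import train_test_split      # split data into training and test sets
from sklearn.ensemble import RandomForestRegressor        # the regression model
from sklearn.metrics import r2_score, mean_squared_error  # evaluation metrics
import warnings
warnings.filterwarnings('ignore')        # silence all warnings; note this also hides deprecation notices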
Now that the necessary libraries are imported, the next step is to load your air quality dataset. This dataset contains daily air quality readings from various monitoring stations across Indian cities.
If the file is missing, the script prints an error message and stops, so later steps don't fail on an undefined DataFrame.
try:
    df = pd.read_csv('station_day.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'station_day.csv' not found. Please ensure the file is in the correct directory.")
    # Stop here with a non-zero exit status
    raise SystemExit(1)
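Before cleaning, it can help to see how much of each column is actually missing. The sketch below runs on a tiny made-up frame; the values and the `missing_share` helper name are illustrative and not part of the original project. With the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Tiny illustrative frame; the values are made up, not taken from station_day.csv.
sample = pd.DataFrame({
    "StationId": ["S1", "S1", "S2", "S2"],
    "PM2.5": [80.0, np.nan, 45.0, np.nan],
    "PM10": [150.0, 120.0, np.nan, 90.0],
    "AQI": [180.0, 160.0, 110.0, 95.0],
})

shares = missing_share(sample)
print(shares)
```

Columns with a large missing share deserve extra care, because any fill value you choose will stand in for a big part of the data.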
Once the dataset is loaded, the next step is to clean and structure it for analysis. This includes parsing dates, handling missing values, and extracting useful features like year, month, and city.
Here's how it's done:
# Convert 'Date' to datetime and handle errors
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dropna(subset=['Date'], inplace=True)
# Define key pollutants and AQI
pollutants = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'AQI']
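# Convert pollutant columns to numeric and fill missing values with the column mean
for col in pollutants:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    # Assign the result back: a chained fillna(..., inplace=True) may not update df in newer pandas versions
    df[col] = df[col].fillna(df[col].mean())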
# Create Year and Month columns for trend analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
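# Derive a City label from StationId: the part before '_' if there is one, otherwise the ID itself
# Note: IDs without '_' are kept whole, so the label then identifies a station rather than a city
df['City'] = df['StationId'].apply(lambda x: x.split('_')[0] if '_' in x else x)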
print("Data preparation complete. Shape of cleaned data:", df.shape)
Output:
Data preparation complete. Shape of cleaned data: (108035, 19)
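The cleaning code fills every gap, including gaps in the AQI target, with one national mean. That keeps all rows, but it also means the model is partly trained and scored on imputed AQI values. A possible alternative, shown here only as a sketch on made-up data, is to drop rows without an AQI reading and fill pollutant gaps with the median of the same station. The `impute_by_station` helper is hypothetical and not part of the original project.

```python
import numpy as np
import pandas as pd


def impute_by_station(frame, feature_cols, target_col="AQI", group_col="StationId"):
    """Drop rows without a target value, then fill feature gaps per station."""
    out = frame.dropna(subset=[target_col]).copy()
    for col in feature_cols:
        # First choice: the median of the same station.
        station_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(station_median)
        # Fallback: the overall median, for stations with no readings at all.
        out[col] = out[col].fillna(out[col].median())
    return out


# Made-up example: S2 has no PM2.5 readings and one missing AQI value.
sample = pd.DataFrame({
    "StationId": ["S1", "S1", "S1", "S2", "S2"],
    "PM2.5": [80.0, np.nan, 100.0, np.nan, np.nan],
    "AQI": [180.0, 170.0, 200.0, 90.0, np.nan],
})

cleaned = impute_by_station(sample, ["PM2.5"])
print(cleaned)
```

Whether this beats a plain mean fill depends on the data; comparing both on the held-out set is the safest way to decide.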
In this step, we analyse air quality trends across India at a national level. We'll look at how the average Air Quality Index (AQI) has changed over the years and months to understand long-term patterns and seasonal variations.
# --- 2. National Level Air Quality Analysis ---
print("\n--- 2. National Level Air Quality Analysis ---")
sns.set_style("darkgrid")
# a. National Average AQI Trend (Yearly)
plt.figure(figsize=(12, 6))
national_yearly_aqi = df.groupby('Year')['AQI'].mean()
sns.lineplot(x=national_yearly_aqi.index, y=national_yearly_aqi.values, marker='o', color='royalblue')
plt.title('National Average AQI Trend in India (2015-2020)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.xticks(national_yearly_aqi.index.astype(int))
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.savefig('national_yearly_aqi_trend.png')
print("Generated 'national_yearly_aqi_trend.png'")
# b. National Average AQI Trend (Monthly/Seasonal)
plt.figure(figsize=(12, 6))
national_monthly_aqi = df.groupby('Month')['AQI'].mean()
sns.lineplot(x=national_monthly_aqi.index, y=national_monthly_aqi.values, marker='o', color='crimson')
plt.title('National Average Monthly AQI (Seasonal Pattern)', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average AQI', fontsize=12)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.savefig('national_monthly_aqi_pattern.png')
print("Generated 'national_monthly_aqi_pattern.png'")
# --- 3. City Level Analysis ---
print("\n--- 3. City Level Analysis ---")
city_avg_aqi = df.groupby('City')['AQI'].mean().sort_values()
# a. Top 10 Most Polluted Cities
top_10_polluted = city_avg_aqi.tail(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_polluted.values, y=top_10_polluted.index, palette='Reds_r')
plt.title('Top 10 Most Polluted Cities in India (by Average AQI)', fontsize=16)
plt.xlabel('Average AQI', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.savefig('top_10_most_polluted_cities.png')
print("Generated 'top_10_most_polluted_cities.png'")
# b. Top 10 Least Polluted Cities
top_10_cleanest = city_avg_aqi.head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_cleanest.values, y=top_10_cleanest.index, palette='Greens_r')
plt.title('Top 10 Least Polluted Cities in India (by Average AQI)', fontsize=16)
plt.xlabel('Average AQI', fontsize=12)
plt.ylabel('City', fontsize=12)
plt.savefig('top_10_least_polluted_cities.png')
print("Generated 'top_10_least_polluted_cities.png'")
# --- 4. Pollutant Analysis ---
print("\n--- 4. Pollutant Analysis ---")
avg_pollutant_levels = df[pollutants[:-1]].mean()
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_pollutant_levels.index, y=avg_pollutant_levels.values, palette='viridis')
plt.title('National Average Concentration of Major Pollutants', fontsize=16)
plt.xlabel('Pollutant', fontsize=12)
plt.ylabel('Average Concentration (µg/m³; CO in mg/m³)', fontsize=12)
plt.savefig('average_pollutant_levels.png')
print("Generated 'average_pollutant_levels.png'")
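Output: the script prints the section headers and a "Generated ..." message for each chart, and saves five PNG files: national_yearly_aqi_trend.png, national_monthly_aqi_pattern.png, top_10_most_polluted_cities.png, top_10_least_polluted_cities.png, and average_pollutant_levels.png.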
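The step table also mentions heatmaps for comparing cities across months. One way to build such a view is to pivot the data into a city-by-month grid of average AQI and pass it to `sns.heatmap`. The sketch below uses synthetic data with placeholder city names "A", "B", and "C"; with the real data you would pivot the cleaned `df` instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the cleaned frame: 3 cities, 2 readings per month each.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "City": np.repeat(["A", "B", "C"], 24),
    "Month": np.tile(np.arange(1, 13), 6),
    "AQI": rng.uniform(50, 400, size=72),
})

# Rows are cities, columns are months, cells hold the mean AQI.
city_month = demo.pivot_table(index="City", columns="Month", values="AQI", aggfunc="mean")

fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(city_month, cmap="Reds", annot=True, fmt=".0f", ax=ax)
ax.set_title("Average AQI by City and Month (synthetic data)")
ax.set_xlabel("Month")
ax.set_ylabel("City")
# Call fig.savefig("city_month_aqi_heatmap.png") here to write the chart to disk.
plt.close(fig)
```

With many cities the grid gets tall, so it is often clearer to restrict it to the ten most polluted cities first.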
In this step, we'll prepare the data for a machine learning model that predicts the Air Quality Index (AQI) from pollutant concentrations: PM2.5, PM10, NO2, SO2, CO, and O3. We'll use a Random Forest Regressor because it handles non-linear relationships well and does not require feature scaling.
# Define features (X) and target (y)
features = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
target = 'AQI'
X = df[features]
y = df[target]
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
Output:
Training data shape: (86428, 6)
Testing data shape: (21607, 6)
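A random row-level split puts readings from the same station into both the training and the test set, which can make the test score look better than it would on a station the model has never seen. If that matters for your use case, a group-aware split is one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on synthetic data; the station IDs and the formula for the synthetic AQI are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the cleaned data: 200 rows spread over 10 stations.
rng = np.random.default_rng(42)
n_rows = 200
demo = pd.DataFrame({
    "StationId": rng.choice([f"S{i}" for i in range(10)], size=n_rows),
    "PM2.5": rng.uniform(10, 300, size=n_rows),
    "PM10": rng.uniform(20, 400, size=n_rows),
})
demo["AQI"] = 0.8 * demo["PM2.5"] + 0.3 * demo["PM10"] + rng.normal(0, 5, size=n_rows)

X = demo[["PM2.5", "PM10"]]
y = demo["AQI"]
groups = demo["StationId"]

# Hold out roughly 20% of the stations (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

train_stations = set(groups.iloc[train_idx])
test_stations = set(groups.iloc[test_idx])
print("Stations in both sets:", train_stations & test_stations)
```

Scores from a group-aware split are usually lower than from a random split, but they better reflect how the model behaves on new stations.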
Now that the data is split into training and testing sets, it's time to train a machine learning model. We'll use a Random Forest Regressor to learn the relationship between pollutant levels and AQI.
Here is the code for this step:
# Initialize and train the Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)
print("Model training complete.")
Output:
Model training complete.
The model is now trained and ready to make predictions on AQI using pollutant data.
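Once a Random Forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the predictions. The sketch below shows the pattern on synthetic data, where the target is deliberately built mostly from PM2.5; on the real data you would read the attribute from the trained `model` and use the `features` list as the index.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; the target depends strongly on PM2.5, weakly on PM10, not on O3.
rng = np.random.default_rng(0)
n_rows = 500
X_demo = pd.DataFrame({
    "PM2.5": rng.uniform(10, 300, size=n_rows),
    "PM10": rng.uniform(20, 400, size=n_rows),
    "O3": rng.uniform(5, 100, size=n_rows),
})
y_demo = 1.5 * X_demo["PM2.5"] + 0.2 * X_demo["PM10"] + rng.normal(0, 5, size=n_rows)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger means the feature was used for more useful splits.
importances = pd.Series(
    demo_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

These importances are impurity-based, so they can be skewed when features are strongly correlated, as PM2.5 and PM10 often are.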
After training the model, it's important to evaluate how well it performs on unseen data. We'll use R-squared (R²) and Root Mean Squared Error (RMSE) to assess prediction accuracy.
Here is the code for this step:
# Make predictions using the test set
y_pred = model.predict(X_test)
# Evaluate performance with R² and RMSE
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Print the results
print(f"Model R-squared (R²): {r2:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print("An R² score close to 1.0 indicates that the model can explain a large portion of the variance in AQI.")
Output:
Model R-squared (R²): 0.8824
Root Mean Squared Error (RMSE): 40.7108
An R² score close to 1.0 indicates that the model can explain a large portion of the variance in AQI.
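To make the two metrics concrete, the sketch below computes them by hand on four made-up values and compares the model error with a baseline that always predicts the mean. RMSE is expressed in AQI points, and R² measures how much better the predictions are than that mean baseline. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted AQI values.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared residual.
residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean (here the mean of y_true, for simplicity;
# in practice you would use the training-set mean).
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(f"Manual RMSE: {rmse_manual:.4f}, sklearn R²: {r2_score(y_true, y_hat):.4f}")
print(f"Manual R²: {r2_manual:.4f}, baseline RMSE: {rmse_baseline:.4f}")
```

Reading the RMSE next to a baseline like this helps judge whether an error of about 40 AQI points is small or large for the data at hand.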
Now that the model is trained and evaluated, let's use it to predict AQI for a new set of pollutant values. This demonstrates how the model works in real-world use cases.
print("\n--- Example Prediction ---")
# Create a sample input with pollutant levels
hypothetical_data = {
    'PM2.5': [150.5],
    'PM10': [250.0],
    'NO2': [50.2],
    'SO2': [25.8],
    'CO': [2.1],
    'O3': [45.5]
}
example_df = pd.DataFrame(hypothetical_data)
# Display the input data
print("Predicting AQI for the following pollutant levels:")
print(example_df)
# Predict AQI using the trained Random Forest model
predicted_aqi = model.predict(example_df)
# Show the predicted AQI value
print(f"\nPredicted AQI: {predicted_aqi[0]:.2f}")
Output:
--- Example Prediction ---
Predicting AQI for the following pollutant levels:
PM2.5 PM10 NO2 SO2 CO O3
0 150.5 250.0 50.2 25.8 2.1 45.5
Predicted AQI: 317.48
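A raw number like 317.48 is easier to interpret when it is mapped to a category. The sketch below uses the bands commonly published for India's national AQI (Good, Satisfactory, Moderate, Poor, Very Poor, Severe). Treat the cut-offs as an assumption and check them against the official source before relying on them; the `aqi_bucket` helper is hypothetical and not part of the original project.

```python
def aqi_bucket(aqi: float) -> str:
    """Map an AQI value to a category label (assumed band limits, see lead-in)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # (upper limit, label) pairs in ascending order.
    bands = [
        (50, "Good"),
        (100, "Satisfactory"),
        (200, "Moderate"),
        (300, "Poor"),
        (400, "Very Poor"),
    ]
    for upper, label in bands:
        if aqi <= upper:
            return label
    return "Severe"


predicted_value = 317.48  # the prediction from the example above
print(f"AQI {predicted_value:.2f} falls in the '{aqi_bucket(predicted_value)}' band")
```

With the trained model you could call `aqi_bucket(predicted_aqi[0])` directly after the prediction step.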
This Air Quality Analysis and Prediction project set out to analyse air quality across India and predict AQI from six key pollutants. After exploring the data and training a Random Forest model, we evaluated it on the held-out test set, where it reached an R² of 0.8824 and an RMSE of about 40.7 AQI points. For the sample pollutant values, the model predicted an AQI of around 317. This makes the model a helpful tool for estimating air quality from measured pollutant levels.
Colab Link:
https://colab.research.google.com/drive/1Dz2YNoP8Sg1jTskXHk7yAdLAf38nn7RM
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...