Crime Rate Prediction by City Using Python and Machine Learning

By Rohit Sharma

Updated on Jul 28, 2025 | 21 min read | 1.18K+ views

Predicting crime rates using data science is both a challenging and impactful task. In this project, we will explore how machine learning and real-world data can help us understand crime trends across different cities.

We'll work with a publicly available dataset of city-level crime statistics (sourced from Kaggle), using tools such as Pandas, NumPy, and Scikit-learn. By applying regression and classification models, performing feature engineering, and visualizing spatial data, we aim to uncover the social and economic factors that influence crime.

If you're aiming to fast-track your data science career, explore the Online Data Science Courses offered by upGrad. These courses cover essential tools like Python, Machine Learning, AI, SQL, Tableau, and more, taught by industry-leading faculty. Take the next step and enroll today!

Ignite your next big idea with our expertly curated collection of Python-based data science projects, perfect for sharpening your skills and building real-world experience.

What Should You Know Before Getting Started?

It’s helpful to have some basic knowledge of the following before starting this Crime Rate Prediction project:

  • Python programming (variables, functions, loops, basic syntax)
  • Pandas and NumPy (for handling and analyzing data)
  • Matplotlib or Seaborn (for creating charts and visualizing trends)
  • Scikit-learn (for building and evaluating classification and regression models)
  • Spatial and demographic features (incorporating city-wise social and economic indicators into the model)
  • Working with real-world datasets (handling missing values, encoding categorical variables, feature scaling)
  • Merging external data (e.g., holiday calendars) for correlation studies

Start your data science career journey with upGrad’s top-ranked courses and gain the opportunity to learn directly from experienced industry mentors.

How We Built This: Tools and Tech Stack

For the Crime Rate Prediction Project, we used the following tools and libraries to gather, process, model, and visualize the data effectively:

Tool / Library | Purpose
Python | Core programming language for scripting, analysis, and model building
Pandas | Data loading, cleaning, manipulation, and feature engineering
NumPy | Numerical operations and efficient array handling
Matplotlib | Creating basic plots and visual insights into data trends
Seaborn | Advanced statistical visualizations for correlation and distribution plots
Scikit-learn | Building and evaluating regression/classification models
XGBoost | Gradient-boosted tree models for regression and classification
Plotly | Interactive maps and animated visualizations
Jupyter/Colab | Interactive coding and experimentation environment
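
Most of these libraries come pre-installed in Google Colab. If you run the project locally instead, a minimal one-cell install sketch (covering the extras used later, such as XGBoost and Plotly) looks like this:

# In a Jupyter/Colab cell; most of these are already available in Colab
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost plotly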

The Brains Behind the Predictions: Our Models

We’ll use a few straightforward yet powerful techniques to understand and predict crime in this project:

  • Regression & Classification Models – Predict crime frequency and type using machine learning algorithms.
  • Feature Engineering – Extract meaningful insights from raw crime, social, and economic data.
  • Correlation Analysis – Identify key factors that influence crime rates.
  • Data Visualization – Use charts and maps to highlight crime trends across cities.
  • Real-World Crime Data – Work with a real city-level crime dataset (from Kaggle) to ensure practical, realistic insights.

Time Taken To Complete Crime Rate Prediction Project

You can complete this Crime Rate Prediction Project in approximately 2 to 3 hours. The difficulty is rated as Moderate, ideal for learners with basic Python knowledge and an interest in applying machine learning to real-world crime data.

How to Build a Crime Rate Prediction by City Project

Let’s start building the project from scratch. We'll go step-by-step through the process of:

  1. Load and Explore the Crime Dataset
  2. Clean and Prepare the Data
  3. Engineer Relevant Features
  4. Visualize Crime Patterns and Trends
  5. Apply Classification and Regression Models
  6. Evaluate Model Performance
  7. Incorporate Socio-Economic Data for Deeper Insights

Without any further delay, let’s get started!

Step 1: Download the Dataset

To build our Crime Rate Prediction Project, we’ll use a publicly available Kaggle dataset containing real-world crime statistics for Indian cities (similar data is also published by sources like the FBI Crime Data Explorer). The dataset helps us analyze patterns, build predictive models, and explore how social and economic factors influence crime rates.

Follow the steps below to download the dataset:

  1. Open a new tab in your web browser.
  2. Go to: Kaggle
  3. Search for the dataset and click the Download button to download the dataset as a .zip file.
  4. Once downloaded, extract the ZIP file.
  5. We’ll use this CSV file for the project.
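
If you prefer to script the extraction instead of unzipping manually, here is a minimal sketch (the ZIP file name below is a placeholder; use the name of your downloaded file):

# Extract the downloaded archive programmatically
# 'crime_dataset.zip' is a hypothetical name; replace it with your file's name
import zipfile

with zipfile.ZipFile('crime_dataset.zip', 'r') as z:
    z.extractall('.')  # extracts the CSV into the current working directory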

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Upload, Explore, and Clean the Dataset in Google Colab

Now that you have downloaded the dataset, upload the CSV file to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()
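
Alternatively, if the CSV lives in your Google Drive, you can mount the drive once instead of re-uploading the file each session. A minimal sketch (the file path is an assumption; adjust it to wherever you saved the dataset):

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')  # authorize Colab to access your Drive
df = pd.read_csv('/content/drive/MyDrive/new_dataset.csv')  # hypothetical path; adjust to your file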

Once uploaded, import the required libraries and use the following Python code to read and check the data:

# Importing necessary libraries for data manipulation, visualization, and warning control
import pandas as pd                 # For data loading and manipulation (DataFrames)
import numpy as np                  # For numerical operations
import matplotlib.pyplot as plt     # For basic visualizations
import seaborn as sns               # For advanced statistical visualizations
import plotly.express as px         # For interactive plots
import plotly.graph_objects as go   # For more customized Plotly graphs
from plotly.subplots import make_subplots  # For creating multi-plot layouts
import warnings                     # To suppress warning messages
warnings.filterwarnings('ignore')   # Ignoring all warnings for clean output

# Step 1: Load the dataset
df = pd.read_csv('new_dataset.csv')  # Replace 'new_dataset.csv' with the path to your dataset file

# Step 2: Get general information about the dataset
print("Dataset Info:")
print(df.info())  # Displays column names, data types, and non-null values

# Step 3: Preview the first few rows to understand structure
print("\nFirst few rows:")
print(df.head())  # Shows first 5 records in the dataset


# Step 4: Check for missing values in each column
print("\nMissing values:")
print(df.isnull().sum())  # Important for data cleaning steps

Output:

Dataset Overview

Column Name | Non-Null Count | Data Type
Year | 1520 | int64
City | 1520 | object
Population (in Lakhs) (2011)+ | 1520 | float64
Type | 1520 | object
Crime Rate | 1520 | float64

Total Entries: 1520

Total Columns: 5

Sample Records:

Year | City | Population (Lakhs) | Type | Crime Rate
2014 | Ahmedabad | 63.5 | Murder | 1.291339
2015 | Ahmedabad | 63.5 | Murder | 1.480315
2016 | Ahmedabad | 63.5 | Murder | 1.622047
2017 | Ahmedabad | 63.5 | Murder | 1.417323
2018 | Ahmedabad | 63.5 | Murder | 1.543307

Missing Values Check:

Column | Missing Values
Year | 0
City | 0
Population (in Lakhs) (2011)+ | 0
Type | 0
Crime Rate | 0

Conclusion: No missing data detected; the dataset is clean and ready for analysis.
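
No cleaning is required here, but if your copy of the dataset did contain gaps, a minimal handling sketch (using this dataset's column names) might look like the following:

# Hypothetical cleaning steps; only needed if isnull().sum() reports missing values
df['Crime Rate'] = df['Crime Rate'].fillna(df['Crime Rate'].median())  # numeric: fill with median
df['Type'] = df['Type'].fillna(df['Type'].mode()[0])                   # categorical: fill with mode
df = df.dropna(subset=['Year', 'City'])                                # drop rows missing key identifiers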

Step 3: Feature Engineering - Making Data Smarter for Prediction

Before feeding data into machine learning models, we need to transform raw values into meaningful features that can help the model learn better. This step is called feature engineering.

In this step, we:

  • Normalize year values to remove scale differences.
  • Log-transform population and crime rate for better distribution.
  • Categorize cities based on population size.
  • Calculate year-over-year changes in the crime rate.
  • Compute rolling averages to smooth out fluctuations.
  • Encode text labels (like city and crime type) into numeric form.

Here is the code:

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Function to generate useful features for ML model
def create_features(df):
    """
    Create engineered features for crime prediction
    """
    df = df.copy()  # Work on a copy to avoid changing the original

    #  Normalize the 'Year' to a 0–1 range
    df['Year_normalized'] = (df['Year'] - df['Year'].min()) / (df['Year'].max() - df['Year'].min())

    #  Population Features
    df['Population_log'] = np.log1p(df['Population (in Lakhs) (2011)+'])  # Log transformation
    df['Population_scaled'] = df['Population (in Lakhs) (2011)+'] / 100   # Rescale for better model behavior

    #  Crime Rate Feature Transformation
    df['Crime_Rate_log'] = np.log1p(df['Crime Rate'])  # Helps reduce skewness

    #  City Categorization based on Population Size
    city_pop_map = df.groupby('City')['Population (in Lakhs) (2011)+'].first().to_dict()
    df['City_Population_Category'] = df['City'].map(city_pop_map)
    df['City_Size'] = pd.cut(
        df['City_Population_Category'],
        bins=[0, 25, 50, 100, float('inf')],
        labels=['Small', 'Medium', 'Large', 'Mega']
    )

    #  Year-over-Year Change in Crime Rate
    df = df.sort_values(['City', 'Type', 'Year'])  # Important for calculating lag
    df['Crime_Rate_lag1'] = df.groupby(['City', 'Type'])['Crime Rate'].shift(1)
    df['Crime_Rate_change'] = df['Crime Rate'] - df['Crime_Rate_lag1']
    df['Crime_Rate_pct_change'] = df['Crime_Rate_change'] / (df['Crime_Rate_lag1'] + 1e-8)

    #  Rolling Average of Crime Rate (3-year window)
    df['Crime_Rate_rolling_3yr'] = (
        df.groupby(['City', 'Type'])['Crime Rate']
        .rolling(window=3, min_periods=1)
        .mean()
        .reset_index(level=[0,1], drop=True)
    )

    # Encode Categorical Variables to Numeric
    le_city = LabelEncoder()
    le_type = LabelEncoder()
    df['City_encoded'] = le_city.fit_transform(df['City'])
    df['Type_encoded'] = le_type.fit_transform(df['Type'])

    return df, le_city, le_type

# Apply Feature Engineering
df_features, city_encoder, type_encoder = create_features(df)

print(" Feature engineering completed!")
print("New features shape:", df_features.shape)

Output:

Feature engineering completed!
New features shape: (1520, 17)

This means our dataset now has:

  • 1520 rows (same number of data records as before)
  • 17 columns, which include:
    • The original 5 columns
    • 12 newly engineered features, such as normalized year, log-transformed population, crime rate changes, rolling averages, city size categories, and encoded labels.
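
To verify exactly which columns were added, you can compare the engineered DataFrame against the original five columns. A quick sketch:

# List the engineered columns added by create_features()
original_cols = {'Year', 'City', 'Population (in Lakhs) (2011)+', 'Type', 'Crime Rate'}
new_cols = [c for c in df_features.columns if c not in original_cols]
print("Engineered features:", new_cols)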

Step 4: Crime Rate Prediction Using Regression Models

In this step, we use regression algorithms to predict the crime rate for each city and crime type based on various factors such as population, year, and past crime trends.

What this step includes:

  • Selecting relevant features and the target variable.
  • Handling missing values due to lag-based features.
  • Splitting the dataset into training and test sets.
  • Scaling numeric features for models that require normalization.
  • Training multiple regression models (Linear, Ridge, Lasso, Random Forest, XGBoost).
  • Evaluating each model using RMSE, MAE, and R² metrics.

Here is the code:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def regression_analysis(df):
    """
    Perform regression analysis for crime rate prediction.
    Trains multiple models and evaluates their performance.
    """
    
    # 1. Select features and target variable
    feature_cols = [
        'Year', 
        'Population (in Lakhs) (2011)+', 
        'City_encoded', 
        'Type_encoded',
        'Population_log', 
        'Year_normalized', 
        'Crime_Rate_lag1', 
        'Crime_Rate_rolling_3yr'
    ]
    
    # 2. Drop rows with missing values (due to lag features)
    df_clean = df.dropna(subset=feature_cols + ['Crime Rate'])
    
    # 3. Define features (X) and target (y)
    X = df_clean[feature_cols]
    y = df_clean['Crime Rate']
    
    # 4. Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df_clean['Type']
    )
    
    # 5. Normalize features for models that need scaled input
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # 6. Define regression models to evaluate
    regression_models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(alpha=1.0),
        'Lasso Regression': Lasso(alpha=0.1),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42)
    }
    
    regression_results = {}

    # 7. Train and evaluate each model
    for name, model in regression_models.items():
        print(f"\nTraining {name}...")
        
        # Use scaled features for linear models
        if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression']:
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
        
        # 8. Calculate evaluation metrics
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        rmse = np.sqrt(mse)
        
        regression_results[name] = {
            'model': model,
            'predictions': y_pred,
            'MSE': mse,
            'RMSE': rmse,
            'MAE': mae,
            'R2': r2
        }
        
        # Print model performance
        print(f"{name} - RMSE: {rmse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
    
    # 9. Return results and test data for further analysis or plotting
    return regression_results, X_test, y_test, scaler

# Run the regression analysis
regression_results, X_test_reg, y_test_reg, scaler = regression_analysis(df_features)

Output: 

We trained five different regression models to predict crime rates across Indian cities and crime types using engineered features. Here's how each model performed on the test data:

Model | RMSE | MAE | R² Score
Linear Regression | 4.1573 | 1.6639 | 0.9441
Ridge Regression | 4.2200 | 1.6795 | 0.9424
Lasso Regression | 4.6333 | 1.8087 | 0.9305
Random Forest | 4.9341 | 1.9306 | 0.9212
XGBoost Regressor | 4.7356 | 1.8450 | 0.9274

Insights:

  • Linear Regression did the best job at predicting crime rates. This is a simple model, but it worked very well with our data, even better than some of the advanced models!
  • Ridge Regression also performed well and gave similar results to Linear Regression. It’s good at handling small errors and avoiding overfitting.
  • Lasso Regression was okay, but slightly worse. It’s helpful when we want to remove less important features, but in our case, that wasn’t necessary.
  • Random Forest and XGBoost, which are advanced models, didn’t perform better than the simpler models. This means our data didn’t have complex patterns that required such heavy models — simpler ones worked just fine!
  • All models gave an R² score above 92%, which means they were able to explain most of the changes in the crime rate based on the features we used.
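
To sanity-check the winning model visually, you can plot predicted against actual crime rates. Here is a minimal sketch, reusing the regression_results and y_test_reg variables returned by regression_analysis() above:

import matplotlib.pyplot as plt

# Predictions from the best-performing model
best_preds = regression_results['Linear Regression']['predictions']

plt.figure(figsize=(6, 6))
plt.scatter(y_test_reg, best_preds, alpha=0.5)

# A perfect model's points would fall on the y = x reference line
lims = [min(y_test_reg.min(), best_preds.min()),
        max(y_test_reg.max(), best_preds.max())]
plt.plot(lims, lims, linestyle='--', color='red')
plt.xlabel('Actual Crime Rate')
plt.ylabel('Predicted Crime Rate')
plt.title('Linear Regression: Predicted vs Actual Crime Rates')
plt.show()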

Step 5: Classification Analysis for Predicting Crime Type

In this step, we aim to predict the category of crime (e.g., Murder, Cyber Crimes) using supervised machine learning classification techniques.

What We Do in This Step:

  • Define it as a multi-class classification problem
  • Use features like year, population, city, and past crime data
  • Prepare and scale the data
  • Train and compare multiple classification models
  • Evaluate their accuracy and generate performance reports

Here is the code:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost as xgb

def classification_analysis(df):
    """
    Perform classification analysis for predicting crime type
    """
    # Select important features for classification
    feature_cols = [
        'Year', 'Population (in Lakhs) (2011)+', 'City_encoded', 'Crime Rate',
        'Population_log', 'Year_normalized', 'Crime_Rate_lag1', 'Crime_Rate_rolling_3yr'
    ]
    
    # Drop rows with NaN values due to lag/rolling features
    df_clean = df.dropna(subset=feature_cols + ['Type'])

    # Define feature matrix (X) and target variable (y)
    X = df_clean[feature_cols]
    y = df_clean['Type_encoded']  # Use encoded labels for classification

    # Split data into training and testing sets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Normalize features (important for some models like logistic regression)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Define classification models
    classification_models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'XGBoost Classifier': xgb.XGBClassifier(n_estimators=100, random_state=42)
    }

    classification_results = {}

    # Train and evaluate each model
    for name, model in classification_models.items():
        print(f"\nTraining {name}...")

        # Fit on scaled data for logistic regression, original for others
        if name == 'Logistic Regression':
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
            y_pred_proba = model.predict_proba(X_test_scaled)
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_pred_proba = model.predict_proba(X_test)

        # Evaluate accuracy
        accuracy = accuracy_score(y_test, y_pred)

        # Store results
        classification_results[name] = {
            'model': model,
            'predictions': y_pred,
            'probabilities': y_pred_proba,
            'accuracy': accuracy
        }

        # Print classification performance report
        print(f"{name} Accuracy: {accuracy:.4f}")
        print(f"Classification Report:\n{classification_report(y_test, y_pred)}")

    return classification_results, X_test, y_test, scaler

# Run classification analysis
classification_results, X_test_class, y_test_class, class_scaler = classification_analysis(df_features)

Output:

Training Logistic Regression...
Logistic Regression Accuracy: 0.2707
Classification Report:
              precision    recall  f1-score   support

           0       0.23      0.19      0.21        26
           1       0.21      0.19      0.20        26
           2       0.34      0.96      0.50        27
           3       0.16      0.11      0.13        27
           4       0.11      0.07      0.09        27
           5       0.45      0.54      0.49        26
           6       0.30      0.26      0.28        27
           7       0.26      0.19      0.22        27
           8       0.10      0.08      0.09        26
           9       0.25      0.11      0.15        27

    accuracy                           0.27       266
   macro avg       0.24      0.27      0.24       266
weighted avg       0.24      0.27      0.24       266

Training Random Forest...
Random Forest Accuracy: 0.4737
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.38      0.48        26
           1       0.60      0.69      0.64        26
           2       0.75      0.89      0.81        27
           3       0.43      0.33      0.38        27
           4       0.33      0.37      0.35        27
           5       0.46      0.42      0.44        26
           6       0.32      0.22      0.26        27
           7       0.37      0.48      0.42        27
           8       0.19      0.19      0.19        26
           9       0.61      0.74      0.67        27

    accuracy                           0.47       266
   macro avg       0.47      0.47      0.46       266
weighted avg       0.47      0.47      0.46       266

Training XGBoost Classifier...
XGBoost Classifier Accuracy: 0.4586
Classification Report:
              precision    recall  f1-score   support

           0       0.32      0.23      0.27        26
           1       0.65      0.65      0.65        26
           2       0.75      0.89      0.81        27
           3       0.25      0.22      0.24        27
           4       0.33      0.37      0.35        27
           5       0.50      0.50      0.50        26
           6       0.26      0.22      0.24        27
           7       0.38      0.41      0.39        27
           8       0.38      0.38      0.38        26
           9       0.61      0.70      0.66        27

    accuracy                           0.46       266
   macro avg       0.44      0.46      0.45       266
weighted avg       0.44      0.46      0.45       266

Summary of the Output:

Model | Accuracy | Best Predicted Classes | Poorly Predicted Classes | Notes
Logistic Regression | 0.2707 | Class 2, Class 5 | Most other classes | Performs poorly overall; only Class 2 shows strong recall (0.96).
Random Forest | 0.4737 | Classes 1, 2, 9 | Classes 6, 8 | Balanced performance; better than logistic regression.
XGBoost Classifier | 0.4586 | Classes 1, 2, 9 | Classes 3, 6 | Similar to Random Forest, with better performance on Class 8.
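
Beyond accuracy, a confusion matrix shows which crime types get mixed up with each other. Here is a minimal sketch, reusing the classification_results and y_test_class variables returned by classification_analysis() above:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion matrix for the best classifier (Random Forest)
rf_preds = classification_results['Random Forest']['predictions']
cm = confusion_matrix(y_test_class, rf_preds)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.title('Random Forest Confusion Matrix (Crime Types)')
plt.show()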

 

Step 6: Spatial Visualization of Crime Data

In this section, we explore how crime varies geographically across Indian cities using interactive maps and bar plots. Specifically, we:

  • Map cities to their respective latitude and longitude coordinates.
  • Visualize how crime rates change over time in different cities using animated maps.
  • Show average crime rates per city using a choropleth-style map.
  • Display how different crime types are distributed across cities via bar charts.

These visualizations help us detect regional crime patterns and understand spatial trends.

Here is the Code:

# Dictionary containing latitude and longitude for each city
city_coordinates = {
    'Ahmedabad': [23.0225, 72.5714],
    'Bengaluru': [12.9716, 77.5946],
    'Chennai': [13.0827, 80.2707],
    'Coimbatore': [11.0168, 76.9558],
    'Delhi': [28.7041, 77.1025],
    'Ghaziabad': [28.6692, 77.4538],
    'Hyderabad': [17.3850, 78.4867],
    'Indore': [22.7196, 75.8577],
    'Jaipur': [26.9124, 75.7873],
    'Kanpur': [26.4499, 80.3319],
    'Kochi': [9.9312, 76.2673],
    'Kolkata': [22.5726, 88.3639],
    'Kozhikode': [11.2588, 75.7804],
    'Lucknow': [26.8467, 80.9462],
    'Mumbai': [19.0760, 72.8777],
    'Nagpur': [21.1458, 79.0882],
    'Patna': [25.5941, 85.1376],
    'Pune': [18.5204, 73.8567],
    'Surat': [21.1702, 72.8311]
}

def create_spatial_visualizations(df):
    """
    Generate spatial visualizations of crime data across Indian cities.
    """
    # Create a copy of the dataframe to avoid modifying the original
    df_spatial = df.copy()

    # Add Latitude and Longitude columns using the city coordinates
    df_spatial['Latitude'] = df_spatial['City'].map(lambda x: city_coordinates[x][0])
    df_spatial['Longitude'] = df_spatial['City'].map(lambda x: city_coordinates[x][1])

    # Aggregate crime rate and population by city and year
    city_year_agg = df_spatial.groupby(['City', 'Year', 'Latitude', 'Longitude']).agg({
        'Crime Rate': 'mean',
        'Population (in Lakhs) (2011)+': 'first'
    }).reset_index()

    # Create an animated scatter map showing crime rate changes over years
    fig = px.scatter_mapbox(
        city_year_agg,
        lat='Latitude',
        lon='Longitude',
        size='Crime Rate',
        color='Crime Rate',
        hover_name='City',
        hover_data=['Year', 'Population (in Lakhs) (2011)+'],
        animation_frame='Year',
        title='Crime Rates Across Indian Cities Over Time',
        mapbox_style='open-street-map',
        height=600,
        color_continuous_scale='Reds'
    )
    fig.update_layout(mapbox_zoom=4, mapbox_center_lat=20, mapbox_center_lon=77)
    fig.show()

    # Create a static scatter map showing average crime rate per city (choropleth-style)
    crime_by_city = df_spatial.groupby('City').agg({
        'Crime Rate': 'mean',
        'Latitude': 'first',
        'Longitude': 'first'
    }).reset_index()

    fig2 = px.scatter_mapbox(
        crime_by_city,
        lat='Latitude',
        lon='Longitude',
        size='Crime Rate',
        color='Crime Rate',
        hover_name='City',
        title='Average Crime Rates by City (2014–2021)',
        mapbox_style='open-street-map',
        height=600,
        color_continuous_scale='Reds'
    )
    fig2.update_layout(mapbox_zoom=4, mapbox_center_lat=20, mapbox_center_lon=77)
    fig2.show()

    # Group data to show distribution of crime types in each city
    crime_type_city = df_spatial.groupby(['City', 'Type']).agg({
        'Crime Rate': 'mean'
    }).reset_index()

    # Bar plot showing crime type distribution per city
    fig3 = px.bar(
        crime_type_city,
        x='City',
        y='Crime Rate',
        color='Type',
        title='Crime Type Distribution Across Cities',
        height=600
    )
    fig3.update_xaxes(tickangle=45)
    fig3.show()

# Call the function to create visualizations
create_spatial_visualizations(df_features)

Output:


Note: The charts above are interactive in the live notebook; they appear here as static images for display purposes. In a real project or dashboard, you can hover, zoom, and filter the data directly on these plots for deeper exploration.
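
If you need static copies of the Plotly maps for a report, Plotly can export them via the optional kaleido package. A minimal sketch (assuming fig is one of the figures created above):

# One-time install in Colab: !pip install kaleido
fig.write_image('crime_map.png')  # saves the interactive map as a static PNG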

Step 7: Analyze Socio-Economic Trends and Their Impact on Crime

In this step, we explore how various socioeconomic factors correlate with crime patterns across cities and time. This visual analysis helps uncover patterns and guide data-driven policy insights.

We perform the following analyses:

  • Population vs. Crime Rate: Identify whether larger cities experience higher crime rates.
  • Yearly Crime Trends: Understand whether crime is increasing or decreasing over the years.
  • Crime Type Distribution: Reveal which types of crime are more prevalent.
  • Top 5 Most Affected Cities: Highlight cities with the highest crime rates.

Here is the code:
def analyze_socioeconomic_trends(df):
    """
    Analyze socio-economic trends and their impact on crime using visualizations.
    """
    # Set the overall figure size
    plt.figure(figsize=(15, 10))
    
    # -------------------------------
    # Subplot 1: Population vs Crime Rate
    # -------------------------------
    plt.subplot(2, 3, 1)
    
    # Group data to get one population value and average crime rate per city
    city_stats = df.groupby('City').agg({
        'Population (in Lakhs) (2011)+': 'first',
        'Crime Rate': 'mean'
    }).reset_index()
    
    # Scatter plot to see correlation between population and crime rate
    plt.scatter(city_stats['Population (in Lakhs) (2011)+'], city_stats['Crime Rate'])
    plt.xlabel('Population (in Lakhs)')
    plt.ylabel('Average Crime Rate')
    plt.title('Population vs Crime Rate')

    # Add city names next to their points
    for i, row in city_stats.iterrows():
        plt.annotate(row['City'], (row['Population (in Lakhs) (2011)+'], row['Crime Rate']), 
                    fontsize=8, alpha=0.7)
    
    # -------------------------------
    # Subplot 2: Overall Crime Rate Trend Over Time
    # -------------------------------
    plt.subplot(2, 3, 2)
    
    # Average crime rate per year across all cities
    yearly_trend = df.groupby('Year')['Crime Rate'].mean()
    
    # Line plot to show trend
    plt.plot(yearly_trend.index, yearly_trend.values, marker='o')
    plt.xlabel('Year')
    plt.ylabel('Average Crime Rate')
    plt.title('Overall Crime Rate Trend')

    # -------------------------------
    # Subplot 3: Crime Type Distribution
    # -------------------------------
    plt.subplot(2, 3, 3)
    
    # Average crime rate per crime type
    crime_type_avg = df.groupby('Type')['Crime Rate'].mean().sort_values(ascending=False)
    
    # Bar plot to compare average crime rates across types
    plt.bar(range(len(crime_type_avg)), crime_type_avg.values)
    plt.xticks(range(len(crime_type_avg)), crime_type_avg.index, rotation=45)
    plt.ylabel('Average Crime Rate')
    plt.title('Average Crime Rate by Type')

    # -------------------------------
    # Subplot 4: Top 5 Cities by Crime Rate
    # -------------------------------
    plt.subplot(2, 3, 4)
    
    # Select top 5 cities based on average crime rate
    top_cities = city_stats.nlargest(5, 'Crime Rate')
    
    # Bar plot for top 5 cities
    plt.bar(top_cities['City'], top_cities['Crime Rate'])
    plt.xticks(rotation=45)
    plt.ylabel('Average Crime Rate')
    plt.title('Top 5 Cities by Crime Rate')

    # -------------------------------
    # Subplot 5: Crime Rate by City Size Category (if column exists)
    # -------------------------------
    plt.subplot(2, 3, 5)
    
    if 'City_Size' in df.columns:
        # Average crime rate for each city size category (e.g., metro, small, medium)
        size_crime = df.groupby('City_Size')['Crime Rate'].mean()
        
        # Bar plot to compare across city sizes
        plt.bar(size_crime.index, size_crime.values)
        plt.xlabel('City Size Category')
        plt.ylabel('Average Crime Rate')
        plt.title('Crime Rate by City Size')

    # -------------------------------
    # Subplot 6: Year-over-Year % Change in Crime Rate (if column exists)
    # -------------------------------
    plt.subplot(2, 3, 6)
    
    if 'Crime_Rate_pct_change' in df.columns:
        # Average yearly percentage change across all cities
        yearly_change = df.groupby('Year')['Crime_Rate_pct_change'].mean().dropna()
        
        # Line plot showing trend in change rate
        plt.plot(yearly_change.index, yearly_change.values, marker='o', color='red')
        plt.xlabel('Year')
        plt.ylabel('Average % Change in Crime Rate')
        plt.title('Year-over-Year Crime Rate Changes')
        plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)

    # Adjust spacing between plots
    plt.tight_layout()
    plt.show()

# Run the analysis function
analyze_socioeconomic_trends(df_features)

Output:

The output is a 2×3 grid of charts (rendered inline in the notebook): population vs. crime rate, the overall yearly crime-rate trend, average crime rate by type, the top 5 cities by crime rate, crime rate by city size, and year-over-year % change in crime rate.

Step 8: Comprehensive Model Evaluation and Comparison

In this final step, we evaluate and compare the performance of both regression and classification models. This allows us to make informed decisions about which models are best suited for predicting crime rates or classifying risk levels.

What we perform here:

  • Tabular comparison of regression models using RMSE, MAE, and R² scores.
  • Tabular comparison of classification models using accuracy.
  • Bar chart visualization of error and accuracy metrics for quick visual comparison.
  • Return metrics tables for further analysis or reporting.

Here is the code:
def evaluate_and_compare_models(regression_results, classification_results):
    """
    Create comprehensive model evaluation and comparison
    for both regression and classification models.
    """
    
    # -----------------------------------------
    # Regression Model Metrics Table
    # -----------------------------------------
    # Convert regression metrics to DataFrame
    reg_metrics_df = pd.DataFrame({
        model_name: {
            'RMSE': results['RMSE'],
            'MAE': results['MAE'],
            'R²': results['R2']
        }
        for model_name, results in regression_results.items()
    }).T  # Transpose to have model names as rows
    
    print("Regression Model Performance:")
    print(reg_metrics_df.round(4))  # Display metrics with 4 decimal precision

    # -----------------------------------------
    # Classification Model Metrics Table
    # -----------------------------------------
    # Convert classification accuracy to DataFrame
    class_metrics_df = pd.DataFrame({
        model_name: {
            'Accuracy': results['accuracy']
        }
        for model_name, results in classification_results.items()
    }).T
    
    print("\nClassification Model Performance:")
    print(class_metrics_df.round(4))  # Display accuracy with 4 decimal precision

    # -----------------------------------------
    # Visualization: Regression and Classification Metrics
    # -----------------------------------------
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))  # Create 1x2 grid for plots

    # Bar plot for regression metrics (RMSE and MAE)
    reg_metrics_df[['RMSE', 'MAE']].plot(kind='bar', ax=axes[0])
    axes[0].set_title('Regression Model Performance (Lower is Better)')
    axes[0].set_ylabel('Error Metrics')
    axes[0].tick_params(axis='x', rotation=45)  # Rotate x-axis labels for readability

    # Bar plot for classification model accuracy
    class_metrics_df['Accuracy'].plot(kind='bar', ax=axes[1], color='green')
    axes[1].set_title('Classification Model Accuracy (Higher is Better)')
    axes[1].set_ylabel('Accuracy')
    axes[1].tick_params(axis='x', rotation=45)

    plt.tight_layout()  # Adjust layout to prevent overlap
    plt.show()

    # Return DataFrames for future use
    return reg_metrics_df, class_metrics_df

# Call the evaluation function
reg_comparison, class_comparison = evaluate_and_compare_models(regression_results, classification_results)

Output:

Regression Model Performance:

Model | RMSE | MAE | R²
Linear Regression | 4.1573 | 1.6639 | 0.9441
Ridge Regression | 4.2200 | 1.6795 | 0.9424
Lasso Regression | 4.6333 | 1.8087 | 0.9305
Random Forest | 4.9341 | 1.9306 | 0.9212
XGBoost | 4.7356 | 1.8450 | 0.9274

 

Classification Model Performance:

Model | Accuracy
Logistic Regression | 0.2707
Random Forest | 0.4737
XGBoost Classifier | 0.4586

In this project, our goal was to:

  1. Predict crime rates for each city and crime type (regression task)
  2. Classify the type of crime (classification task)

We trained and tested different machine learning models and compared their performance to find the best ones.

Takeaway:

  • Linear Regression is the best choice for predicting crime rates; it gave the most reliable and accurate results.
  • Random Forest is the best option for classifying crime types. XGBoost is also good, but slightly less accurate.

Step 9: Comprehensive Project Report

In this step, we generate a complete summary report for our Crime Rate Prediction Project.
This final report brings everything together in a readable format.

Here's what we include in the report:

  • A quick overview of the dataset (size, timeline, city, and crime type coverage)
  • Key insights from data exploration (like most dangerous city/type, trends, and correlation)
  • Model performance summary for both regression and classification tasks
  • Best models based on metrics (RMSE and Accuracy)

Here is the code:

def generate_project_report(df, regression_results, classification_results):
    """
    Generate a comprehensive report for the Crime Rate Prediction Project.
    
    The report includes:
    - Dataset summary
    - Key insights and trends
    - Regression and classification model performance
    - Best performing models
    """

    print("=" * 80)
    print("CRIME RATE PREDICTION PROJECT - COMPREHENSIVE REPORT")
    print("=" * 80)

    # Dataset overview: Size, time span, city count, crime types
    print("\n1. DATASET OVERVIEW:")
    print(f"   • Total Records: {len(df):,}")
    print(f"   • Time Period: {df['Year'].min()} - {df['Year'].max()}")
    print(f"   • Number of Cities: {df['City'].nunique()}")
    print(f"   • Crime Types Analyzed: {df['Type'].nunique()}")
    print(f"   • Crime Types: {', '.join(df['Type'].unique())}")

    # Key findings section starts here
    print("\n2. KEY FINDINGS:")

    # Correlation between population and crime rate
    pop_crime_corr = df.groupby('City').agg({
        'Population (in Lakhs) (2011)+': 'first',
        'Crime Rate': 'mean'
    })
    correlation = pop_crime_corr['Population (in Lakhs) (2011)+'].corr(pop_crime_corr['Crime Rate'])
    print(f"   • Population-Crime Rate Correlation: {correlation:.3f}")

    # City with the highest average crime rate
    avg_by_city = df.groupby('City')['Crime Rate'].mean()
    highest_crime_city = avg_by_city.idxmax()
    highest_crime_rate = avg_by_city.max()
    print(f"   • Highest Average Crime Rate City: {highest_crime_city} ({highest_crime_rate:.2f})")

    # Most common (prevalent) crime type
    avg_by_type = df.groupby('Type')['Crime Rate'].mean()
    highest_crime_type = avg_by_type.idxmax()
    highest_type_rate = avg_by_type.max()
    print(f"   • Most Prevalent Crime Type: {highest_crime_type} ({highest_type_rate:.2f})")

    # Trend in crime rate from start to end year
    yearly_change = df.groupby('Year')['Crime Rate'].mean()
    overall_trend = "Increasing" if yearly_change.iloc[-1] > yearly_change.iloc[0] else "Decreasing"
    print(f"   • Overall Crime Trend ({df['Year'].min()}-{df['Year'].max()}): {overall_trend}")

    # Model performance summary
    print("\n3. MODEL PERFORMANCE SUMMARY:")

    # Best regression model based on lowest RMSE
    best_reg_model = min(regression_results.items(), key=lambda x: x[1]['RMSE'])
    print(f"   • Best Regression Model: {best_reg_model[0]}")
    print(f"     - RMSE: {best_reg_model[1]['RMSE']:.4f}")
    print(f"     - R²: {best_reg_model[1]['R2']:.4f}")

    # Best classification model based on highest accuracy
    best_class_model = max(classification_results.items(), key=lambda x: x[1]['accuracy'])
    print(f"   • Best Classification Model: {best_class_model[0]}")
    print(f"     - Accuracy: {best_class_model[1]['accuracy']:.4f}")

# Generate the report using the results from the previous steps
generate_project_report(df_features, regression_results, classification_results)

Output:

Dataset Overview:

Detail | Value
Total Records | 1,520
Time Period | 2014 - 2021
Number of Cities | 19
Crime Types Analyzed | 10
Crime Types | Crime Committed by Juveniles, Crime against SC, Crime against ST, Crime against Senior Citizen, Crime against children, Crime against women, Cyber Crimes, Economic Offences, Kidnapping, Murder

Key Insights from the Data

  • Population vs Crime Rate Correlation: -0.047 (very weak negative relationship)
  • City with Highest Average Crime Rate: Jaipur (31.56)
  • Most Prevalent Crime Type: Crime against women (36.55)
  • Overall Crime Trend (2014–2021): Increasing

Model Performance Summary:

Regression Models (Predicting exact crime rates)

Model | RMSE | R²
Linear Regression | 4.1573 | 0.9441
Ridge Regression | 4.2200 | 0.9424
Lasso Regression | 4.6333 | 0.9305
Random Forest | 4.9341 | 0.9212
XGBoost | 4.7356 | 0.9274

Best Regression Model: Linear Regression (Highest R² and lowest RMSE)

Classification Models (Predicting the crime type)

Model | Accuracy
Logistic Regression | 0.2707
Random Forest | 0.4737
XGBoost Classifier | 0.4586

Best Classification Model: Random Forest (Highest accuracy)
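
To reuse the winning models later without retraining, you can persist them to disk. A minimal sketch using joblib and the result dictionaries built in the earlier steps:

import joblib

# Save the best regressor and classifier identified above
joblib.dump(regression_results['Linear Regression']['model'], 'best_regressor.pkl')
joblib.dump(classification_results['Random Forest']['model'], 'best_classifier.pkl')

# Later: reload and predict on new, identically engineered features
best_regressor = joblib.load('best_regressor.pkl')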

Final Conclusion: What We Learned from the Crime Rate Prediction Project

This crime rate prediction project not only provided valuable insights into crime patterns across Indian cities but also served as a hands-on demonstration of essential data science skills and their real-world application. Here’s what we gained from this comprehensive analysis:

Skills Demonstrated

  • Feature Engineering: Created lag variables, rolling averages, and applied categorical encoding to enrich the dataset and improve model performance.
  • Regression Analysis: Explored multiple regression models to predict exact crime rates, with Linear Regression emerging as the most accurate.
  • Classification Techniques: Built and compared classification models to label cities based on crime severity levels.
  • Spatial Data Visualization: Used interactive maps to visualize city-wise crime rates for better geographical interpretation.
  • Socio-economic Impact Analysis: Explored how factors like population size correlate (or don’t) with crime, aiding policy insights.
  • Model Evaluation: Compared model performance using metrics like RMSE, R², and Accuracy to choose the best-fit models for prediction tasks.

In short, this Crime Rate Prediction Project showcased a full data pipeline, from raw ingestion to insights and forecasting, using scalable and industry-relevant tools, making it a strong technical case study for aspiring data scientists.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1W2yhfks2GibtFilELHL5mT3LfdN8yEqC?usp=sharing

Frequently Asked Questions (FAQs)

1. What kind of data was used in this project?
A Kaggle dataset of 1,520 records covering 19 Indian cities, 10 crime types, and the years 2014-2021, with population figures and per-type crime rates.

2. Why did we use both regression and classification models?
Regression predicts the numeric crime rate, while classification predicts the crime type; together they answer both "how much crime" and "what kind of crime".

3. How accurate were the models?
All regression models achieved R² above 0.92 (best: Linear Regression, RMSE 4.1573). The best classifier, Random Forest, reached about 47% accuracy across 10 crime classes.

4. What insights can policymakers gain from this analysis?
Crime against women was the most prevalent type, Jaipur had the highest average crime rate, the overall trend from 2014 to 2021 was increasing, and population size showed almost no correlation with crime rate.

5. Can this model be used in the future?
Yes. The trained models can be saved (for example, with joblib, as sketched above) and applied to new data prepared with the same feature-engineering steps.

Rohit Sharma

804 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
