Crime Rate Prediction by City Using Python and Machine Learning
By Rohit Sharma
Updated on Jul 28, 2025 | 21 min read | 1.18K+ views
Predicting crime rates using data science is both a challenging and impactful task. In this project, we will explore how machine learning and real-world data can help us understand crime trends across different cities.
We'll work with a publicly available dataset of crime statistics for Indian cities (available on platforms such as Kaggle), using tools such as Pandas, NumPy, and Scikit-learn. By applying regression and classification models, performing feature engineering, and visualizing spatial data, we aim to uncover the social and economic factors that influence crime.
It’s helpful to have some basic knowledge of the following before starting this Crime Rate Prediction project: Python programming, Pandas and NumPy for data handling, and the fundamentals of machine learning (regression and classification).
For the Crime Rate Prediction Project, we used the following tools and libraries to gather, process, model, and visualize the data effectively:
Tool / Library | Purpose
Python | Core programming language for scripting, analysis, and model building
Pandas | Data loading, cleaning, manipulation, and feature engineering
NumPy | Numerical operations and efficient array handling
Matplotlib | Creating basic plots and visual insights into data trends
Seaborn | Advanced statistical visualizations for correlation and distribution plots
Scikit-learn | Building and evaluating regression/classification models
Jupyter/Colab | Interactive coding and experimentation environment
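All of these libraries come pre-installed in Google Colab. If you run the notebook locally instead, you can install them with pip first (a quick sketch; in a Colab or Jupyter cell the leading "!" runs the shell command):

# Only needed outside Google Colab, where these packages are already available
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost plotly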
We’ll use a few straightforward yet powerful techniques to understand and predict crime for the Crime Rate Prediction Project: feature engineering (lag, rolling-average, and log-transformed features), regression models (Linear, Ridge, Lasso, Random Forest, and XGBoost), classification models (Logistic Regression, Random Forest, and XGBoost), and interactive spatial and trend visualizations with Plotly and Matplotlib.
You can complete this Crime Rate Prediction Project in approximately 2 to 3 hours. The difficulty is rated as Moderate, ideal for learners with basic Python knowledge and an interest in applying machine learning to real-world crime data.
Let’s start building the project from scratch. We'll go step-by-step through downloading the dataset, loading it into Google Colab, engineering features, training regression and classification models, visualizing spatial and socioeconomic trends, comparing model performance, and generating a final project report.
Without any further delay, let’s get started!
To build our Crime Rate Prediction Project, we'll use a publicly available dataset of crime statistics for 19 Indian cities from 2014 to 2021, such as the crime datasets hosted on Kaggle. It contains real-world crime rates by city, year, and crime type, helping us analyze patterns, build predictive models, and explore how social and economic factors influence crime rates.
Download the dataset as a CSV file and save it to your computer.
Now that you’ve downloaded the dataset, let’s move on to the next step: uploading and loading it into Google Colab.
Now that you have downloaded the dataset, upload the file to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
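Alternatively, if the CSV file already lives in your Google Drive, you can mount the drive instead of uploading it each session. A minimal sketch (the file path is just an example; adjust it to your own folder):

from google.colab import drive
drive.mount('/content/drive')  # Authorize access when prompted
# Example path only - point this at wherever you saved the file in your Drive
# df = pd.read_csv('/content/drive/MyDrive/new_dataset.csv')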
Once uploaded, import the required libraries and use the following Python code to read and check the data:
# Importing necessary libraries for data manipulation, visualization, and warning control
import pandas as pd # For data loading and manipulation (DataFrames)
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For basic visualizations
import seaborn as sns # For advanced statistical visualizations
import plotly.express as px # For interactive plots
import plotly.graph_objects as go # For more customized Plotly graphs
from plotly.subplots import make_subplots # For creating multi-plot layouts
import warnings # To suppress warning messages
warnings.filterwarnings('ignore') # Ignoring all warnings for clean output
# Step 1: Load the dataset
df = pd.read_csv('new_dataset.csv') # Replace 'new_dataset.csv' with the path to your dataset file
# Step 2: Get general information about the dataset
print("Dataset Info:")
print(df.info()) # Displays column names, data types, and non-null values
# Step 3: Preview the first few rows to understand structure
print("\nFirst few rows:")
print(df.head()) # Shows first 5 records in the dataset
# Step 4: Check for missing values in each column
print("\nMissing values:")
print(df.isnull().sum()) # Important for data cleaning steps
Output:
Dataset Overview
Column Name | Non-Null Count | Data Type
Year | 1520 | int64
City | 1520 | object
Population (in Lakhs) (2011)+ | 1520 | float64
Type | 1520 | object
Crime Rate | 1520 | float64
Total Entries: 1520
Total Columns: 5
Sample Records:
Year | City | Population (Lakhs) | Type | Crime Rate
2014 | Ahmedabad | 63.5 | Murder | 1.291339
2015 | Ahmedabad | 63.5 | Murder | 1.480315
2016 | Ahmedabad | 63.5 | Murder | 1.622047
2017 | Ahmedabad | 63.5 | Murder | 1.417323
2018 | Ahmedabad | 63.5 | Murder | 1.543307
Missing Values Check:
Column | Missing Values
Year | 0
City | 0
Population (in Lakhs) (2011)+ | 0
Type | 0
Crime Rate | 0
Conclusion: No missing data was detected; the dataset is clean and ready for analysis.
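Our data needs no cleaning, but if the check above had reported missing values, a typical (purely hypothetical) cleanup could look like this:

# Hypothetical cleanup - not required for this dataset
# df['Crime Rate'] = df['Crime Rate'].fillna(df['Crime Rate'].median())  # Fill numeric gaps with the median
# df['Type'] = df['Type'].fillna(df['Type'].mode()[0])                   # Fill categorical gaps with the mode
# df = df.dropna()                                                       # Or drop any remaining incomplete rows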
Before feeding data into machine learning models, we need to transform raw values into meaningful features that can help the model learn better. This step is called feature engineering.
In this step, we normalize the year, log-transform population and crime rate to reduce skew, categorize cities by population size, create lag, year-over-year change, and 3-year rolling-average features, and label-encode the categorical City and Type columns.
Here is the code:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# Function to generate useful features for ML model
def create_features(df):
"""
Create engineered features for crime prediction
"""
df = df.copy() # Work on a copy to avoid changing the original
# Normalize the 'Year' to a 0–1 range
df['Year_normalized'] = (df['Year'] - df['Year'].min()) / (df['Year'].max() - df['Year'].min())
# Population Features
df['Population_log'] = np.log1p(df['Population (in Lakhs) (2011)+']) # Log transformation
df['Population_scaled'] = df['Population (in Lakhs) (2011)+'] / 100 # Rescale for better model behavior
# Crime Rate Feature Transformation
df['Crime_Rate_log'] = np.log1p(df['Crime Rate']) # Helps reduce skewness
# City Categorization based on Population Size
city_pop_map = df.groupby('City')['Population (in Lakhs) (2011)+'].first().to_dict()
df['City_Population_Category'] = df['City'].map(city_pop_map)
df['City_Size'] = pd.cut(
df['City_Population_Category'],
bins=[0, 25, 50, 100, float('inf')],
labels=['Small', 'Medium', 'Large', 'Mega']
)
# Year-over-Year Change in Crime Rate
df = df.sort_values(['City', 'Type', 'Year']) # Important for calculating lag
df['Crime_Rate_lag1'] = df.groupby(['City', 'Type'])['Crime Rate'].shift(1)
df['Crime_Rate_change'] = df['Crime Rate'] - df['Crime_Rate_lag1']
df['Crime_Rate_pct_change'] = df['Crime_Rate_change'] / (df['Crime_Rate_lag1'] + 1e-8)
# Rolling Average of Crime Rate (3-year window)
df['Crime_Rate_rolling_3yr'] = (
df.groupby(['City', 'Type'])['Crime Rate']
.rolling(window=3, min_periods=1)
.mean()
.reset_index(level=[0,1], drop=True)
)
# Encode Categorical Variables to Numeric
le_city = LabelEncoder()
le_type = LabelEncoder()
df['City_encoded'] = le_city.fit_transform(df['City'])
df['Type_encoded'] = le_type.fit_transform(df['Type'])
return df, le_city, le_type
# Apply Feature Engineering
df_features, city_encoder, type_encoder = create_features(df)
print(" Feature engineering completed!")
print("New features shape:", df_features.shape)
Output:
Feature engineering completed!
New features shape: (1520, 17)
This means our dataset now has 17 columns: the original 5 plus 12 newly engineered features, all ready for modeling.
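To get a feel for the new columns, you can optionally preview a few of the engineered features and the city-size buckets:

# Optional check: peek at a handful of the engineered columns
print(df_features[['City', 'Year', 'Crime Rate', 'Crime_Rate_lag1',
                   'Crime_Rate_rolling_3yr', 'City_Size']].head())
# How many records fall into each city-size category
print(df_features['City_Size'].value_counts())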
In this step, we use regression algorithms to predict the crime rate for each city and crime type based on various factors such as population, year, and past crime trends.
What this step includes: selecting the predictive features, dropping rows with missing lag values, splitting the data 80/20 into training and test sets, scaling the features for the linear models, training five regression models (Linear, Ridge, Lasso, Random Forest, and XGBoost), and evaluating each with RMSE, MAE, and R².
Here is the code:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
def regression_analysis(df):
"""
Perform regression analysis for crime rate prediction.
Trains multiple models and evaluates their performance.
"""
# 1. Select features and target variable
feature_cols = [
'Year',
'Population (in Lakhs) (2011)+',
'City_encoded',
'Type_encoded',
'Population_log',
'Year_normalized',
'Crime_Rate_lag1',
'Crime_Rate_rolling_3yr'
]
# 2. Drop rows with missing values (due to lag features)
df_clean = df.dropna(subset=feature_cols + ['Crime Rate'])
# 3. Define features (X) and target (y)
X = df_clean[feature_cols]
y = df_clean['Crime Rate']
# 4. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=df_clean['Type']
)
# 5. Normalize features for models that need scaled input
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Define regression models to evaluate
regression_models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Lasso Regression': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42)
}
regression_results = {}
# 7. Train and evaluate each model
for name, model in regression_models.items():
print(f"\nTraining {name}...")
# Use scaled features for linear models
if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression']:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# 8. Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
regression_results[name] = {
'model': model,
'predictions': y_pred,
'MSE': mse,
'RMSE': rmse,
'MAE': mae,
'R2': r2
}
# Print model performance
print(f"{name} - RMSE: {rmse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
# 9. Return results and test data for further analysis or plotting
return regression_results, X_test, y_test, scaler
# Run the regression analysis
regression_results, X_test_reg, y_test_reg, scaler = regression_analysis(df_features)
Output:
We trained five different regression models to predict crime rates across Indian cities and crime types using engineered features. Here's how each model performed on the test data:
Model | RMSE | MAE | R² Score
Linear Regression | 4.1573 | 1.6639 | 0.9441
Ridge Regression | 4.2200 | 1.6795 | 0.9424
Lasso Regression | 4.6333 | 1.8087 | 0.9305
Random Forest | 4.9341 | 1.9306 | 0.9212
XGBoost Regressor | 4.7356 | 1.8450 | 0.9274
Insights: Linear Regression delivered the best results (lowest RMSE of 4.1573 and highest R² of 0.9441), with Ridge Regression close behind. The tree-based models (Random Forest and XGBoost) still explained over 92% of the variance but did not outperform the simpler linear models on this feature set.
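If you prefer to pick the winner programmatically rather than reading it off the table, the regression_results dictionary returned above already stores every metric. A small sketch:

# Rank models by RMSE (lower is better) and report the best one
ranked = sorted(regression_results.items(), key=lambda kv: kv[1]['RMSE'])
best_name, best_metrics = ranked[0]
print(f"Best regression model: {best_name} "
      f"(RMSE={best_metrics['RMSE']:.4f}, R²={best_metrics['R2']:.4f})")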
In this step, we aim to predict the category of crime (e.g., Murder, Cyber Crimes, etc.) using supervised machine learning classification techniques.
What we do in this step: select the predictive features (including the crime rate itself), drop rows with missing lag values, split the data 80/20 with stratification on the crime type, scale the features for Logistic Regression, train three classifiers (Logistic Regression, Random Forest, and XGBoost), and evaluate each with accuracy and a full classification report.
Here is the code:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost as xgb
def classification_analysis(df):
"""
Perform classification analysis for predicting crime type
"""
# Select important features for classification
feature_cols = [
'Year', 'Population (in Lakhs) (2011)+', 'City_encoded', 'Crime Rate',
'Population_log', 'Year_normalized', 'Crime_Rate_lag1', 'Crime_Rate_rolling_3yr'
]
# Drop rows with NaN values due to lag/rolling features
df_clean = df.dropna(subset=feature_cols + ['Type'])
# Define feature matrix (X) and target variable (y)
X = df_clean[feature_cols]
y = df_clean['Type_encoded'] # Use encoded labels for classification
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Normalize features (important for some models like logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define classification models
classification_models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'XGBoost Classifier': xgb.XGBClassifier(n_estimators=100, random_state=42)
}
classification_results = {}
# Train and evaluate each model
for name, model in classification_models.items():
print(f"\nTraining {name}...")
# Fit on scaled data for logistic regression, original for others
if name == 'Logistic Regression':
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Store results
classification_results[name] = {
'model': model,
'predictions': y_pred,
'probabilities': y_pred_proba,
'accuracy': accuracy
}
# Print classification performance report
print(f"{name} Accuracy: {accuracy:.4f}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
return classification_results, X_test, y_test, scaler
# Run classification analysis
classification_results, X_test_class, y_test_class, class_scaler = classification_analysis(df_features)
Output:
Training Logistic Regression...
Logistic Regression Accuracy: 0.2707
Classification Report:
precision recall f1-score support
0 0.23 0.19 0.21 26
1 0.21 0.19 0.20 26
2 0.34 0.96 0.50 27
3 0.16 0.11 0.13 27
4 0.11 0.07 0.09 27
5 0.45 0.54 0.49 26
6 0.30 0.26 0.28 27
7 0.26 0.19 0.22 27
8 0.10 0.08 0.09 26
9 0.25 0.11 0.15 27
accuracy 0.27 266
macro avg 0.24 0.27 0.24 266
weighted avg 0.24 0.27 0.24 266
Training Random Forest...
Random Forest Accuracy: 0.4737
Classification Report:
precision recall f1-score support
0 0.62 0.38 0.48 26
1 0.60 0.69 0.64 26
2 0.75 0.89 0.81 27
3 0.43 0.33 0.38 27
4 0.33 0.37 0.35 27
5 0.46 0.42 0.44 26
6 0.32 0.22 0.26 27
7 0.37 0.48 0.42 27
8 0.19 0.19 0.19 26
9 0.61 0.74 0.67 27
accuracy 0.47 266
macro avg 0.47 0.47 0.46 266
weighted avg 0.47 0.47 0.46 266
Training XGBoost Classifier...
XGBoost Classifier Accuracy: 0.4586
Classification Report:
precision recall f1-score support
0 0.32 0.23 0.27 26
1 0.65 0.65 0.65 26
2 0.75 0.89 0.81 27
3 0.25 0.22 0.24 27
4 0.33 0.37 0.35 27
5 0.50 0.50 0.50 26
6 0.26 0.22 0.24 27
7 0.38 0.41 0.39 27
8 0.38 0.38 0.38 26
9 0.61 0.70 0.66 27
accuracy 0.46 266
macro avg 0.44 0.46 0.45 266
weighted avg 0.44 0.46 0.45 266
Summary of the Output:
Model | Accuracy | Best Predicted Classes | Poorly Predicted Classes | Notes
Logistic Regression | 0.2707 | Class 2, Class 5 | Most other classes | Performs poorly overall; only Class 2 shows strong recall (0.96).
Random Forest | 0.4737 | Classes 1, 2, 9 | Classes 6, 8 | Balanced performance; better than Logistic Regression.
XGBoost Classifier | 0.4586 | Classes 1, 2, 9 | Classes 3, 6 | Similar to Random Forest, with slightly better performance on Class 8.
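The class numbers above correspond to the crime types encoded by type_encoder earlier. To see that mapping and find out exactly which types the best classifier (Random Forest) confuses, a confusion matrix helps; here is a short optional sketch:

from sklearn.metrics import confusion_matrix
# Recover the crime-type names behind the numeric labels 0-9
for code, name in enumerate(type_encoder.classes_):
    print(code, '->', name)
# Confusion matrix for the Random Forest classifier
rf_preds = classification_results['Random Forest']['predictions']
cm = confusion_matrix(y_test_class, rf_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=type_encoder.classes_, yticklabels=type_encoder.classes_)
plt.xlabel('Predicted type')
plt.ylabel('Actual type')
plt.title('Random Forest - Confusion Matrix')
plt.tight_layout()
plt.show()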
In this section, we explore how crime varies geographically across Indian cities using interactive maps and bar plots. Specifically, we build an animated map of crime rates over the years, a static map of average crime rates per city, and a bar chart of the crime-type distribution across cities.
These visualizations help us detect regional crime patterns and understand spatial trends.
Here is the Code:
# Dictionary containing latitude and longitude for each city
city_coordinates = {
'Ahmedabad': [23.0225, 72.5714],
'Bengaluru': [12.9716, 77.5946],
'Chennai': [13.0827, 80.2707],
'Coimbatore': [11.0168, 76.9558],
'Delhi': [28.7041, 77.1025],
'Ghaziabad': [28.6692, 77.4538],
'Hyderabad': [17.3850, 78.4867],
'Indore': [22.7196, 75.8577],
'Jaipur': [26.9124, 75.7873],
'Kanpur': [26.4499, 80.3319],
'Kochi': [9.9312, 76.2673],
'Kolkata': [22.5726, 88.3639],
'Kozhikode': [11.2588, 75.7804],
'Lucknow': [26.8467, 80.9462],
'Mumbai': [19.0760, 72.8777],
'Nagpur': [21.1458, 79.0882],
'Patna': [25.5941, 85.1376],
'Pune': [18.5204, 73.8567],
'Surat': [21.1702, 72.8311]
}
def create_spatial_visualizations(df):
"""
Generate spatial visualizations of crime data across Indian cities.
"""
# Create a copy of the dataframe to avoid modifying the original
df_spatial = df.copy()
# Add Latitude and Longitude columns using the city coordinates
df_spatial['Latitude'] = df_spatial['City'].map(lambda x: city_coordinates[x][0])
df_spatial['Longitude'] = df_spatial['City'].map(lambda x: city_coordinates[x][1])
# Aggregate crime rate and population by city and year
city_year_agg = df_spatial.groupby(['City', 'Year', 'Latitude', 'Longitude']).agg({
'Crime Rate': 'mean',
'Population (in Lakhs) (2011)+': 'first'
}).reset_index()
# Create an animated scatter map showing crime rate changes over years
fig = px.scatter_mapbox(
city_year_agg,
lat='Latitude',
lon='Longitude',
size='Crime Rate',
color='Crime Rate',
hover_name='City',
hover_data=['Year', 'Population (in Lakhs) (2011)+'],
animation_frame='Year',
title='Crime Rates Across Indian Cities Over Time',
mapbox_style='open-street-map',
height=600,
color_continuous_scale='Reds'
)
fig.update_layout(mapbox_zoom=4, mapbox_center_lat=20, mapbox_center_lon=77)
fig.show()
# Create a static scatter map showing average crime rate per city (choropleth-style)
crime_by_city = df_spatial.groupby('City').agg({
'Crime Rate': 'mean',
'Latitude': 'first',
'Longitude': 'first'
}).reset_index()
fig2 = px.scatter_mapbox(
crime_by_city,
lat='Latitude',
lon='Longitude',
size='Crime Rate',
color='Crime Rate',
hover_name='City',
title='Average Crime Rates by City (2014–2021)',
mapbox_style='open-street-map',
height=600,
color_continuous_scale='Reds'
)
fig2.update_layout(mapbox_zoom=4, mapbox_center_lat=20, mapbox_center_lon=77)
fig2.show()
# Group data to show distribution of crime types in each city
crime_type_city = df_spatial.groupby(['City', 'Type']).agg({
'Crime Rate': 'mean'
}).reset_index()
# Bar plot showing crime type distribution per city
fig3 = px.bar(
crime_type_city,
x='City',
y='Crime Rate',
color='Type',
title='Crime Type Distribution Across Cities',
height=600
)
fig3.update_xaxes(tickangle=45)
fig3.show()
# Call the function to create visualizations
create_spatial_visualizations(df_features)
Output:
Note: The charts above are interactive and dynamic; they are shown here as static images for display purposes. In a real project or dashboard, you can hover, zoom, and filter the data directly on these plots for deeper exploration.
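If you need to share the interactive charts outside the notebook, Plotly figures can be exported. A minimal sketch (assuming you keep a reference to a figure object such as fig from the function above; write_image additionally requires the kaleido package):

# Save an interactive, standalone HTML file that keeps hover and zoom
# fig.write_html('crime_rates_over_time.html')
# Or save a static PNG (requires: pip install -U kaleido)
# fig.write_image('crime_rates_over_time.png')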
In this step, we explore how various socioeconomic factors correlate with crime patterns across cities and time. This visual analysis helps uncover patterns and guide data-driven policy insights.
We perform the following analyses: population vs. average crime rate by city, the overall crime-rate trend over time, average crime rate by crime type, the top 5 cities by crime rate, crime rate by city-size category, and year-over-year percentage change in crime rate.
def analyze_socioeconomic_trends(df):
"""
Analyze socio-economic trends and their impact on crime using visualizations.
"""
# Set the overall figure size
plt.figure(figsize=(15, 10))
# -------------------------------
# Subplot 1: Population vs Crime Rate
# -------------------------------
plt.subplot(2, 3, 1)
# Group data to get one population value and average crime rate per city
city_stats = df.groupby('City').agg({
'Population (in Lakhs) (2011)+': 'first',
'Crime Rate': 'mean'
}).reset_index()
# Scatter plot to see correlation between population and crime rate
plt.scatter(city_stats['Population (in Lakhs) (2011)+'], city_stats['Crime Rate'])
plt.xlabel('Population (in Lakhs)')
plt.ylabel('Average Crime Rate')
plt.title('Population vs Crime Rate')
# Add city names next to their points
for i, row in city_stats.iterrows():
plt.annotate(row['City'], (row['Population (in Lakhs) (2011)+'], row['Crime Rate']),
fontsize=8, alpha=0.7)
# -------------------------------
# Subplot 2: Overall Crime Rate Trend Over Time
# -------------------------------
plt.subplot(2, 3, 2)
# Average crime rate per year across all cities
yearly_trend = df.groupby('Year')['Crime Rate'].mean()
# Line plot to show trend
plt.plot(yearly_trend.index, yearly_trend.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Average Crime Rate')
plt.title('Overall Crime Rate Trend')
# -------------------------------
# Subplot 3: Crime Type Distribution
# -------------------------------
plt.subplot(2, 3, 3)
# Average crime rate per crime type
crime_type_avg = df.groupby('Type')['Crime Rate'].mean().sort_values(ascending=False)
# Bar plot to compare average crime rates across types
plt.bar(range(len(crime_type_avg)), crime_type_avg.values)
plt.xticks(range(len(crime_type_avg)), crime_type_avg.index, rotation=45)
plt.ylabel('Average Crime Rate')
plt.title('Average Crime Rate by Type')
# -------------------------------
# Subplot 4: Top 5 Cities by Crime Rate
# -------------------------------
plt.subplot(2, 3, 4)
# Select top 5 cities based on average crime rate
top_cities = city_stats.nlargest(5, 'Crime Rate')
# Bar plot for top 5 cities
plt.bar(top_cities['City'], top_cities['Crime Rate'])
plt.xticks(rotation=45)
plt.ylabel('Average Crime Rate')
plt.title('Top 5 Cities by Crime Rate')
# -------------------------------
# Subplot 5: Crime Rate by City Size Category (if column exists)
# -------------------------------
plt.subplot(2, 3, 5)
if 'City_Size' in df.columns:
# Average crime rate for each city size category (e.g., metro, small, medium)
size_crime = df.groupby('City_Size')['Crime Rate'].mean()
# Bar plot to compare across city sizes
plt.bar(size_crime.index, size_crime.values)
plt.xlabel('City Size Category')
plt.ylabel('Average Crime Rate')
plt.title('Crime Rate by City Size')
# -------------------------------
# Subplot 6: Year-over-Year % Change in Crime Rate (if column exists)
# -------------------------------
plt.subplot(2, 3, 6)
if 'Crime_Rate_pct_change' in df.columns:
# Average yearly percentage change across all cities
yearly_change = df.groupby('Year')['Crime_Rate_pct_change'].mean().dropna()
# Line plot showing trend in change rate
plt.plot(yearly_change.index, yearly_change.values, marker='o', color='red')
plt.xlabel('Year')
plt.ylabel('Average % Change in Crime Rate')
plt.title('Year-over-Year Crime Rate Changes')
plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
# Adjust spacing between plots
plt.tight_layout()
plt.show()
# Run the analysis function
analyze_socioeconomic_trends(df_features)
Output:
In this final step, we evaluate and compare the performance of both regression and classification models. This allows us to make informed decisions about which models are best suited for predicting crime rates or classifying risk levels.
What we perform here: tabulating the regression metrics (RMSE, MAE, R²) and classification accuracies into comparison tables, then plotting them side by side so the best-performing models stand out at a glance.
def evaluate_and_compare_models(regression_results, classification_results):
"""
Create comprehensive model evaluation and comparison
for both regression and classification models.
"""
# -----------------------------------------
# Regression Model Metrics Table
# -----------------------------------------
# Convert regression metrics to DataFrame
reg_metrics_df = pd.DataFrame({
model_name: {
'RMSE': results['RMSE'],
'MAE': results['MAE'],
'R²': results['R2']
}
for model_name, results in regression_results.items()
}).T # Transpose to have model names as rows
print("Regression Model Performance:")
print(reg_metrics_df.round(4)) # Display metrics with 4 decimal precision
# -----------------------------------------
# Classification Model Metrics Table
# -----------------------------------------
# Convert classification accuracy to DataFrame
class_metrics_df = pd.DataFrame({
model_name: {
'Accuracy': results['accuracy']
}
for model_name, results in classification_results.items()
}).T
print("\nClassification Model Performance:")
print(class_metrics_df.round(4)) # Display accuracy with 4 decimal precision
# -----------------------------------------
# Visualization: Regression and Classification Metrics
# -----------------------------------------
fig, axes = plt.subplots(1, 2, figsize=(15, 6)) # Create 1x2 grid for plots
# Bar plot for regression metrics (RMSE and MAE)
reg_metrics_df[['RMSE', 'MAE']].plot(kind='bar', ax=axes[0])
axes[0].set_title('Regression Model Performance (Lower is Better)')
axes[0].set_ylabel('Error Metrics')
axes[0].tick_params(axis='x', rotation=45) # Rotate x-axis labels for readability
# Bar plot for classification model accuracy
class_metrics_df['Accuracy'].plot(kind='bar', ax=axes[1], color='green')
axes[1].set_title('Classification Model Accuracy (Higher is Better)')
axes[1].set_ylabel('Accuracy')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()
# Return DataFrames for future use
return reg_metrics_df, class_metrics_df
# Call the evaluation function
reg_comparison, class_comparison = evaluate_and_compare_models(regression_results, classification_results)
Output:
Regression Model Performance:
Model | RMSE | MAE | R²
Linear Regression | 4.1573 | 1.6639 | 0.9441
Ridge Regression | 4.2200 | 1.6795 | 0.9424
Lasso Regression | 4.6333 | 1.8087 | 0.9305
Random Forest | 4.9341 | 1.9306 | 0.9212
XGBoost | 4.7356 | 1.8450 | 0.9274
Classification Model Performance:
Model | Accuracy
Logistic Regression | 0.2707
Random Forest | 0.4737
XGBoost Classifier | 0.4586
In this project, our goal was to predict city-level crime rates with regression models, classify the type of crime from the same features, and explore how population, geography, and time relate to crime across Indian cities.
We trained and tested different machine learning models and compared their performance to find the best ones.
Takeaway: The engineered features support accurate crime-rate regression (every model reached an R² above 0.92, with Linear Regression on top), while predicting the crime type is far harder; the best classifier, Random Forest, reached only about 47% accuracy across the ten classes.
In this step, we generate a complete summary report for our Crime Rate Prediction Project.
This final report brings everything together in a readable format.
Here's what we include in the report: a dataset summary, key insights and trends, regression and classification model performance, and the best-performing models.
Here is the code:
def generate_project_report(df, regression_results, classification_results):
"""
Generate a comprehensive report for the Crime Rate Prediction Project.
The report includes:
- Dataset summary
- Key insights and trends
- Regression and classification model performance
- Best performing models
"""
print("=" * 80)
print("CRIME RATE PREDICTION PROJECT - COMPREHENSIVE REPORT")
print("=" * 80)
# Dataset overview: Size, time span, city count, crime types
print("\n1. DATASET OVERVIEW:")
print(f" • Total Records: {len(df):,}")
print(f" • Time Period: {df['Year'].min()} - {df['Year'].max()}")
print(f" • Number of Cities: {df['City'].nunique()}")
print(f" • Crime Types Analyzed: {df['Type'].nunique()}")
print(f" • Crime Types: {', '.join(df['Type'].unique())}")
# Key findings section starts here
print("\n2. KEY FINDINGS:")
# Correlation between population and crime rate
pop_crime_corr = df.groupby('City').agg({
'Population (in Lakhs) (2011)+': 'first',
'Crime Rate': 'mean'
})
correlation = pop_crime_corr['Population (in Lakhs) (2011)+'].corr(pop_crime_corr['Crime Rate'])
print(f" • Population-Crime Rate Correlation: {correlation:.3f}")
# City with the highest average crime rate
avg_by_city = df.groupby('City')['Crime Rate'].mean()
highest_crime_city = avg_by_city.idxmax()
highest_crime_rate = avg_by_city.max()
print(f" • Highest Average Crime Rate City: {highest_crime_city} ({highest_crime_rate:.2f})")
# Most common (prevalent) crime type
avg_by_type = df.groupby('Type')['Crime Rate'].mean()
highest_crime_type = avg_by_type.idxmax()
highest_type_rate = avg_by_type.max()
print(f" • Most Prevalent Crime Type: {highest_crime_type} ({highest_type_rate:.2f})")
# Trend in crime rate from start to end year
yearly_change = df.groupby('Year')['Crime Rate'].mean()
overall_trend = "Increasing" if yearly_change.iloc[-1] > yearly_change.iloc[0] else "Decreasing"
print(f" • Overall Crime Trend ({df['Year'].min()}-{df['Year'].max()}): {overall_trend}")
# Model performance summary
print("\n3. MODEL PERFORMANCE SUMMARY:")
# Best regression model based on lowest RMSE
best_reg_model = min(regression_results.items(), key=lambda x: x[1]['RMSE'])
print(f" • Best Regression Model: {best_reg_model[0]}")
print(f" - RMSE: {best_reg_model[1]['RMSE']:.4f}")
print(f" - R²: {best_reg_model[1]['R2']:.4f}")
# Best classification model based on highest accuracy
best_class_model = max(classification_results.items(), key=lambda x: x[1]['accuracy'])
print(f" • Best Classification Model: {best_class_model[0]}")
print(f" - Accuracy: {best_class_model[1]['accuracy']:.4f}")
Output:
Dataset Overview:
Detail | Value
Total Records | 1,520
Time Period | 2014 - 2021
Number of Cities | 19
Crime Types Analyzed | 10
Crime Types | Crime Committed by Juveniles, Crime against SC, Crime against ST, Crime against Senior Citizen, Crime against children, Crime against women, Cyber Crimes, Economic Offences, Kidnapping, Murder
Key Insights from the Data: the report prints the population-crime rate correlation, the city with the highest average crime rate, the most prevalent crime type, and whether the overall crime trend from 2014 to 2021 is increasing or decreasing.
Model Performance Summary:
Regression Models (Predicting exact crime rates)
Model | RMSE | R²
Linear Regression | 4.1573 | 0.9441
Ridge Regression | 4.2200 | 0.9424
Lasso Regression | 4.6333 | 0.9305
Random Forest | 4.9341 | 0.9212
XGBoost | 4.7356 | 0.9274
Best Regression Model: Linear Regression (Highest R² and lowest RMSE)
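To reuse the winning Linear Regression model on new data, remember that it was trained on scaled inputs, so any new feature row must pass through the same StandardScaler first. A small sketch using one row of the held-out test set:

# Predict the crime rate for one held-out example with the best regression model
best_reg = regression_results['Linear Regression']['model']
sample = X_test_reg.iloc[[0]]               # double brackets keep it 2-D for scikit-learn
sample_scaled = scaler.transform(sample)    # the same scaler fitted inside regression_analysis()
predicted_rate = best_reg.predict(sample_scaled)[0]
print(f"Predicted crime rate: {predicted_rate:.3f}, actual: {y_test_reg.iloc[0]:.3f}")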
Classification Models (Predicting the crime type)
Model | Accuracy
Logistic Regression | 0.2707
Random Forest | 0.4737
XGBoost Classifier | 0.4586
Best Classification Model: Random Forest (Highest accuracy)
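Finally, if you want to reuse the best models later without retraining, you can persist them with joblib (a minimal sketch; the file names are just examples):

import joblib
# Save the best regression and classification models, plus the scaler the linear model needs
joblib.dump(regression_results['Linear Regression']['model'], 'best_regression_model.pkl')
joblib.dump(classification_results['Random Forest']['model'], 'best_classification_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
# Later, load them back with joblib.load('best_regression_model.pkl')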
This crime rate prediction project not only provided valuable insights into crime patterns across Indian cities but also served as a hands-on demonstration of essential data science skills and their real-world application. Here’s what we gained from this comprehensive analysis:
Skills Demonstrated: data loading and cleaning with Pandas, feature engineering (lag, rolling-average, and log-transformed features), regression and classification modeling with Scikit-learn and XGBoost, model evaluation and comparison, and interactive geospatial visualization with Plotly.
In short, this Crime Rate Prediction Project showcased a full data pipeline, from raw ingestion to insights and forecasting, using scalable and industry-relevant tools, making it a strong technical case study for aspiring data scientists.
Colab Link:
https://colab.research.google.com/drive/1W2yhfks2GibtFilELHL5mT3LfdN8yEqC?usp=sharing