Literacy Rate Prediction and Analysis with Python
By Rohit Sharma
Updated on Aug 11, 2025 | 12 min read | 1.65K+ views
Understanding literacy levels is key to driving educational policies and social development. In this project, we focus on Literacy Rate Prediction using district-wise data from India.
We use regression models like Linear Regression, Random Forest, and Gradient Boosting to predict literacy rates based on socio-economic and demographic features. The workflow includes data preprocessing, model training, and evaluation.
To work effectively on the Literacy Rate Prediction project, make sure you're comfortable with the following: basic Python programming, data handling with pandas and NumPy, plotting with matplotlib and seaborn, and the fundamentals of regression modelling with scikit-learn.
If you're new to Python, check out this free upGrad course to boost your skills: Learn Basic Python Programming
To predict district-wise literacy rates, we build regression models on historical census data. The models learn patterns from socio-economic features to estimate the literacy rate in each district.
Explore Python projects perfect for beginners: Sales Data Analysis Project | Customer Churn Prediction Project: From Data to Decisions
Estimated Time to Complete: The Literacy Rate Prediction and Analysis project is estimated to take 3 to 4 hours to complete. The time may vary depending on your familiarity with Python, especially in data cleaning, feature engineering, regression modelling, and performance evaluation.
Here’s how you can build the Literacy Rate Prediction and Analysis project from scratch using Python and machine learning:
1. Load the Literacy Dataset
Import district-level data with features like total population, male/female population, rural/urban stats, and educational indicators.
2. Clean and Preprocess the Data
Handle missing values, drop irrelevant columns, and format data types to prepare it for analysis.
3. Explore and Visualise the Data
Use histograms, box plots, correlation heatmaps, and scatter plots to understand the feature relationships and data distribution.
4. Train Regression Models
Apply algorithms like Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor to predict literacy rates.
5. Evaluate Model Performance
Use metrics like R² score, RMSE, and MAE to assess how accurately each model predicts the literacy rate (a quick example of computing these metrics follows this list).
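If these metrics are new to you, here is a minimal, self-contained sketch of how scikit-learn computes them. The numbers are made up for illustration and are not results from this project:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy example: actual vs. predicted literacy rates for five districts
y_true = np.array([72.0, 85.5, 64.2, 90.1, 78.3])
y_pred = np.array([70.5, 83.0, 66.0, 88.7, 80.0])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in percentage points
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more heavily
r2 = r2_score(y_true, y_pred)                       # share of variance explained (1.0 is perfect)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.3f}")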
Before diving into the project, you'll need the dataset and a few libraries. First, head over to Kaggle to download the district-wise dataset, then import the libraries below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
print("="*60)
print("LITERACY RATE PREDICTION ANALYSIS")
print("="*60)
print("-" * 30)
try:
    df = pd.read_csv('2015_16_Districtwise.csv')
    print("Dataset loaded successfully!")
    print(f"Dataset shape: {df.shape}")
    print(f"Number of districts: {len(df)}")
    print(f"Number of features: {len(df.columns)}")

    # Display basic info about the dataset
    print("\nDataset Overview:")
    print(f" • Columns: {list(df.columns[:5])}... (showing first 5)")
    print(f" • Data types: {df.dtypes.value_counts().to_dict()}")
except FileNotFoundError:
    print("Error: '2015_16_Districtwise.csv' not found. Please check the file path.")
    exit()
Output:
Dataset loaded successfully!
Dataset Overview: ...
Want to explore Data Science further? Find out more here: Handwritten Digit Recognition with CNN Using Python | Weather Forecasting Model Using Machine Learning and Time Series Analysis
This step focuses on preparing the dataset for modelling by removing unnecessary columns, handling missing values, and selecting the most important features for prediction.
# Define the target variable
TARGET = 'OVERALL_LI'
# Define the selected feature columns
FEATURES = [
'TOTPOPULAT', 'P_URB_POP', 'P_SC_POP', 'P_ST_POP', 'TOT_6_10_15',
'TOT_11_13_15', 'SCHTOT', 'SCHTOTG', 'SCHTOTP', 'SCHTOTM',
'ENRTOT', 'ENRTOTG', 'ENRTOTP', 'ENRTOTM', 'TCHTOT', 'TCHTOTG',
'TCHTOTP', 'TCHTOTM', 'CLSTOT'
]
print(f"Target variable: {TARGET}")
print(f"Selected features ({len(FEATURES)}): {FEATURES[:3]}...")
# Check for any missing columns
missing_cols = [col for col in FEATURES + [TARGET] if col not in df.columns]
if missing_cols:
    print(f"Missing columns: {missing_cols}")
    FEATURES = [col for col in FEATURES if col in df.columns]
    print(f"Updated features list: {len(FEATURES)} features available")
# Create a new DataFrame with selected columns
data = df[FEATURES + [TARGET]].copy()
# Analyze missing values
print(f"\nMissing Values Analysis:")
missing_summary = data.isnull().sum()
missing_cols = missing_summary[missing_summary > 0]
if len(missing_cols) > 0:
    print(f"Columns with missing values: {len(missing_cols)}")
    for col, count in missing_cols.items():
        percent = count / len(data) * 100
        print(f" - {col}: {count} missing ({percent:.1f}%)")
else:
    print("No missing values found")
# Fill missing values with median
missing_handled = 0
for col in data.columns:
    if data[col].isnull().any():
        median_val = data[col].median()
        missing_count = data[col].isnull().sum()
        data[col] = data[col].fillna(median_val)
        missing_handled += missing_count
print(f"\nData cleaning completed")
print(f"Total missing values handled: {missing_handled}")
print(f"Final dataset shape: {data.shape}")
# Target variable statistics
print(f"\nTarget Variable ({TARGET}) Statistics:")
target_stats = data[TARGET].describe()
for stat, value in target_stats.items():
    print(f" - {stat.capitalize()}: {value:.2f}%")
Output:
Target variable: OVERALL_LI

Missing Values Analysis:
  - TOTPOPULAT: 46 missing (6.8%)

Data cleaning completed

Target Variable (OVERALL_LI) Statistics: ...
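A quick aside on why the cleaning step fills gaps with the median rather than the mean: population-style columns are heavily right-skewed, so a single very large district can drag the mean far from typical values. A toy illustration (the numbers are invented, not from the dataset):

import numpy as np
import pandas as pd

s = pd.Series([1.2, 0.9, 1.1, np.nan, 25.0])  # one huge-outlier 'district'
print(s.mean())    # ≈ 7.05, dragged up by the outlier
print(s.median())  # ≈ 1.15, robust to it and a safer fill value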
Dive into this project: Customer Purchase Behaviour Analysis Project Using Python
In this step, we explore patterns and relationships between features and the overall literacy rate. Visualisations like heatmaps, bar plots, and scatter plots help us understand which factors influence literacy the most.
# 1. Correlation Heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = data[FEATURES + [TARGET]].corr()
sns.heatmap(correlation_matrix[[TARGET]].sort_values(by=TARGET, ascending=False),
annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation with Overall Literacy Rate')
plt.savefig('eda_correlation_heatmap.png', dpi=300, bbox_inches='tight')
print("EDA Plot 1/4 saved: eda_correlation_heatmap.png")
# 2. Literacy Rate by State
plt.figure(figsize=(12, 10))
# Group by state using the full df (STATNAME is not kept in the trimmed 'data' frame)
state_literacy = df.groupby('STATNAME')[TARGET].mean().sort_values(ascending=False)
sns.barplot(x=state_literacy.values, y=state_literacy.index, palette='viridis')
plt.title('Average Literacy Rate by State')
plt.xlabel('Average Overall Literacy Rate (%)')
plt.ylabel('State')
plt.savefig('eda_literacy_by_state.png', dpi=300, bbox_inches='tight')
print("EDA Plot 2/4 saved: eda_literacy_by_state.png")
# 3. Urban Population vs. Literacy Rate
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='P_URB_POP', y=TARGET, alpha=0.5)
plt.title('Urban Population (%) vs. Literacy Rate')
plt.xlabel('Percentage of Urban Population')
plt.ylabel('Overall Literacy Rate (%)')
plt.savefig('eda_urbanization_vs_literacy.png', dpi=300, bbox_inches='tight')
print("EDA Plot 3/4 saved: eda_urbanization_vs_literacy.png")
# 4. SC/ST Population vs. Literacy Rate
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.scatterplot(data=data, x='P_SC_POP', y=TARGET, ax=axes[0], color='salmon', alpha=0.5)
axes[0].set_title('Scheduled Caste Population (%) vs. Literacy Rate')
axes[0].set_xlabel('Percentage of SC Population')
axes[0].set_ylabel('Overall Literacy Rate (%)')
sns.scatterplot(data=data, x='P_ST_POP', y=TARGET, ax=axes[1], color='skyblue', alpha=0.5)
axes[1].set_title('Scheduled Tribe Population (%) vs. Literacy Rate')
axes[1].set_xlabel('Percentage of ST Population')
axes[1].set_ylabel('')
plt.tight_layout()
plt.savefig('eda_sc_st_vs_literacy.png', dpi=300, bbox_inches='tight')
print("EDA Plot 4/4 saved: eda_sc_st_vs_literacy.png")
Output: the four EDA plots are saved as PNG files (feature correlation heatmap, average literacy rate by state, urban population vs. literacy, and SC/ST population vs. literacy).
Just a quick and easy Python project, perfect for anyone starting out: Complete Airline Passenger Traffic Analysis Project Using Python
In this step, we separate the features and the target variable. We then split the data into training and testing sets and apply feature scaling using StandardScaler to standardise the input values for better model performance.
Here is the code for this step:
# Define our features (X) and target (y)
X = data[FEATURES]
y = data[TARGET]
print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"\nData Split Results:")
print(f" • Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f" • Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f" • Training target range: {y_train.min():.1f}% - {y_train.max():.1f}%")
print(f" • Testing target range: {y_test.min():.1f}% - {y_test.max():.1f}%")
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"\nFeature Scaling Completed:")
print(f" • Scaler fitted on training data")
print(f" • Both training and testing features scaled")
print(f" • Feature means after scaling (training): {np.mean(X_train_scaled, axis=0)[:3].round(3)}... (first 3)")
print(f" • Feature std after scaling (training): {np.std(X_train_scaled, axis=0)[:3].round(3)}... (first 3)")
Output:
Feature matrix shape: (680, 19)
Target vector shape: (680,)

Data Split Results:
  • Training samples: 544 (80.0%)
  • Testing samples: 136 (20.0%)
  ...

Feature Scaling Completed
  ...
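Under the hood, StandardScaler simply applies z = (x - mean) / std to each column, using statistics computed from the training split only. A short sanity-check sketch, assuming X_train and X_train_scaled from the code above:

import numpy as np

# Standardise the first column by hand (StandardScaler uses the population std, ddof=0)
col = X_train.iloc[:, 0].to_numpy(dtype=float)
manual = (col - col.mean()) / col.std()

# Matches the first column produced by the scaler
print(np.allclose(manual, X_train_scaled[:, 0]))  # True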
Take a look at this project: Loan Default Risk Analysis Using Machine Learning Techniques
In this step, we train three regression models (Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor) to predict literacy rates. We evaluate each one with MAE, RMSE, and R² on both the training and test sets to find the best model.
Here is the code for this step:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
models = {
"Linear Regression": LinearRegression(),
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42)
}
results = {}
detailed_results = {}
print("Training models...")
for i, (name, model) in enumerate(models.items(), 1):
    print(f"\nTraining Model {i}/3: {name}")
    model.fit(X_train_scaled, y_train)
    print(" Training completed")

    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    results[name] = test_r2
    detailed_results[name] = {
        'train_mae': train_mae, 'train_mse': train_mse, 'train_r2': train_r2,
        'test_mae': test_mae, 'test_mse': test_mse, 'test_r2': test_r2,
        'predictions': y_test_pred
    }

    print(f" Training Performance:")
    print(f" • MAE: {train_mae:.2f}")
    print(f" • RMSE: {np.sqrt(train_mse):.2f}")
    print(f" • R²: {train_r2:.3f}")
    print(f" Testing Performance:")
    print(f" • MAE: {test_mae:.2f}")
    print(f" • RMSE: {np.sqrt(test_mse):.2f}")
    print(f" • R²: {test_r2:.3f}")

    overfit_indicator = train_r2 - test_r2
    if overfit_indicator > 0.1:
        print(f" Potential overfitting detected (difference: {overfit_indicator:.3f})")
    else:
        print(f" Good generalization (difference: {overfit_indicator:.3f})")
Output:
Training models...

Training Model 1/3: Linear Regression
  Training completed
  Training Performance: ...
  Testing Performance: ...
  Good generalization (difference: 0.093)

Training Model 2/3: Random Forest
  Training completed
  Training Performance: ...
  Testing Performance: ...
  Potential overfitting detected (difference: 0.434)

Training Model 3/3: Gradient Boosting
  Training completed
  Training Performance: ...
  Testing Performance: ...
  Potential overfitting detected (difference: 0.389)
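The train-vs-test R² gap used above is a quick heuristic for overfitting; k-fold cross-validation gives a more stable picture because every training sample is held out exactly once. A minimal sketch, assuming X_train_scaled and y_train from the earlier steps:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)

# 5-fold cross-validated R² on the training data
cv_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='r2')
print(f"CV R² per fold: {cv_scores.round(3)}")
print(f"Mean CV R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")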
In this step, we compare all trained regression models using common evaluation metrics: R² Score, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). This helps us identify the best-performing model for literacy rate prediction. We also analyse feature importance for tree-based models.
# Create comparison table
print(f"\nModel Performance Comparison:")
print(f"{'Model':<20} {'Train R²':<10} {'Test R²':<10} {'Test MAE':<10} {'Test RMSE':<10}")
print("-" * 65)
for name in models.keys():
    dr = detailed_results[name]
    rmse = np.sqrt(dr['test_mse'])
    print(f"{name:<20} {dr['train_r2']:<10.3f} {dr['test_r2']:<10.3f} {dr['test_mae']:<10.2f} {rmse:<10.2f}")
# Find the best model based on R² score
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]
best_results = detailed_results[best_model_name]
print(f"\nBEST MODEL: {best_model_name}")
print(f" • R² Score: {results[best_model_name]:.3f}")
print(f" • Mean Absolute Error: {best_results['test_mae']:.2f}%")
print(f" • Root Mean Squared Error: {np.sqrt(best_results['test_mse']):.2f}%")
# Feature importance for tree-based models
if hasattr(best_model, 'feature_importances_'):
    print(f"\nFeature Importance Analysis ({best_model_name}):")
    feature_importance = pd.DataFrame({
        'feature': FEATURES,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    print(" Top 5 Most Important Features:")
    for i, (_, row) in enumerate(feature_importance.head().iterrows(), 1):
        print(f" {i}. {row['feature']}: {row['importance']:.3f}")
Output:
Model Performance Comparison:
Model                Train R²   Test R²    Test MAE   Test RMSE
-----------------------------------------------------------------
Linear Regression    0.467      0.375      5.70       7.50
...

BEST MODEL: Random Forest
  ...

Feature Importance Analysis (Random Forest):
  Top 5 Most Important Features:
    1. P_URB_POP: 0.276
    ...
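If you want to reuse the winning model outside this notebook, you can persist it with joblib. This is an optional extra, not part of the original workflow, and the filenames are arbitrary:

import joblib

# Save both the model and the scaler: new data must be scaled the same way
joblib.dump(best_model, 'literacy_best_model.joblib')
joblib.dump(scaler, 'literacy_scaler.joblib')

# Later, in another script or session:
model = joblib.load('literacy_best_model.joblib')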
Want to spot fraud in transactions? Check this out: Fraud Detection in Transactions with Python: A Machine Learning Project
This step involves creating detailed visualisations to evaluate the performance of the best model. You’ll compare actual vs predicted values, analyse residuals, review model scores, examine feature importance, and explore prediction errors.
# Create comprehensive visualizations
fig = plt.figure(figsize=(15, 10))
# 1. Actual vs Predicted scatter plot for best model
plt.subplot(2, 3, 1)
y_pred_best = best_results['predictions']
plt.scatter(y_test, y_pred_best, alpha=0.6, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='red', lw=2)
plt.title(f'{best_model_name}: Actual vs Predicted', fontsize=12)
plt.xlabel('Actual Literacy Rate (%)')
plt.ylabel('Predicted Literacy Rate (%)')
plt.grid(True, alpha=0.3)
# 2. Residuals plot
plt.subplot(2, 3, 2)
residuals = y_test - y_pred_best
plt.scatter(y_pred_best, residuals, alpha=0.6, color='green')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals Plot')
plt.xlabel('Predicted Literacy Rate (%)')
plt.ylabel('Residuals')
plt.grid(True, alpha=0.3)
# 3. Model comparison bar chart
plt.subplot(2, 3, 3)
model_names = list(results.keys())
r2_scores = list(results.values())
bars = plt.bar(model_names, r2_scores, color=['skyblue', 'lightgreen', 'salmon'])
plt.title('Model Performance Comparison')
plt.ylabel('R² Score')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
# Add value labels on bars
for bar, score in zip(bars, r2_scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom')
# 4. Distribution of target variable
plt.subplot(2, 3, 4)
plt.hist(y, bins=30, alpha=0.7, color='purple', edgecolor='black')
plt.title('Distribution of Literacy Rates')
plt.xlabel('Literacy Rate (%)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
# 5. Feature importance (if available)
plt.subplot(2, 3, 5)
if hasattr(best_model, 'feature_importances_'):
    top_features = feature_importance.head(8)
    plt.barh(top_features['feature'], top_features['importance'], color='orange')
    plt.title(f'Top Features ({best_model_name})')
    plt.xlabel('Importance')
    plt.gca().invert_yaxis()
else:
    plt.text(0.5, 0.5, 'Feature importance\nnot available for\nLinear Regression',
             ha='center', va='center', transform=plt.gca().transAxes, fontsize=12)
    plt.title('Feature Importance')
# 6. Prediction error distribution
plt.subplot(2, 3, 6)
plt.hist(residuals, bins=20, alpha=0.7, color='red', edgecolor='black')
plt.title('Prediction Error Distribution')
plt.xlabel('Prediction Error (%)')
plt.ylabel('Frequency')
plt.axvline(x=0, color='black', linestyle='--', alpha=0.7)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('literacy_prediction_comprehensive_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
Output: the full evaluation dashboard is saved as literacy_prediction_comprehensive_analysis.png (actual vs. predicted, residuals, model comparison, target distribution, feature importance, and error distribution).
The Literacy Rate Prediction and Analysis project successfully explored how various socio-economic and demographic factors influence literacy rates across Indian districts. After thorough data cleaning, preprocessing, and feature engineering, multiple regression models were trained and compared.
The Random Forest Regressor emerged as the most accurate model, delivering the best R² score. Visualisations like actual vs predicted plots, residuals, and feature importance provided deeper insight into model performance and influential predictors.
Colab Link:
https://colab.research.google.com/drive/1Z9rj7Ta_C1mWw_x5km9lbB6VLuzZ05la?usp=sharing
Frequently Asked Questions

1. What does this project aim to do?
This project aims to analyse and predict the overall literacy rate across Indian districts using demographic and population-based features from the 2015–16 dataset.

2. Which models were used?
Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor were applied and compared based on their performance using R² score and RMSE.

3. Which features influence literacy the most?
Urban population percentage, female population percentage, and SC/ST population percentages showed strong correlations with literacy levels.

4. How was model performance evaluated?
The models were evaluated using R² score and Root Mean Squared Error (RMSE) on the test dataset to check prediction accuracy and reliability.

5. Can the models forecast future literacy trends?
Yes, with updated and consistent data, the trained models can help forecast literacy trends and support data-driven policy planning.
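As a concrete illustration of that forecasting use, here is a minimal sketch that scores a new district with the fitted scaler and best_model from the steps above. The input values are hypothetical placeholders, not real census figures:

import pandas as pd

# One hypothetical district: every column in FEATURES must be supplied
new_district = pd.DataFrame([{col: 0.0 for col in FEATURES}])
new_district['TOTPOPULAT'] = 1_500_000  # placeholder values for illustration
new_district['P_URB_POP'] = 42.0

# Apply the scaling learned from the training data, then predict
new_scaled = scaler.transform(new_district)
print(f"Predicted literacy rate: {best_model.predict(new_scaled)[0]:.1f}%")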
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...