Titanic Dataset - Survival Prediction Model Using Machine Learning
By Rohit Sharma
Updated on Jul 30, 2025 | 9 min read | 1.46K+ views
The Titanic dataset is a classic in machine learning. It contains details of passengers aboard the Titanic, such as age, gender, class, fare, and more. The primary objective is to build a classification model that predicts whether a passenger will survive or not.
In this project, we will preprocess the dataset, handle missing values, and apply classification algorithms to identify which factors most influenced survival. We will then evaluate and compare multiple models to find the most effective one.
Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.
It is better to have at least some background in:
- Python programming
- Exploratory Data Analysis (EDA)
- Basic machine learning concepts
For this project, the following tools and libraries will be used for the Titanic dataset project:
| Tool/Library | Purpose |
| --- | --- |
| Python | Main programming language used to build and run the model |
| Pandas | Data loading, cleaning, and manipulation |
| NumPy | Numerical operations and array handling |
| Matplotlib & Seaborn | Data visualization and pattern analysis |
| Scikit-learn (sklearn) | Preprocessing, model training, and evaluation |
| Google Colab | Interactive environment to write and execute code |
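All of these libraries typically come pre-installed in Google Colab. If you want to confirm your session is ready before starting, a quick sanity check (a minimal sketch; exact versions are not critical) looks like this:

# Quick check that the required libraries are available in your Colab session
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)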
Below are the models that we will be utilizing:
| Model | Why It’s Used |
| --- | --- |
| Logistic Regression | Simple, interpretable model ideal for binary classification like survival prediction |
| K-Nearest Neighbors (KNN) | Easy-to-understand, distance-based algorithm to classify survival based on similarity |
| Decision Tree | Helps visualize decision-making based on key features like age, class, and gender |
| Random Forest | An ensemble method that reduces overfitting and improves prediction accuracy |
On average, the project will take about 1 to 2 hours to complete. The duration may vary depending on your familiarity with Python, EDA (Exploratory Data Analysis), and ML concepts. It is best suited for beginners.
Let’s start building the project from scratch. We will work through the following steps:
- Downloading the Titanic dataset from Kaggle and loading it into Google Colab
- Exploring the data to spot patterns that influence survival
- Cleaning and preprocessing the dataset
- Training four classification models and comparing their accuracy
Without any further delay, let’s start!
To build the Titanic survival prediction model, we will use the dataset available on Kaggle. Search for the Titanic dataset on Kaggle and download the Titanic-Dataset.csv file to your machine.
Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and load the files.
# Open a file picker in Colab and upload Titanic-Dataset.csv
from google.colab import files
uploaded = files.upload()
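If you would rather not re-upload the file every session, an alternative sketch (assuming you have saved the CSV to your Google Drive; the path below is hypothetical, so adjust it to wherever you stored the file) is to mount Drive instead:

# Alternative: mount Google Drive and read the CSV directly
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
# Hypothetical path - adjust to wherever you saved Titanic-Dataset.csv
df = pd.read_csv('/content/drive/MyDrive/Titanic-Dataset.csv')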
Once uploaded, read it using Pandas. Here’s the code to do so:
import pandas as pd
# Load the dataset
df = pd.read_csv('Titanic-Dataset.csv')
# Show the first few rows
df.head()
Output:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
What does the output tell us?
The output tells us that:
- The dataset has 12 columns, including the target variable Survived (0 = did not survive, 1 = survived).
- The features are a mix of numerical values (Age, Fare, SibSp, Parch) and categorical values (Sex, Pclass, Embarked).
- Some columns, such as Cabin, already show missing (NaN) values in the first few rows.
Now we have some idea of the structure and key columns. In the next step, let’s explore the data more closely to spot patterns that might influence survival.
In this step, we will use visualizations and statistics to examine relationships between features like gender, class, and age with survival outcomes.
Use the code below to do this:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("Titanic-Dataset.csv")
# Display first 5 rows
print("First 5 Rows:")
print(df.head())
# Dataset info
print("\nDataset Info:")
print(df.info())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Descriptive statistics
print("\nSummary Statistics:")
print(df.describe())
# Survival count plot
plt.figure(figsize=(5, 4))
sns.countplot(x='Survived', data=df)
plt.title("Survival Count (0 = No, 1 = Yes)")
plt.show()
# Survival by gender
plt.figure(figsize=(6, 4))
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival by Gender")
plt.show()
# Survival by passenger class
plt.figure(figsize=(6, 4))
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title("Survival by Passenger Class")
plt.show()
# Age distribution
plt.figure(figsize=(6, 4))
sns.histplot(df['Age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution of Passengers")
plt.xlabel("Age")
plt.show()
Output:
First 5 Rows:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
Missing Values:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Summary Statistics:
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
What does the output tell us?
The output tells us that:
- The dataset has 891 rows and 12 columns.
- Age has 177 missing values, Cabin has 687, and Embarked has 2.
- About 38.4% of passengers survived (the mean of Survived is 0.3838).
- The average passenger age is about 29.7 years, and fares range from 0 to 512.33.
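To put numbers on the patterns visible in the plots, a short groupby sketch (using the same df) computes survival rates directly:

# Survival rate by gender and by passenger class
# (Survived is 0/1, so the mean is the survival rate)
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby('Pclass')['Survived'].mean())

# Survival rate for each gender within each class
print(df.pivot_table(values='Survived', index='Pclass', columns='Sex', aggfunc='mean'))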
Now that we have identified missing values and key variables, it’s time to clean the dataset and make it model-ready. In this step, we will:
- Drop columns that don’t help prediction (PassengerId, Name, Ticket, Cabin)
- Fill missing Age values with the median and missing Embarked values with the mode
- Encode the categorical columns Sex and Embarked as numbers
Use the code given below to achieve this:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the dataset again (if not already loaded)
df = pd.read_csv('/content/Titanic-Dataset.csv')
# Drop irrelevant columns
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Encode categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
df['Sex'] = le_sex.fit_transform(df['Sex']) # male=1, female=0
df['Embarked'] = le_embarked.fit_transform(df['Embarked']) # S=2, C=0, Q=1
# Final dataset preview
df.head()
Output:
| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 2 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 2 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 2 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 2 |
Now we have cleaned and preprocessed the data. Let’s move ahead.
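One caveat: LabelEncoder turns Embarked into arbitrary integers (C=0, Q=1, S=2), which implies an ordering that doesn’t exist. A common alternative, shown here only as a sketch rather than as part of this project’s pipeline, is one-hot encoding with pd.get_dummies applied to the raw column before any integer mapping:

# Hypothetical alternative to LabelEncoder for the Embarked column:
# one-hot encode the raw values before any integer mapping is applied
raw = pd.read_csv('/content/Titanic-Dataset.csv')
raw['Embarked'] = raw['Embarked'].fillna(raw['Embarked'].mode()[0])
embarked_dummies = pd.get_dummies(raw['Embarked'], prefix='Embarked', drop_first=True)
print(embarked_dummies.head())  # indicator columns Embarked_Q and Embarked_S

Tree-based models are largely indifferent to the encoding, but for linear models like Logistic Regression, one-hot encoding avoids implying that S > Q > C.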
In this step, we will train the following classification algorithms on the Titanic dataset to predict survival:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Tree
- Random Forest
Let’s train them all in a single script and compare their accuracy. Use the code below:
# Step 5: Train and Evaluate ML Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize models
models = {
"Logistic Regression": LogisticRegression(max_iter=200),
"K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(n_estimators=100)
}
# Train and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)           # Train the model
    y_pred = model.predict(X_test)        # Make predictions
    acc = accuracy_score(y_test, y_pred)  # Compute accuracy
    print(f"{name} Accuracy: {acc:.4f}")
Output:
Logistic Regression Accuracy: 0.8101
K-Nearest Neighbors Accuracy: 0.7039
Decision Tree Accuracy: 0.7709
Random Forest Accuracy: 0.8268
From the output, we can see that the Random Forest Classifier achieved the highest accuracy at 82.68%, followed closely by Logistic Regression at 81.01%. The Decision Tree Classifier performed moderately well (77.09%), though single decision trees have a tendency to overfit. K-Nearest Neighbors (KNN) trailed behind with the lowest accuracy, 70.39%, which may be due to the choice of "k" neighbors and KNN’s sensitivity to feature scaling.
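To test the feature-scaling explanation for KNN, one option (a sketch, not part of the original script) is to standardize the features inside a pipeline and compare cross-validated accuracy:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# KNN with and without standardization, scored by 5-fold cross-validation
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("KNN (raw):   ", cross_val_score(knn_raw, X, y, cv=5).mean())
print("KNN (scaled):", cross_val_score(knn_scaled, X, y, cv=5).mean())

If the scaled pipeline scores noticeably higher, that supports the scaling explanation; tuning n_neighbors is another avenue worth exploring.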
Colab Link -
https://colab.research.google.com/drive/10WTMZomI4OhO8Hy7fBM_Y0tryLTUOySg