Titanic Dataset - Survival Prediction Model Using Machine Learning

By Rohit Sharma

Updated on Jul 30, 2025 | 9 min read | 1.46K+ views


The Titanic dataset is a classic in machine learning. It contains details of passengers aboard the Titanic, such as age, gender, class, fare, and more. The primary objective is to build a classification model that predicts whether a passenger will survive or not.

In this project, we will preprocess the dataset, handle missing values, and apply classification algorithms to identify which factors most influenced survival. We will then evaluate and compare multiple models to find the most effective one.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming basics
  • Pandas and NumPy for data handling
  • Exploratory Data Analysis (EDA)
  • Core machine learning concepts, such as classification and model evaluation

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

  • Python: Main programming language used to build and run the model
  • Pandas: Data loading, cleaning, and manipulation
  • NumPy: Numerical operations and array handling
  • Matplotlib & Seaborn: Data visualization and pattern analysis
  • Scikit-learn (sklearn): Preprocessing, model training, and evaluation
  • Google Colab: Interactive environment to write and execute code

Models That Will Be Used

Below are the models we will use, and why:

  • Logistic Regression: Simple, interpretable model, ideal for binary classification like survival prediction
  • K-Nearest Neighbors (KNN): Easy-to-understand, distance-based algorithm that classifies survival based on similarity
  • Decision Tree Classifier: Helps visualize decision-making based on key features like age, class, and gender
  • Random Forest Classifier: An ensemble method that reduces overfitting and improves prediction accuracy

Time Taken and Difficulty Level

On average, this project takes about 1 to 2 hours to complete. The duration may vary depending on your familiarity with Python, EDA (Exploratory Data Analysis), and ML concepts. It is best suited for beginners.

How to Build a Titanic Survival Prediction Model

Let’s start building the project from scratch. We will:

  • Load the Titanic dataset provided by Kaggle
  • Perform exploratory data analysis (EDA) to understand trends and missing values
  • Clean and preprocess the data (handle missing values, drop unnecessary columns)
  • Convert categorical variables into a numeric format
  • Select relevant features for the prediction task
  • Train multiple machine learning models
  • Evaluate and compare their performance using accuracy and other metrics

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the Titanic survival prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/yasserh/titanic-dataset/data.
  3. On the Titanic Dataset page, in the right pane, under the Data Explorer section, click Titanic-Dataset.csv.
  4. Click the download icon.
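Alternatively, if you have the Kaggle CLI installed and an API token configured (an optional path, not part of the steps above), you can fetch the same dataset directly from a Colab cell:

# Optional: download via the Kaggle CLI (requires kaggle.json credentials)
# The zip filename below follows the CLI's default naming and may differ
!kaggle datasets download -d yasserh/titanic-dataset
!unzip -o titanic-dataset.zip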

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and upload the file.

from google.colab import files
uploaded = files.upload()

Once uploaded, read it using Pandas. Here’s the code to do so:

import pandas as pd
# Load the dataset
df = pd.read_csv('Titanic-Dataset.csv')
# Show the first few rows
df.head()

Output:

 

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

What does the output tell us?

The output tells us that:

  • The target column Survived contains binary values: 0 (did not survive) and 1 (survived). This confirms that it’s a binary classification problem.
  • The dataset contains a mix of numerical (Age, Fare, etc.), categorical (Sex, Embarked), and textual (Name, Ticket) features.
  • Some columns, like Cabin, have many missing values (NaN). Age and Embarked also appear to have a few missing entries.

Now we have some idea of the structure and key columns. In the next step, we will explore the data more closely to spot patterns that might influence survival.
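Before moving on, here is a quick sanity check on the class balance (a minimal snippet reusing the df loaded above):

# Share of passengers in each Survived class (0 = did not survive, 1 = survived)
print(df['Survived'].value_counts(normalize=True))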

Step 3: Explore the Dataset (EDA)

In this step, we will use visualizations and statistics to examine relationships between features like gender, class, and age with survival outcomes.

Use the code below to accomplish the same:

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("Titanic-Dataset.csv")
# Display first 5 rows
print("First 5 Rows:")
print(df.head())
# Dataset info
print("\nDataset Info:")
print(df.info())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Descriptive statistics
print("\nSummary Statistics:")
print(df.describe())
# Survival count plot
plt.figure(figsize=(5, 4))
sns.countplot(x='Survived', data=df)
plt.title("Survival Count (0 = No, 1 = Yes)")
plt.show()
# Survival by gender
plt.figure(figsize=(6, 4))
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival by Gender")
plt.show()
# Survival by passenger class
plt.figure(figsize=(6, 4))
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title("Survival by Passenger Class")
plt.show()
# Age distribution
plt.figure(figsize=(6, 4))
sns.histplot(df['Age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution of Passengers")
plt.xlabel("Age")
plt.show()

Output:

First 5 Rows:

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

Missing Values:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Summary Statistics:

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

What does the output tell us?

The output tells us that:

  • More passengers did not survive (Survived = 0) than survived (Survived = 1): roughly 38% survived and 62% did not.
  • A significantly higher proportion of females survived compared to males (verified numerically below).
  • Survival was highest in 1st class and dropped sharply in 3rd class.
  • Most passengers were between 20 and 40 years old.
  • The Cabin column has the most missing values (687).
  • Age and Embarked also have missing data, but at manageable levels.
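To back the gender and class observations with numbers, here is a minimal check using the same df (run it before the encoding in the next step, while Sex still holds text labels):

# Mean of the binary Survived column within each group = survival rate
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby('Pclass')['Survived'].mean())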

Step 4: Data Preprocessing

Now that we have identified missing values and key variables, it’s time to clean the dataset and make it model-ready. In this step, we will:

  • Handle missing values
  • Encode categorical variables
  • Drop irrelevant columns
  • Scale or transform features, if needed

Use the code given below to achieve this:

# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the dataset again (if not already loaded)
df = pd.read_csv('/content/Titanic-Dataset.csv')
# Drop columns that are irrelevant for prediction
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
# Fill missing values (direct assignment avoids pandas chained-assignment warnings)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Encode categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
df['Sex'] = le_sex.fit_transform(df['Sex'])                 # female=0, male=1
df['Embarked'] = le_embarked.fit_transform(df['Embarked'])  # C=0, Q=1, S=2
# Final dataset preview
df.head()

Output:

 

   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    1  22.0      1      0   7.2500         2
1         1       1    0  38.0      1      0  71.2833         0
2         1       3    0  26.0      0      0   7.9250         2
3         1       1    0  35.0      1      0  53.1000         2
4         0       3    1  35.0      0      0   8.0500         2

Now we have cleaned and preprocessed the data. Let’s move ahead.
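One caveat: LabelEncoder imposes an arbitrary numeric order on Embarked (C=0, Q=1, S=2). Tree-based models are insensitive to this, but a linear model like Logistic Regression can be mildly affected. Here is a hedged alternative sketch that applies the same cleaning steps but uses one-hot encoding instead (df_alt is an illustrative name, not part of the original code):

# Alternative preprocessing with one-hot encoding instead of LabelEncoder
import pandas as pd
df_alt = pd.read_csv('/content/Titanic-Dataset.csv')
df_alt.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_alt['Age'] = df_alt['Age'].fillna(df_alt['Age'].median())
df_alt['Embarked'] = df_alt['Embarked'].fillna(df_alt['Embarked'].mode()[0])
# drop_first=True avoids redundant indicator columns
df_alt = pd.get_dummies(df_alt, columns=['Sex', 'Embarked'], drop_first=True)
df_alt.head()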

Step 5: Train and Evaluate ML Models

In this step, we will train the classification algorithms listed below on the Titanic dataset to predict survival:

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Decision Tree Classifier
  • Random Forest Classifier

Let’s train them all in a single script and compare their accuracy. Use the code below:

# Step 5: Train and Evaluate ML Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100)
}
# Train and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)                    # Train the model
    y_pred = model.predict(X_test)                 # Make predictions
    acc = accuracy_score(y_test, y_pred)           # Compute accuracy
    print(f"{name} Accuracy: {acc:.4f}")

Output:

Logistic Regression Accuracy: 0.8101
K-Nearest Neighbors Accuracy: 0.7039
Decision Tree Accuracy: 0.7709
Random Forest Accuracy: 0.8268

From the output, we can see that the Random Forest Classifier achieved the highest accuracy: 82.68%.
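Keep in mind that these numbers come from a single 80/20 split and can shift with a different random_state. As a quick robustness check, here is a short sketch using 5-fold cross-validation on the full feature set (X and y as defined above; the exact scores will differ slightly from the results printed earlier):

# 5-fold cross-validation gives a more stable accuracy estimate than one split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Random Forest CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")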


Conclusion

The Random Forest Classifier produced the best results of all the models tested, with the highest accuracy in the group at 82.68%. Logistic Regression also did well, with a strong accuracy of 81.01%.

In contrast, the Decision Tree Classifier performed moderately well (77.09%), albeit with a slight propensity for overfitting. K-Nearest Neighbors (KNN) trailed behind with the lowest accuracy, 70.39%. This is likely a result of the choice of "k" and KNN's sensitivity to feature scaling, since distance-based methods can be dominated by wide-range features like Fare, as the sketch below illustrates.
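To illustrate the scaling point, here is a hedged sketch that standardizes the features before KNN, reusing X_train, X_test, y_train, and y_test from Step 5 (the resulting accuracy is not part of the original comparison and will vary):

# Standardize features so wide-range columns like Fare don't dominate distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
print(f"Scaled KNN accuracy: {accuracy_score(y_test, knn_scaled.predict(X_test)):.4f}")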


Colab Link -
https://colab.research.google.com/drive/10WTMZomI4OhO8Hy7fBM_Y0tryLTUOySg


