Titanic Dataset - Survival Prediction Model Using Machine Learning

By Rohit Sharma

Updated on Jul 30, 2025 | 9 min read | 1.46K+ views


The Titanic dataset is a classic in machine learning. It contains details of passengers aboard the Titanic, such as age, gender, class, fare, and more. The primary objective is to build a classification model that predicts whether a passenger will survive or not.

In this project, we will preprocess the dataset, handle missing values, and apply classification algorithms to identify which factors most influenced survival. We will then evaluate and compare multiple models to find the most effective one.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming basics
  • Pandas and NumPy for data handling
  • Exploratory Data Analysis (EDA)
  • Core machine learning concepts, such as classification and model evaluation

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

  • Python: Main programming language used to build and run the model
  • Pandas: Data loading, cleaning, and manipulation
  • NumPy: Numerical operations and array handling
  • Matplotlib & Seaborn: Data visualization and pattern analysis
  • Scikit-learn (sklearn): Preprocessing, model training, and evaluation
  • Google Colab: Interactive environment to write and execute code

Models That Will Be Used

Below are the models we will use, and why:

  • Logistic Regression: Simple, interpretable model, ideal for binary classification like survival prediction
  • K-Nearest Neighbors (KNN): Easy-to-understand, distance-based algorithm that classifies survival based on similarity
  • Decision Tree Classifier: Helps visualize decision-making based on key features like age, class, and gender
  • Random Forest Classifier: An ensemble method that reduces overfitting and improves prediction accuracy

Time Taken and Difficulty Level

On average, this project takes about 1 to 2 hours to complete. The duration may vary depending on your familiarity with Python, EDA (Exploratory Data Analysis), and ML concepts. It is best suited for beginners.

How to Build a Titanic Survival Prediction Model

Let’s start building the project from scratch. We will:

  • Load the Titanic dataset provided by Kaggle
  • Perform exploratory data analysis (EDA) to understand trends and missing values
  • Clean and preprocess the data (handle missing values, drop unnecessary columns)
  • Convert categorical variables into a numeric format
  • Select relevant features for the prediction task
  • Train multiple machine learning models
  • Evaluate and compare their performance using accuracy and other metrics

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the Titanic survival prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/yasserh/titanic-dataset/data.
  3. On the Titanic Dataset page, in the right pane, under the Data Explorer section, click Titanic-Dataset.csv.
  4. Click the download icon.
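Alternatively, if you have the Kaggle CLI installed and an API token configured (an optional path, not part of the steps above), you can fetch the same dataset directly from a Colab cell:

# Optional: download via the Kaggle CLI (requires kaggle.json credentials)
# The zip filename below follows the CLI's default naming and may differ
!kaggle datasets download -d yasserh/titanic-dataset
!unzip -o titanic-dataset.zip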

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and upload the file.

from google.colab import files
uploaded = files.upload()

Once uploaded, read it using Pandas. Here’s the code to do so:

import pandas as pd
# Load the dataset
df = pd.read_csv('Titanic-Dataset.csv')
# Show the first few rows
df.head()

Output:

 

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

What does the output tell us?

The output tells us that:

  • The target column Survived contains binary values: 0 (did not survive) and 1 (survived). This confirms that it’s a binary classification problem.
  • The dataset contains a mix of numerical (Age, Fare, etc.), categorical (Sex, Embarked), and textual (Name, Ticket) features.
  • Some columns, like Cabin, have many missing values (NaN). Age and Embarked also appear to have a few missing entries.

Now we have some idea of the structure and key columns. In the next step, we will explore the data more closely to spot patterns that might influence survival.
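Before moving on, here is a quick sanity check on the class balance (a minimal snippet reusing the df loaded above):

# Share of passengers in each Survived class (0 = did not survive, 1 = survived)
print(df['Survived'].value_counts(normalize=True))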

Step 3: Explore the Dataset (EDA)

In this step, we will use visualizations and statistics to examine relationships between features like gender, class, and age with survival outcomes.

Use the code below to accomplish the same:

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("Titanic-Dataset.csv")
# Display first 5 rows
print("First 5 Rows:")
print(df.head())
# Dataset info
print("\nDataset Info:")
print(df.info())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Descriptive statistics
print("\nSummary Statistics:")
print(df.describe())
# Survival count plot
plt.figure(figsize=(5, 4))
sns.countplot(x='Survived', data=df)
plt.title("Survival Count (0 = No, 1 = Yes)")
plt.show()
# Survival by gender
plt.figure(figsize=(6, 4))
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival by Gender")
plt.show()
# Survival by passenger class
plt.figure(figsize=(6, 4))
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title("Survival by Passenger Class")
plt.show()
# Age distribution
plt.figure(figsize=(6, 4))
sns.histplot(df['Age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution of Passengers")
plt.xlabel("Age")
plt.show()

Output:

First 5 Rows:

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

Missing Values:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Summary Statistics:

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

What does the output tell us?

The output tells us that:

  • More passengers did not survive (Survived = 0) than survived (Survived = 1): roughly 38% survived and 62% did not.
  • A significantly higher proportion of females survived compared to males (verified numerically below).
  • Survival was highest in 1st class and dropped sharply in 3rd class.
  • Most passengers were between 20 and 40 years old.
  • The Cabin column has the most missing values (687).
  • Age and Embarked also have missing data, but at manageable levels.
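To back the gender and class observations with numbers, here is a minimal check using the same df (run it before the encoding in the next step, while Sex still holds text labels):

# Mean of the binary Survived column within each group = survival rate
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby('Pclass')['Survived'].mean())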

Step 4: Data Preprocessing

Now that we have identified missing values and key variables, it’s time to clean the dataset and make it model-ready. In this step, we will:

  • Handle missing values
  • Encode categorical variables
  • Drop irrelevant columns
  • Scale or transform features, if needed

Use the code given below to achieve this:

# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the dataset again (if not already loaded)
df = pd.read_csv('/content/Titanic-Dataset.csv')
# Drop columns that are irrelevant for prediction
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
# Fill missing values (direct assignment avoids pandas chained-assignment warnings)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Encode categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
df['Sex'] = le_sex.fit_transform(df['Sex'])                 # female=0, male=1
df['Embarked'] = le_embarked.fit_transform(df['Embarked'])  # C=0, Q=1, S=2
# Final dataset preview
df.head()

Output:

 

   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    1  22.0      1      0   7.2500         2
1         1       1    0  38.0      1      0  71.2833         0
2         1       3    0  26.0      0      0   7.9250         2
3         1       1    0  35.0      1      0  53.1000         2
4         0       3    1  35.0      0      0   8.0500         2

Now we have cleaned and preprocessed the data. Let’s move ahead.
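One caveat: LabelEncoder imposes an arbitrary numeric order on Embarked (C=0, Q=1, S=2). Tree-based models are insensitive to this, but a linear model like Logistic Regression can be mildly affected. Here is a hedged alternative sketch that applies the same cleaning steps but uses one-hot encoding instead (df_alt is an illustrative name, not part of the original code):

# Alternative preprocessing with one-hot encoding instead of LabelEncoder
import pandas as pd
df_alt = pd.read_csv('/content/Titanic-Dataset.csv')
df_alt.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_alt['Age'] = df_alt['Age'].fillna(df_alt['Age'].median())
df_alt['Embarked'] = df_alt['Embarked'].fillna(df_alt['Embarked'].mode()[0])
# drop_first=True avoids redundant indicator columns
df_alt = pd.get_dummies(df_alt, columns=['Sex', 'Embarked'], drop_first=True)
df_alt.head()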

Step 5: Train and Evaluate ML Models

In this step, we will train the classification algorithms listed below on the Titanic dataset to predict survival:

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Decision Tree Classifier
  • Random Forest Classifier

Let’s train them all in a single script and compare their accuracy. Use the code below:

# Step 5: Train and Evaluate ML Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100)
}
# Train and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)                    # Train the model
    y_pred = model.predict(X_test)                 # Make predictions
    acc = accuracy_score(y_test, y_pred)           # Compute accuracy
    print(f"{name} Accuracy: {acc:.4f}")

Output:

Logistic Regression Accuracy: 0.8101
K-Nearest Neighbors Accuracy: 0.7039
Decision Tree Accuracy: 0.7709
Random Forest Accuracy: 0.8268

From the output, we can see that the Random Forest Classifier achieved the highest accuracy: 82.68%.
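Keep in mind that these numbers come from a single 80/20 split and can shift with a different random_state. As a quick robustness check, here is a short sketch using 5-fold cross-validation on the full feature set (X and y as defined above; the exact scores will differ slightly from the results printed earlier):

# 5-fold cross-validation gives a more stable accuracy estimate than one split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Random Forest CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")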


Conclusion

The Random Forest Classifier produced the best results of all the models tested, with the highest accuracy in the group at 82.68%. Logistic Regression also did well, with a strong accuracy of 81.01%.

In contrast, the Decision Tree Classifier performed moderately well (77.09%), albeit with a slight propensity for overfitting. K-Nearest Neighbors (KNN) trailed behind with the lowest accuracy, 70.39%. This is likely a result of the choice of "k" and KNN's sensitivity to feature scaling, since distance-based methods can be dominated by wide-range features like Fare, as the sketch below illustrates.
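To illustrate the scaling point, here is a hedged sketch that standardizes the features before KNN, reusing X_train, X_test, y_train, and y_test from Step 5 (the resulting accuracy is not part of the original comparison and will vary):

# Standardize features so wide-range columns like Fare don't dominate distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
print(f"Scaled KNN accuracy: {accuracy_score(y_test, knn_scaled.predict(X_test)):.4f}")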


Colab Link -
https://colab.research.google.com/drive/10WTMZomI4OhO8Hy7fBM_Y0tryLTUOySg


