Titanic Survival Prediction in R: Complete Guide with Code

By Rohit Sharma

Updated on Aug 05, 2025

This project focuses on predicting survival on the Titanic using R in Google Colab. It is designed for beginners who want to start learning R. This blog outlines all the steps involved, including setting up the environment, cleaning the dataset, building a Random Forest model, and evaluating its performance.

The Titanic dataset includes passenger details like age, sex, class, and embarkation point, which are used to predict survival outcomes. The project also includes a clear visualization of survival patterns by gender and class, making the insights more intuitive.


Improve Your Programming Skills With: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Much Time and Skill Does This Titanic Survival Prediction Project Require?

This Titanic survival prediction project is an entry-level project. The timeline and skill levels are mentioned in the table below.

Aspect      | Details
Duration    | 2 to 4 hours (including setup and modeling)
Difficulty  | Beginner-friendly
Skill Level | Basic knowledge of R and machine learning concepts


Key Concepts to Know Before Building This Titanic Survival Prediction Model in R

Before starting this project, it’s helpful to be familiar with the following concepts:

  • R Basics: Understanding data frames, variables, and running code in R or Google Colab.
  • Data Cleaning: Ability to identify and handle missing or irrelevant data.
  • Data Types: Knowing how to work with categorical and numeric variables.
  • Classification Logic: Basic knowledge of how classification models work.
  • Fundamentals of Random Forest: Awareness of how decision trees combine to make predictions.
  • Evaluation Metrics: Interpreting confusion matrices, accuracy, recall, and precision.

Libraries and Tools Used in This Titanic Survival Prediction Project

The following libraries and tools come in handy while building this project and help it run smoothly.

Library/Tool    | Purpose
tidyverse       | Data manipulation (dplyr) and visualization (ggplot2)
caret           | Data splitting, model training, and evaluation
randomForest    | Building the Random Forest classification model
e1071           | Provides additional machine learning functions (e.g., SVM)
Google Colab (R)| Cloud-based coding environment to run R code

A Walkthrough of the Titanic Survival Prediction Project in R

Below is a breakdown of the entire Titanic Survival Prediction project, ideal for beginners working in R using Google Colab. Each section will explain what is happening and include code with comments to help you understand the process better.

Step 1: Preparing Your Environment to Run R Code in Colab

Since this project is made using the R programming language, the first step is to ensure our Colab notebook is configured to interpret R code. Google Colab supports multiple languages, and selecting R helps us directly use R syntax, libraries, and visualizations without additional setup.

To do this, start a new Colab notebook and change the default runtime settings from Python to R. Once that’s done, save the changes, and we’re ready to move on with the next step. 
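A quick sanity check that the runtime actually switched: run any R expression in a cell. For example, printing the interpreter version confirms the notebook is executing R rather than Python.

# If this prints an R version string, the runtime is set up correctly
R.version.string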

Read More: Customer Segmentation Project Using R: A Step-by-Step Guide

Step 2: Install and Load the Required R Libraries

Before building the model, we need to install and load the core R libraries that will help us handle data, build machine learning models, and evaluate their performance. We need to install these packages only once. The code to install and load the libraries is given below:

# Install required packages (run only once)
install.packages("tidyverse")     # For data manipulation and visualization
install.packages("caret")         # For machine learning
install.packages("randomForest")  # Random Forest model
install.packages("e1071")         # For SVM and other classifiers

# Load the packages into memory
library(tidyverse)                # Loads ggplot2, dplyr, etc.
library(caret)                    # Tools for model training and evaluation
library(randomForest)             # Random Forest algorithm
library(e1071)                    # Support Vector Machines and more

Once the libraries are loaded, you’re set to work with the dataset. The output for the above code is:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin
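If you plan to rerun the notebook, reinstalling everything each time is slow. A small sketch (same four packages as above) that installs only what is missing before loading:

# Install only missing packages, then load them all
pkgs <- c("tidyverse", "caret", "randomForest", "e1071")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))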

Step 3: Import the Titanic Dataset

Once the libraries are loaded, the next step is to load the Titanic dataset into your R session. This dataset contains information about passengers, which we’ll use to train our survival prediction model. Viewing the top rows helps us confirm that the data was read correctly and gives an initial look at its structure. The code for this step is given below:

# Read the Titanic dataset
titanic_data <- read.csv("Titanic-Dataset.csv")  # Load the CSV file into a data frame

# View the first few rows of the data
head(titanic_data)  # Display the top rows to preview the dataset

The table below gives us a sneak peek of the dataset (empty Cabin entries are missing values).

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
<int> | <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr>
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 |  | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 |  | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 |  | Q
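A note on file availability: read.csv() expects Titanic-Dataset.csv to be in the working directory, so upload it through Colab's Files panel first. If you would rather fetch it in code, here is a sketch using a commonly mirrored public copy of the dataset; the URL is an assumption, not the file used in this article, though the columns match the classic Kaggle training set.

# Download a public mirror of the Titanic training data if the file is absent
# (assumption: this datasciencedojo mirror stays available)
if (!file.exists("Titanic-Dataset.csv")) {
  download.file(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    destfile = "Titanic-Dataset.csv"
  )
}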


Here’s an R Project: How to Build an Uber Data Analysis Project in R

Step 4: Explore the Structure and Quality of the Data

Before we start cleaning or modeling, we need to see what the dataset looks like. In this step, we will check the data types, get summary statistics, and identify any missing values that may need attention later. The code for this step is:

# Check structure: column names, data types, and first few records
str(titanic_data)  # Helps understand the format of each column

# Summary statistics to understand numerical columns
summary(titanic_data)  # View min, max, mean, median, and NA counts

# Check for missing values in each column
colSums(is.na(titanic_data))  # Count how many NAs exist in each column

The output for this section is given below:

'data.frame': 891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

  PassengerId       Survived          Pclass          Name
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character
 Mean   :446.0   Mean   :0.3838   Mean   :2.309
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :891.0   Max.   :1.0000   Max.   :3.000

     Sex                 Age            SibSp           Parch
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000
                    NA's   :177

    Ticket               Fare           Cabin             Embarked
 Length:891         Min.   :  0.00   Length:891         Length:891
 Class :character   1st Qu.:  7.91   Class :character   Class :character
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character
                    Mean   : 32.20
                    3rd Qu.: 31.00
                    Max.   :512.33

PassengerId    Survived      Pclass        Name         Sex         Age
          0           0           0           0           0         177
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          0           0           0           0           0           0

The above output means that:

  • The dataset has 891 rows (passengers) and 12 columns (features like Age, Sex, Fare, etc.).
  • Columns have mixed data types: numeric (Age, Fare), integer (Survived, Pclass), and character (Name, Sex, Cabin).
  • The Age column has 177 missing values that need to be handled before modeling.

Step 5: Create a Copy of the Dataset for Cleaning

This step ensures we preserve the original dataset by working on a duplicate, so we can safely modify and preprocess the data. The code for this step is:

# Make a copy to work on
titanic_clean <- titanic_data   # Creating a duplicate of the dataset for cleaning

# View column names
names(titanic_clean)            # Displaying all column names in the dataset

The output for the above code gives us the column names of the dataset:

  • 'PassengerId'
  • 'Survived'
  • 'Pclass'
  • 'Name'
  • 'Sex'
  • 'Age'
  • 'SibSp'
  • 'Parch'
  • 'Ticket'
  • 'Fare'
  • 'Cabin'
  • 'Embarked'

Try Building This R Project: Wine Quality Prediction Project in R

Step 6: Drop Irrelevant Columns and Handle Missing Data

This step focuses on cleaning the dataset by removing non-informative columns and handling missing values in critical features like Age and Embarked. The code for this step is:

# Drop columns that are not useful for prediction
# (like PassengerId, Name, Ticket, Cabin)
titanic_clean <- titanic_clean %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)

# Handle missing Age: replace with median age
titanic_clean$Age[is.na(titanic_clean$Age)] <- median(titanic_clean$Age, na.rm = TRUE)

# Handle missing Embarked: replace with mode (most common value)
titanic_clean$Embarked[is.na(titanic_clean$Embarked)] <- names(sort(table(titanic_clean$Embarked), decreasing = TRUE))[1]

# Check again for missing values
colSums(is.na(titanic_clean))  # Confirm no NA values remain

The above code cleans the data and removes irrelevant columns. The output for this step is:

Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked
       0        0        0        0        0        0        0        0
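One caveat: in this CSV, the missing Embarked values are stored as empty strings rather than NA, so the is.na() check above leaves them untouched (you will notice a stray "" factor level in Step 8's output). A sketch that also treats blanks as missing, run before the factor conversion in Step 7:

# Impute blank or NA Embarked entries with the most common port
blank <- is.na(titanic_clean$Embarked) | titanic_clean$Embarked == ""
titanic_clean$Embarked[blank] <- names(which.max(table(titanic_clean$Embarked)))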

Step 7: Convert Categorical Variables into Factor Types

In R, classification models expect categorical variables to be explicitly encoded as factors rather than numbers or strings. This step will ensure that variables like Survived, Pclass, Sex, and Embarked are correctly formatted. The code for this step is:


# Convert categorical variables to factors
titanic_clean$Survived <- as.factor(titanic_clean$Survived)   # Target variable
titanic_clean$Pclass   <- as.factor(titanic_clean$Pclass)     # Passenger class
titanic_clean$Sex      <- as.factor(titanic_clean$Sex)        # Gender
titanic_clean$Embarked <- as.factor(titanic_clean$Embarked)   # Port of embarkation

The above step:

  • Converts key columns (Survived, Pclass, Sex, Embarked) into factors to represent categorical data.
  • Helps machine learning models understand these columns as categories, not numbers or strings.
  • Ensures correct data type handling for classification tasks.
  • Prepares the dataset for modeling steps like training a decision tree or random forest.
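For reference, the same four conversions can be written in one step with dplyr's across(), which is available in dplyr 1.0 and later (already loaded via tidyverse):

# One-line equivalent of the four as.factor() calls above
titanic_clean <- titanic_clean %>%
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor))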

Step 8: Confirm Structure After Cleaning

This section helps verify that all previous cleaning steps have been correctly applied and the data is ready for modeling. Here’s the code:

# View structure again to confirm changes

str(titanic_clean)

The output confirms that the dataset now has the structure we need for modeling:

'data.frame': 891 obs. of  8 variables:
 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age     : num  22 38 26 35 35 28 54 2 27 14 ...
 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Must Read: R For Data Science: Why Should You Choose R for Data Science?

Step 9: Split the Data into Training and Testing Sets

This step prepares the dataset for model training and evaluation by dividing it into two parts, one for training and one for testing. The code for this step is:

# Set seed to get the same result every time
set.seed(123)

# Create the split (70% train, 30% test)
train_index <- createDataPartition(titanic_clean$Survived, p = 0.7, list = FALSE)

# Split the data
train_data <- titanic_clean[train_index, ]
test_data  <- titanic_clean[-train_index, ]

The above step:

  • Sets a random seed to ensure reproducible results.
  • Splits the dataset: 70% for training the model and 30% for testing it.
  • createDataPartition ensures that the distribution of the target variable (Survived) remains consistent in both sets.
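A quick way to verify that the stratified split worked is to compare the survival rate in each subset; the proportions should be nearly identical:

# Class balance check: both splits should show roughly 62% / 38%
prop.table(table(train_data$Survived))
prop.table(table(test_data$Survived))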

Step 10: Train the Random Forest Model

In this step, we train a Random Forest classifier using the training data to predict whether a passenger survived or not. Here’s the code:

# Train the Random Forest model
set.seed(123)  # For reproducibility

rf_model <- randomForest(Survived ~ ., data = train_data, ntree = 100)

# View the model summary
print(rf_model)

The output for this code is:

Call:
 randomForest(formula = Survived ~ ., data = train_data, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 16.64%
Confusion matrix:
    0   1 class.error
0 344  41   0.1064935
1  63 177   0.2625000

The above output shows that:

  • The model built 100 decision trees to classify whether a passenger survived or not.
  • It predicts correctly about 83% of the time (OOB error rate is 16.64%).
  • It performs better at predicting non-survivors (class 0) than survivors (class 1).
  • Survivors are harder to predict accurately, with about 26% being misclassified.
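Random Forests also report which predictors contribute most to the splits. Both functions below ship with the randomForest package, so this is a natural addition at this point:

# Mean decrease in Gini impurity per predictor (higher = more important)
importance(rf_model)

# Dot plot of the same importance scores
varImpPlot(rf_model)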

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 11: Making Predictions on Test Data

This step uses the trained Random Forest model to make survival predictions on the unseen test data. We also preview the first few predicted values. Here’s the code:

# Predict on test data using the trained model
predictions <- predict(rf_model, newdata = test_data)

# Show first few predictions to check output
head(predictions)

The output is a short factor vector of 0s and 1s (rendered as an image in the notebook). It means that:

  • The output shows the first few predictions made by the model.
  • Each value (0 or 1) represents if a passenger did not survive (0) or survived (1).
  • The numbers on the left are just the row numbers.
  • “Levels: '0' '1'” means this is a factor variable with two possible outcomes.
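If you need scores rather than hard labels (for example, to pick a different decision threshold), predict() can also return class probabilities for a classification forest:

# Probability of each class instead of a 0/1 label
pred_probs <- predict(rf_model, newdata = test_data, type = "prob")
head(pred_probs)  # Columns "0" and "1" sum to 1 for each passenger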

Step 12: Evaluate the Model Using a Confusion Matrix

In this step, we check how well our Random Forest model performed by comparing its predictions with the actual survival values. Here’s the code:

# Confusion matrix to compare predictions vs actual outcomes
confusionMatrix(predictions, test_data$Survived)

Here’s the output of the above code:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 142  31
         1  22  71

               Accuracy : 0.8008
                 95% CI : (0.7476, 0.847)
    No Information Rate : 0.6165
    P-Value [Acc > NIR] : 7.74e-11

                  Kappa : 0.5715

 Mcnemar's Test P-Value : 0.2718

            Sensitivity : 0.8659
            Specificity : 0.6961
         Pos Pred Value : 0.8208
         Neg Pred Value : 0.7634
             Prevalence : 0.6165
         Detection Rate : 0.5338
   Detection Prevalence : 0.6504
      Balanced Accuracy : 0.7810

       'Positive' Class : 0

The above output shows that:

  • Accuracy = 80%
    This means the model correctly predicted survival for 80% of the test cases.
  • Sensitivity = 86.6%
    Of all the people who actually did not survive (class 0), the model correctly identified 86.6% of them.
  • Specificity = 69.6%
    Of all the people who actually survived (class 1), the model correctly predicted 69.6% of them.
  • Kappa = 0.5715
    This shows moderate agreement between the model and the actual values beyond chance; higher is better (the range is -1 to 1).
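As a cross-check on confusionMatrix(), the headline accuracy can also be computed directly:

# Fraction of test passengers classified correctly; should match 0.8008
mean(predictions == test_data$Survived)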

Step 13: Visualize Survival by Gender and Passenger Class

This plot helps us visually explore survival patterns across gender and class. Here’s the code:

# Load ggplot2 (already part of tidyverse)
library(ggplot2)

# Plot: Survival by Gender and Pclass
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "dodge") +   # Side-by-side bars for 0 and 1
  facet_wrap(~ Pclass) +           # Separate plots for each passenger class
  labs(
    title = "Survival by Gender and Passenger Class",
    x = "Gender",
    y = "Count of Passengers",
    fill = "Survived"
  ) +
  theme_minimal()                  # Clean visual theme

The above code produces a faceted bar chart (rendered in the notebook). The graph shows that:

  • Each panel (1, 2, 3) represents a Passenger Class (Pclass): 1st, 2nd, and 3rd class.
  • Bars are grouped by Gender, and color shows survival (0 = did not survive, 1 = survived).
  • Females had higher survival rates than males across all classes, especially in 1st and 2nd class.
  • 3rd class males had the lowest survival, with a very high number of deaths (tall red bar).
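Raw counts can hide rates when group sizes differ (3rd class carried far more passengers). A variant worth trying swaps position = "dodge" for position = "fill", so each bar shows survival proportions instead of counts:

# Same plot, but bars are normalized so each gender/class bar sums to 1
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +    # Stacked proportions instead of counts
  facet_wrap(~ Pclass) +
  labs(
    title = "Survival Rate by Gender and Passenger Class",
    x = "Gender",
    y = "Proportion of Passengers",
    fill = "Survived"
  ) +
  theme_minimal()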

Conclusion

In this Titanic Survival Prediction project, we used R in Google Colab to build a Random Forest classification model that predicts whether a passenger survived based on features like sex, age, passenger class (Pclass), and fare.

We split the dataset into 70% training and 30% testing sets, and after training, the model achieved an accuracy of 80.08%, with a sensitivity of 86.6% (correctly identifying non-survivors, the positive class here) and a specificity of 69.6% (correctly identifying survivors).

Visual analysis showed that survival rates were higher for females in the 1st and 2nd classes, while males in the 3rd class had the lowest survival chances.


Colab Link:
https://colab.research.google.com/drive/1-_OtIvSfpmqRnPYvKhBSVNvVua1nsbk7#scrollTo=olqlOoLY-PNS

