Titanic Survival Prediction in R: Complete Guide with Code
By Rohit Sharma
Updated on Aug 05, 2025 | 1.26K+ views
This project focuses on predicting survival on the Titanic using R in Google Colab. It is designed for beginners who want to start learning R. This blog walks through all the steps involved: setting up the environment, cleaning the dataset, building a Random Forest model, and evaluating its performance.
The Titanic dataset includes passenger details like age, sex, class, and embarkation point, which are used to predict survival outcomes. The project also includes a clear visualization of survival patterns by gender and class, making the insights more intuitive.
Improve Your Programming Skills With: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
This Titanic survival prediction project is an entry-level project. The timeline and skill levels are mentioned in the table below.
Aspect | Details
Duration | 2 to 4 hours (including setup and modeling) |
Difficulty | Beginner-friendly |
Skill Level | Basic knowledge of R and machine learning concepts |
Before starting this project, it’s helpful to be familiar with basic R syntax (variables, functions, and data frames) and introductory machine learning concepts such as classification and train/test splits.
The following libraries and tools come in handy while making this project. These ensure that the project runs smoothly.
Library/Tool | Purpose
tidyverse | Data manipulation (dplyr) and visualization (ggplot2) |
caret | Data splitting, model training, and evaluation |
randomForest | Building the Random Forest classification model |
e1071 | Provides additional machine learning functions (e.g., SVM) |
Google Colab (R) | Cloud-based coding environment to run R code |
Below is a breakdown of the entire Titanic Survival Prediction project, ideal for beginners working in R using Google Colab. Each section will explain what is happening and include code with comments to help you understand the process better.
Since this project is made using the R programming language, the first step is to ensure our Colab notebook is configured to interpret R code. Google Colab supports multiple languages, and selecting R helps us directly use R syntax, libraries, and visualizations without additional setup.
To do this, start a new Colab notebook and change the default runtime from Python to R (Runtime > Change runtime type). Once that’s done, save the changes, and we’re ready to move on to the next step.
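If you want to confirm the switch worked, a quick sanity check (not part of the original walkthrough) is to run a single line of R; it should print the R version instead of raising a Python syntax error:
# Verify the notebook is executing R code
R.version.string # e.g., "R version 4.x.x ..."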
Read More: Customer Segmentation Project Using R: A Step-by-Step Guide
Before building the model, we need to install and load the core R libraries that will help us handle data, build machine learning models, and evaluate their performance. We need to install these packages only once. The code to install and load the libraries is given below:
# Install required packages (run only once)
install.packages("tidyverse") # For data manipulation and visualization
install.packages("caret") # For machine learning
install.packages("randomForest") # Random forest model
install.packages("e1071") # For SVM and other classifiers
# Load the packages into memory
library(tidyverse) # Loads ggplot2, dplyr, etc.
library(caret) # Tools for model training and evaluation
library(randomForest) # Random Forest algorithm
library(e1071) # Support Vector Machines and more
Once the libraries are loaded, you’re set to work with the dataset. The output for the above code is:
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin
Once the libraries are loaded, the next step is to load the Titanic dataset into your R session. This dataset contains information about passengers, which we’ll use to train our survival prediction model. Viewing the top rows helps us confirm that the data was read correctly and gives an initial look at its structure. The code for this step is given below:
# Read the Titanic dataset
titanic_data <- read.csv("Titanic-Dataset.csv") # Load the CSV file into a data frame
# View the first few rows of the data
head(titanic_data) # Display the top rows to preview the dataset
The table below gives us a sneak peek of the dataset.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
&lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;chr&gt; | &lt;chr&gt;
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 |  | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 |  | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 |  | Q
Here’s an R Project: How to Build an Uber Data Analysis Project in R
Before we start cleaning or modeling, we need to see what the dataset looks like. In this step, we will check the data types, get summary statistics, and identify any missing values that may need attention later. The code for this step is:
# Check structure: column names, data types, and first few records
str(titanic_data) # Helps understand the format of each column
# Summary statistics to understand numerical columns
summary(titanic_data) # View min, max, mean, median, and NA counts
# Check for missing values in each column
colSums(is.na(titanic_data)) # Count how many NAs exist in each column
The output for this section is given below:
'data.frame': 891 obs. of 12 variables:
 $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr "male" "female" "female" "female" ...
 $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr "" "C85" "" "C123" ...
 $ Embarked   : chr "S" "C" "S" "S" ...

  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     

     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     

    Ticket               Fare            Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33                                        

PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0 
The above output means that:
- The dataset has 891 rows and 12 columns.
- Age is the only column with true NA values (177 of them); every other column reports zero NAs.
- Cabin and Embarked store missing entries as empty strings ("") rather than NA, which is why colSums(is.na(...)) reports 0 for them even though data is missing; the quick check below makes this visible.
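Because empty strings are not NAs, a separate check (an optional addition, not in the original code) shows how much of Cabin and Embarked is actually blank:
# Count empty-string entries per column (these are missed by is.na)
colSums(titanic_data == "", na.rm = TRUE) # Expect Cabin: 687, Embarked: 2 in the standard Kaggle file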
This step ensures we preserve the original dataset by working on a duplicate, so we can safely modify and preprocess the data. The code for this step is:
# Make a copy to work on
titanic_clean <- titanic_data # Creating a duplicate of the dataset for cleaning
# View column names
names(titanic_clean) # Displaying all column names in the dataset
The output for the above code gives us the column names of the dataset:
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"         "Age"
 [7] "SibSp"       "Parch"       "Ticket"      "Fare"        "Cabin"       "Embarked"
Try Building This R Project: Wine Quality Prediction Project in R
This step focuses on cleaning the dataset by removing non-informative columns and handling missing values in critical features like Age and Embarked. The code for this step is:
# Drop columns that are not useful for prediction
# (like PassengerId, Name, Ticket, Cabin)
titanic_clean <- titanic_clean %>%
select(-PassengerId, -Name, -Ticket, -Cabin)
# Handle missing Age: replace with median age
titanic_clean$Age[is.na(titanic_clean$Age)] <- median(titanic_clean$Age, na.rm = TRUE)
# Handle missing Embarked: replace with mode (most common value)
titanic_clean$Embarked[is.na(titanic_clean$Embarked)] <- names(sort(table(titanic_clean$Embarked), decreasing = TRUE))[1]
# Check again for missing values
colSums(is.na(titanic_clean)) # Confirm no NA values remain
The above code cleans the data and removes irrelevant columns. The output for this step is:
Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
       0        0        0        0        0        0        0        0 
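One caveat worth noting: in this CSV the two missing Embarked values are stored as empty strings ("") rather than NA, so the is.na() replacement above never actually changes anything, and an empty "" level will still appear when Embarked is converted to a factor in the next steps. If you want to clean it up, a small optional fix (not part of the original code) is:
# Blank Embarked entries are "" rather than NA; replace them with the mode
titanic_clean$Embarked[titanic_clean$Embarked == ""] <- names(sort(table(titanic_clean$Embarked), decreasing = TRUE))[1]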
In R, machine learning models perform better when categorical variables are explicitly treated as factors. This step will ensure that variables like Survived, Pclass, Sex, and Embarked are correctly formatted. The code for this step is:
# Convert categorical variables to factors
titanic_clean$Survived <- as.factor(titanic_clean$Survived) # Target variable
titanic_clean$Pclass <- as.factor(titanic_clean$Pclass) # Passenger class
titanic_clean$Sex <- as.factor(titanic_clean$Sex) # Gender
titanic_clean$Embarked <- as.factor(titanic_clean$Embarked) # Port of embarkation
The above step:
- Converts the target variable Survived into a factor, so R treats this as a classification problem rather than regression.
- Converts Pclass, Sex, and Embarked into factors, so the model handles them as categories instead of plain numbers or text.
This section helps verify that all previous cleaning steps have been correctly applied and the data is ready for modeling. Here’s the code:
# View structure again to confirm changes
str(titanic_clean)
The output shows us the changed dataset according to what we need for evaluation:
'data.frame': 891 obs. of 8 variables:
 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age     : num 22 38 26 35 35 28 54 2 27 14 ...
 $ SibSp   : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Must Read: R For Data Science: Why Should You Choose R for Data Science?
This step prepares the dataset for model training and evaluation by dividing it into two parts, one for training and one for testing. The code for this step is:
# Set seed to get the same result every time
set.seed(123)
# Create the split
train_index <- createDataPartition(titanic_clean$Survived, p = 0.7, list = FALSE)
# Split the data
train_data <- titanic_clean[train_index, ]
test_data <- titanic_clean[-train_index, ]
The above step:
- Sets a random seed (123) so the split is reproducible every time the notebook runs.
- Uses caret’s createDataPartition() to create a stratified split that preserves the proportion of survivors in both subsets.
- Puts 70% of the rows into train_data and the remaining 30% into test_data (a quick check of this follows below).
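To confirm the stratification worked, an optional check (not in the original code) compares survival proportions in the two subsets; they should be nearly identical:
# Survival proportions should match closely between train and test sets
prop.table(table(train_data$Survived))
prop.table(table(test_data$Survived))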
In this step, we train a Random Forest classifier using the training data to predict whether a passenger survived or not. Here’s the code:
# Train the Random Forest model
set.seed(123) # For reproducibility
rf_model <- randomForest(Survived ~ ., data = train_data, ntree = 100)
# View the model summary
print(rf_model)
The output for this code is:
Call:
 randomForest(formula = Survived ~ ., data = train_data, ntree = 100) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of error rate: 16.64%
Confusion matrix:
    0   1 class.error
0 344  41   0.1064935
1  63 177   0.2625000
The above output shows that:
- The forest consists of 100 classification trees, with 2 variables tried at each split.
- The out-of-bag (OOB) error estimate is 16.64%, meaning roughly 83% of training passengers are classified correctly on samples each tree did not see during training.
- The class error is much lower for non-survivors (about 10.6%) than for survivors (about 26.3%), so the model finds survivors harder to identify.
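If you want to see which features drive the predictions, the randomForest package provides built-in importance measures (an optional exploration, not part of the original walkthrough):
# Inspect feature importance (mean decrease in Gini impurity)
importance(rf_model)
varImpPlot(rf_model) # Plot the same information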
Also Read: 18 Types of Regression in Machine Learning You Should Know
This step uses the trained Random Forest model to make survival predictions on the unseen test data. We also preview the first few predicted values. Here’s the code:
# Predict on test data using the trained model
predictions <- predict(rf_model, newdata = test_data)
# Show first few predictions to check output
head(predictions)
The output lists the predicted class for the first few test passengers: 0 means the model predicts the passenger did not survive, and 1 means it predicts they did.
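To sanity-check the predictions, you can line them up against the true labels (an optional step, not in the original code):
# Compare the first few predictions with the actual outcomes
head(data.frame(predicted = predictions, actual = test_data$Survived))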
In this step, we check how well our Random Forest model performed by comparing its predictions with the actual survival values. Here’s the code:
# Confusion matrix to compare predictions vs actual outcomes
confusionMatrix(predictions, test_data$Survived)
Here’s the output of the above code:
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 142  31
         1  22  71

               Accuracy : 0.8008
                 95% CI : (0.7476, 0.847)
    No Information Rate : 0.6165
    P-Value [Acc > NIR] : 7.74e-11

                  Kappa : 0.5715

 Mcnemar's Test P-Value : 0.2718

            Sensitivity : 0.8659
            Specificity : 0.6961
         Pos Pred Value : 0.8208
         Neg Pred Value : 0.7634
             Prevalence : 0.6165
         Detection Rate : 0.5338
   Detection Prevalence : 0.6504
      Balanced Accuracy : 0.7810

       'Positive' Class : 0
The above output shows that:
- The model reaches 80.08% accuracy on the test set, clearly better than the 61.65% no-information rate (the accuracy of always predicting the majority class).
- Because the 'positive' class here is 0 (did not survive), the sensitivity of 0.8659 means about 87% of non-survivors were identified correctly, while the specificity of 0.6961 means about 70% of survivors were.
- A Kappa of 0.5715 indicates moderate agreement between predictions and actual outcomes beyond what chance would produce.
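The headline accuracy can also be verified directly, without confusionMatrix() (a quick optional check):
# Fraction of test passengers classified correctly
mean(predictions == test_data$Survived) # Should be about 0.80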
This plot helps us visually explore survival patterns across gender and class. Here’s the code:
# Load ggplot2 (already part of tidyverse)
library(ggplot2)
# Plot: Survival by Gender and Pclass
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
geom_bar(position = "dodge") + # Side-by-side bars for 0 and 1
facet_wrap(~ Pclass) + # Separate plots for each passenger class
labs(
title = "Survival by Gender and Passenger Class",
x = "Gender",
y = "Count of Passengers",
fill = "Survived"
) +
theme_minimal() # Clean visual theme
The code produces a bar chart of survivor and non-survivor counts by gender, with one panel per passenger class. The graph shows that:
- In 1st and 2nd class, far more females survived than died, while male deaths outnumbered male survivals in every class.
- 3rd-class males form the largest group of non-survivors, reflecting the lowest survival chances of any group.
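Because the groups differ in size, raw counts can be hard to compare. A small variant of the plot (an optional addition, not in the original code) uses position = "fill" to show survival proportions instead:
# Same plot, but with bars normalized to show proportions
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") + # Each bar sums to 1 within its group
  facet_wrap(~ Pclass) +
  labs(
    title = "Survival Rate by Gender and Passenger Class",
    x = "Gender",
    y = "Proportion of Passengers",
    fill = "Survived"
  ) +
  theme_minimal()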
In this Titanic Survival Prediction project, we used R in Google Colab to build a Random Forest classification model that predicts whether a passenger survived based on features like sex, age, passenger class (Pclass), and fare.
We split the dataset into 70% training and 30% testing sets, and after training, the model achieved an accuracy of 80.08% on the test set. Since the confusion matrix treats 0 (did not survive) as the positive class, the sensitivity of 86.6% reflects correctly identified non-survivors, while the specificity of 69.6% reflects correctly identified survivors.
Visual analysis showed that survival rates were higher for females in the 1st and 2nd classes, while males in the 3rd class had the lowest survival chances.
Colab Link:
https://colab.research.google.com/drive/1-_OtIvSfpmqRnPYvKhBSVNvVua1nsbk7#scrollTo=olqlOoLY-PNS