Titanic Survival Prediction in R: Complete Guide with Code

By Rohit Sharma

Updated on Aug 05, 2025

This project focuses on predicting survival on the Titanic using R in Google Colab. It is designed for beginners who want to start learning R. This blog outlines all the steps involved, including setting up the environment, cleaning the dataset, building a Random Forest model, and evaluating its performance.

The Titanic dataset includes passenger details like age, sex, class, and embarkation point, which are used to predict survival outcomes. The project also includes a clear visualization of survival patterns by gender and class, making the insights more intuitive.


Improve Your Programming Skills With: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Much Time and Skill Does This Titanic Survival Prediction Project Require?

This Titanic survival prediction project is an entry-level project. The timeline and skill levels are mentioned in the table below.

Aspect      | Details
Duration    | 2 to 4 hours (including setup and modeling)
Difficulty  | Beginner-friendly
Skill Level | Basic knowledge of R and machine learning concepts


Key Concepts to Know Before Building This Titanic Survival Prediction Model in R

Before starting this project, it’s helpful to be familiar with the following concepts:

  • R Basics: Understanding data frames, variables, and running code in R or Google Colab.
  • Data Cleaning: Ability to identify and handle missing or irrelevant data.
  • Data Types: Knowing how to work with categorical and numeric variables.
  • Classification Logic: Basic knowledge of how classification models work.
  • Fundamentals of Random Forest: Awareness of how decision trees combine to make predictions.
  • Evaluation Metrics: Interpreting confusion matrices, accuracy, recall, and precision.

Libraries and Tools Used in This Titanic Survival Prediction Project

The following libraries and tools come in handy while building this project and help it run smoothly.

Library/Tool    | Purpose
tidyverse       | Data manipulation (dplyr) and visualization (ggplot2)
caret           | Data splitting, model training, and evaluation
randomForest    | Building the Random Forest classification model
e1071           | Provides additional machine learning functions (e.g., SVM)
Google Colab (R)| Cloud-based coding environment to run R code

A Walkthrough of the Titanic Survival Prediction Project in R

Below is a breakdown of the entire Titanic Survival Prediction project, ideal for beginners working in R using Google Colab. Each section will explain what is happening and include code with comments to help you understand the process better.

Step 1: Preparing Your Environment to Run R Code in Colab

Since this project is made using the R programming language, the first step is to ensure our Colab notebook is configured to interpret R code. Google Colab supports multiple languages, and selecting R helps us directly use R syntax, libraries, and visualizations without additional setup.

To do this, start a new Colab notebook and change the default runtime settings from Python to R. Once that’s done, save the changes, and we’re ready to move on with the next step. 
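A quick sanity check that the runtime actually switched: run any R expression in a cell. For example, printing the interpreter version confirms the notebook is executing R rather than Python.

# If this prints an R version string, the runtime is set up correctly
R.version.string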

Read More: Customer Segmentation Project Using R: A Step-by-Step Guide

Step 2: Install and Load the Required R Libraries

Before building the model, we need to install and load the core R libraries that will help us handle data, build machine learning models, and evaluate their performance. We need to install these packages only once. The code to install and load the libraries is given below:

# Install required packages (run only once)
install.packages("tidyverse")     # For data manipulation and visualization
install.packages("caret")         # For machine learning
install.packages("randomForest")  # Random Forest model
install.packages("e1071")         # For SVM and other classifiers

# Load the packages into memory
library(tidyverse)                # Loads ggplot2, dplyr, etc.
library(caret)                    # Tools for model training and evaluation
library(randomForest)             # Random Forest algorithm
library(e1071)                    # Support Vector Machines and more

Once the libraries are loaded, you’re set to work with the dataset. The output for the above code is:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin
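If you plan to rerun the notebook, reinstalling everything each time is slow. A small sketch (same four packages as above) that installs only what is missing before loading:

# Install only missing packages, then load them all
pkgs <- c("tidyverse", "caret", "randomForest", "e1071")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))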

Step 3: Import the Titanic Dataset

Once the libraries are loaded, the next step is to load the Titanic dataset into your R session. This dataset contains information about passengers, which we’ll use to train our survival prediction model. Viewing the top rows helps us confirm that the data was read correctly and gives an initial look at its structure. The code for this step is given below:

# Read the Titanic dataset
titanic_data <- read.csv("Titanic-Dataset.csv")  # Load the CSV file into a data frame

# View the first few rows of the data
head(titanic_data)  # Display the top rows to preview the dataset

The table below gives us a sneak peek of the dataset (empty Cabin entries are missing values).

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
<int> | <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr>
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 |  | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 |  | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 |  | Q
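A note on file availability: read.csv() expects Titanic-Dataset.csv to be in the working directory, so upload it through Colab's Files panel first. If you would rather fetch it in code, here is a sketch using a commonly mirrored public copy of the dataset; the URL is an assumption, not the file used in this article, though the columns match the classic Kaggle training set.

# Download a public mirror of the Titanic training data if the file is absent
# (assumption: this datasciencedojo mirror stays available)
if (!file.exists("Titanic-Dataset.csv")) {
  download.file(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    destfile = "Titanic-Dataset.csv"
  )
}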


Here’s an R Project: How to Build an Uber Data Analysis Project in R

Step 4: Explore the Structure and Quality of the Data

Before we start cleaning or modeling, we need to see what the dataset looks like. In this step, we will check the data types, get summary statistics, and identify any missing values that may need attention later. The code for this step is:

# Check structure: column names, data types, and first few records
str(titanic_data)  # Helps understand the format of each column

# Summary statistics to understand numerical columns
summary(titanic_data)  # View min, max, mean, median, and NA counts

# Check for missing values in each column
colSums(is.na(titanic_data))  # Count how many NAs exist in each column

The output for this section is given below:

'data.frame': 891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

  PassengerId       Survived          Pclass          Name
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character
 Mean   :446.0   Mean   :0.3838   Mean   :2.309
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :891.0   Max.   :1.0000   Max.   :3.000

     Sex                 Age            SibSp           Parch
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000
                    NA's   :177

    Ticket               Fare           Cabin             Embarked
 Length:891         Min.   :  0.00   Length:891         Length:891
 Class :character   1st Qu.:  7.91   Class :character   Class :character
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character
                    Mean   : 32.20
                    3rd Qu.: 31.00
                    Max.   :512.33

PassengerId    Survived      Pclass        Name         Sex         Age
          0           0           0           0           0         177
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          0           0           0           0           0           0

The above output means that:

  • The dataset has 891 rows (passengers) and 12 columns (features like Age, Sex, Fare, etc.).
  • Columns have mixed data types: numeric (Age, Fare), integer (Survived, Pclass), and character (Name, Sex, Cabin).
  • The Age column has 177 missing values that need to be handled before modeling.

Step 5: Create a Copy of the Dataset for Cleaning

This step ensures we preserve the original dataset by working on a duplicate, so we can safely modify and preprocess the data. The code for this step is:

# Make a copy to work on
titanic_clean <- titanic_data   # Creating a duplicate of the dataset for cleaning

# View column names
names(titanic_clean)            # Displaying all column names in the dataset

The output for the above code gives us the column names of the dataset:

  • 'PassengerId'
  • 'Survived'
  • 'Pclass'
  • 'Name'
  • 'Sex'
  • 'Age'
  • 'SibSp'
  • 'Parch'
  • 'Ticket'
  • 'Fare'
  • 'Cabin'
  • 'Embarked'

Try Building This R Project: Wine Quality Prediction Project in R

Step 6: Drop Irrelevant Columns and Handle Missing Data

This step focuses on cleaning the dataset by removing non-informative columns and handling missing values in critical features like Age and Embarked. The code for this step is:

# Drop columns that are not useful for prediction
# (like PassengerId, Name, Ticket, Cabin)
titanic_clean <- titanic_clean %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)

# Handle missing Age: replace with median age
titanic_clean$Age[is.na(titanic_clean$Age)] <- median(titanic_clean$Age, na.rm = TRUE)

# Handle missing Embarked: replace with mode (most common value)
titanic_clean$Embarked[is.na(titanic_clean$Embarked)] <- names(sort(table(titanic_clean$Embarked), decreasing = TRUE))[1]

# Check again for missing values
colSums(is.na(titanic_clean))  # Confirm no NA values remain

The above code cleans the data and removes irrelevant columns. The output for this step is:

Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked
       0        0        0        0        0        0        0        0
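One caveat: in this CSV, the missing Embarked values are stored as empty strings rather than NA, so the is.na() check above leaves them untouched (you will notice a stray "" factor level in Step 8's output). A sketch that also treats blanks as missing, run before the factor conversion in Step 7:

# Impute blank or NA Embarked entries with the most common port
blank <- is.na(titanic_clean$Embarked) | titanic_clean$Embarked == ""
titanic_clean$Embarked[blank] <- names(which.max(table(titanic_clean$Embarked)))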

Step 7: Convert Categorical Variables into Factor Types

In R, classification models expect categorical variables to be explicitly encoded as factors rather than numbers or strings. This step will ensure that variables like Survived, Pclass, Sex, and Embarked are correctly formatted. The code for this step is:


# Convert categorical variables to factors
titanic_clean$Survived <- as.factor(titanic_clean$Survived)   # Target variable
titanic_clean$Pclass   <- as.factor(titanic_clean$Pclass)     # Passenger class
titanic_clean$Sex      <- as.factor(titanic_clean$Sex)        # Gender
titanic_clean$Embarked <- as.factor(titanic_clean$Embarked)   # Port of embarkation

The above step:

  • Converts key columns (Survived, Pclass, Sex, Embarked) into factors to represent categorical data.
  • Helps machine learning models understand these columns as categories, not numbers or strings.
  • Ensures correct data type handling for classification tasks.
  • Prepares the dataset for modeling steps like training a decision tree or random forest.
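For reference, the same four conversions can be written in one step with dplyr's across(), which is available in dplyr 1.0 and later (already loaded via tidyverse):

# One-line equivalent of the four as.factor() calls above
titanic_clean <- titanic_clean %>%
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor))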

Step 8: Confirm Structure After Cleaning

This section helps verify that all previous cleaning steps have been correctly applied and the data is ready for modeling. Here’s the code:

# View structure again to confirm changes

str(titanic_clean)

The output confirms that the dataset now has the structure we need for modeling:

'data.frame': 891 obs. of  8 variables:
 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age     : num  22 38 26 35 35 28 54 2 27 14 ...
 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Must Read: R For Data Science: Why Should You Choose R for Data Science?

Step 9: Split the Data into Training and Testing Sets

This step prepares the dataset for model training and evaluation by dividing it into two parts, one for training and one for testing. The code for this step is:

# Set seed to get the same result every time
set.seed(123)

# Create the split (70% train, 30% test)
train_index <- createDataPartition(titanic_clean$Survived, p = 0.7, list = FALSE)

# Split the data
train_data <- titanic_clean[train_index, ]
test_data  <- titanic_clean[-train_index, ]

The above step:

  • Sets a random seed to ensure reproducible results.
  • Splits the dataset: 70% for training the model and 30% for testing it.
  • createDataPartition ensures that the distribution of the target variable (Survived) remains consistent in both sets.
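A quick way to verify that the stratified split worked is to compare the survival rate in each subset; the proportions should be nearly identical:

# Class balance check: both splits should show roughly 62% / 38%
prop.table(table(train_data$Survived))
prop.table(table(test_data$Survived))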

Step 10: Train the Random Forest Model

In this step, we train a Random Forest classifier using the training data to predict whether a passenger survived or not. Here’s the code:

# Train the Random Forest model
set.seed(123)  # For reproducibility

rf_model <- randomForest(Survived ~ ., data = train_data, ntree = 100)

# View the model summary
print(rf_model)

The output for this code is:

Call:
 randomForest(formula = Survived ~ ., data = train_data, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 16.64%
Confusion matrix:
    0   1 class.error
0 344  41   0.1064935
1  63 177   0.2625000

The above output shows that:

  • The model built 100 decision trees to classify whether a passenger survived or not.
  • It predicts correctly about 83% of the time (OOB error rate is 16.64%).
  • It performs better at predicting non-survivors (class 0) than survivors (class 1).
  • Survivors are harder to predict accurately, with about 26% being misclassified.
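Random Forests also report which predictors contribute most to the splits. Both functions below ship with the randomForest package, so this is a natural addition at this point:

# Mean decrease in Gini impurity per predictor (higher = more important)
importance(rf_model)

# Dot plot of the same importance scores
varImpPlot(rf_model)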

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 11: Making Predictions on Test Data

This step uses the trained Random Forest model to make survival predictions on the unseen test data. We also preview the first few predicted values. Here’s the code:

# Predict on test data using the trained model
predictions <- predict(rf_model, newdata = test_data)

# Show first few predictions to check output
head(predictions)

The output is a short factor vector of 0s and 1s (rendered as an image in the notebook). It means that:

  • The output shows the first few predictions made by the model.
  • Each value (0 or 1) represents if a passenger did not survive (0) or survived (1).
  • The numbers on the left are just the row numbers.
  • “Levels: '0' '1'” means this is a factor variable with two possible outcomes.
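If you need scores rather than hard labels (for example, to pick a different decision threshold), predict() can also return class probabilities for a classification forest:

# Probability of each class instead of a 0/1 label
pred_probs <- predict(rf_model, newdata = test_data, type = "prob")
head(pred_probs)  # Columns "0" and "1" sum to 1 for each passenger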

Step 12: Evaluate the Model Using a Confusion Matrix

In this step, we check how well our Random Forest model performed by comparing its predictions with the actual survival values. Here’s the code:

# Confusion matrix to compare predictions vs actual outcomes
confusionMatrix(predictions, test_data$Survived)

Here’s the output of the above code:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 142  31
         1  22  71

               Accuracy : 0.8008
                 95% CI : (0.7476, 0.847)
    No Information Rate : 0.6165
    P-Value [Acc > NIR] : 7.74e-11

                  Kappa : 0.5715

 Mcnemar's Test P-Value : 0.2718

            Sensitivity : 0.8659
            Specificity : 0.6961
         Pos Pred Value : 0.8208
         Neg Pred Value : 0.7634
             Prevalence : 0.6165
         Detection Rate : 0.5338
   Detection Prevalence : 0.6504
      Balanced Accuracy : 0.7810

       'Positive' Class : 0

The above output shows that:

  • Accuracy = 80%
    This means the model correctly predicted survival for 80% of the test cases.
  • Sensitivity = 86.6%
    Of all the people who actually did not survive (class 0), the model correctly identified 86.6% of them.
  • Specificity = 69.6%
    Of all the people who actually survived (class 1), the model correctly predicted 69.6% of them.
  • Kappa = 0.5715
    This shows moderate agreement between the model and the actual values beyond chance; higher is better (the range is -1 to 1).
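As a cross-check on confusionMatrix(), the headline accuracy can also be computed directly:

# Fraction of test passengers classified correctly; should match 0.8008
mean(predictions == test_data$Survived)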

Step 13: Visualize Survival by Gender and Passenger Class

This plot helps us visually explore survival patterns across gender and class. Here’s the code:

# Load ggplot2 (already part of tidyverse)
library(ggplot2)

# Plot: Survival by Gender and Pclass
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "dodge") +   # Side-by-side bars for 0 and 1
  facet_wrap(~ Pclass) +           # Separate plots for each passenger class
  labs(
    title = "Survival by Gender and Passenger Class",
    x = "Gender",
    y = "Count of Passengers",
    fill = "Survived"
  ) +
  theme_minimal()                  # Clean visual theme

The above code produces a faceted bar chart (rendered in the notebook). The graph shows that:

  • Each panel (1, 2, 3) represents a Passenger Class (Pclass): 1st, 2nd, and 3rd class.
  • Bars are grouped by Gender, and color shows survival (0 = did not survive, 1 = survived).
  • Females had higher survival rates than males across all classes, especially in 1st and 2nd class.
  • 3rd class males had the lowest survival, with a very high number of deaths (tall red bar).
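Raw counts can hide rates when group sizes differ (3rd class carried far more passengers). A variant worth trying swaps position = "dodge" for position = "fill", so each bar shows survival proportions instead of counts:

# Same plot, but bars are normalized so each gender/class bar sums to 1
ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +    # Stacked proportions instead of counts
  facet_wrap(~ Pclass) +
  labs(
    title = "Survival Rate by Gender and Passenger Class",
    x = "Gender",
    y = "Proportion of Passengers",
    fill = "Survived"
  ) +
  theme_minimal()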

Conclusion

In this Titanic Survival Prediction project, we used R in Google Colab to build a Random Forest classification model that predicts whether a passenger survived based on features like sex, age, passenger class (Pclass), and fare.

We split the dataset into 70% training and 30% testing sets, and after training, the model achieved an accuracy of 80.08%, with a sensitivity of 86.6% (correctly identifying non-survivors, the positive class here) and a specificity of 69.6% (correctly identifying survivors).

Visual analysis showed that survival rates were higher for females in the 1st and 2nd classes, while males in the 3rd class had the lowest survival chances.


Colab Link:
https://colab.research.google.com/drive/1-_OtIvSfpmqRnPYvKhBSVNvVua1nsbk7#scrollTo=olqlOoLY-PNS

