Car Data Analysis Project Using R
By Rohit Sharma
Updated on Jul 28, 2025 | 16 min read | 1.49K+ views
In this car data analysis project, you'll analyze a car dataset using R programming in Google Colab. This blog explains each step and provides the code along with its output.
We'll apply a range of techniques, including data cleaning, exploratory analysis, statistical visualization with ggplot2, correlation analysis, and predictive modeling.
The project also produces visualizations and surfaces insights about automotive performance patterns.
Before starting the car data analysis project, it helps to have a few basics in place: familiarity with R syntax and data frames, and a Google account to use Colab. With those, you can work smoothly through the analysis and modeling process.
The tools and R libraries used in this car data analysis project are listed in the table below, along with why each one is needed. Together they help the project run smoothly.
Tool/Library | Purpose | Why It's Used |
Google Colab | Cloud-based R environment | No local installation required, free access, easy sharing |
R Programming Language | Statistical computing and analysis | Industry-standard for data science and statistical analysis |
CSV Files | Data storage format | Simple, universal format for storing tabular data |
tidyverse | Data manipulation toolkit | Data cleaning, filtering, and transformation |
ggplot2 | Static data visualization | Creating histograms, scatter plots, and box plots |
corrplot | Correlation visualization | Generating correlation heatmaps and matrices |
knitr | Document formatting | Creating formatted tables and reports |
DT | Interactive data tables | Displaying datasets in user-friendly format |
The overall project takes about 3-5 hours. A breakdown of the timeline is given below, though the time required may vary depending on your skill level.
Phase | Time Required | Details |
Setup & Installation | 15-20 minutes | Installing libraries, setting up Google Colab |
Core Analysis | 2-3 hours | Data cleaning, visualization, correlation analysis |
Advanced Features | 1-2 hours | Predictive modeling, report generation |
Total Project Time | 3-5 hours | Complete end-to-end implementation |
Difficulty Level
Beginner to Intermediate
Read This: Benefits of Learning R: Why It’s Essential for Data Science
In this section, we'll break down the car data analysis project step by step, with the code for each step and its corresponding output.
Google Colab runs Python by default, so we first need to switch the notebook's runtime to R. This lets us run R scripts directly in the Colab notebook.
Here's how you can do it: open a new notebook, go to Runtime > Change runtime type, select R as the runtime type, and save.
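Once the runtime is switched to R, a quick sanity check (a minimal sketch, run in a notebook cell) confirms the notebook is executing R rather than Python:

```r
# Quick sanity check: this cell only runs if the notebook's runtime is R
cat("Running:", R.version.string, "\n")   # prints the installed R version
```

If this cell errors out, the runtime is still set to Python and needs to be changed before continuing.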
In this step, we’ll install and load the essential libraries in R that’ll be used for this project. The code for this step is given below:
# Step 2: Install and load essential libraries
# Think of libraries as toolboxes – each one gives us special functions!
# Install packages (only need to do this once)
install.packages(c("tidyverse", "ggplot2", "corrplot", "plotly", "knitr", "DT"))
# Load the libraries (do this every time you start)
library(tidyverse) # Swiss army knife for data manipulation
library(ggplot2) # Creates beautiful charts and graphs
library(corrplot) # Makes correlation heatmaps
library(plotly) # Interactive visualizations
library(knitr) # Pretty table formatting
library(DT) # Interactive data tables
# Print success message
cat("All libraries loaded successfully! Ready to explore data!\n")
After all the libraries are installed and loaded, we’ll get the output like this:
Installing packages into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
also installing the dependencies 'lazyeval', 'crosstalk'

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

corrplot 0.95 loaded

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2': last_plot
The following object is masked from 'package:stats': filter
The following object is masked from 'package:graphics': layout

All libraries loaded successfully! Ready to explore data!
Must Read: Best R Libraries Data Science: Tools for Analysis, Visualization & ML
Now that all the libraries are installed and loaded, it’s time for us to upload and read the dataset that we’ll work with. The code for this step is:
# Step 3: Upload your dataset to Google Colab
# Click the folder icon on the left sidebar, then upload your mtcars.csv file
# Load the dataset into R
# Think of this as opening your Excel file in R
mtcars_data <- read.csv("mtcars.csv")
# Let's take a first look at our data
cat("📊 Dataset loaded! Here's what we have:\n")
print(paste("Number of cars:", nrow(mtcars_data))) # Prints number of rows (cars)
print(paste("Number of features:", ncol(mtcars_data))) # Prints number of columns (features)
# Display the first few rows (like previewing a book)
head(mtcars_data)
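If you don't have an mtcars.csv file handy, note that the mtcars dataset ships with base R, so you can build the same data frame without uploading anything. The only difference is that the car names live in the row names, so this sketch copies them into a model column to match the CSV layout used above:

```r
# Fallback: construct mtcars_data from R's built-in mtcars dataset
mtcars_data <- mtcars
mtcars_data$model <- rownames(mtcars_data)   # car names are stored as row names
rownames(mtcars_data) <- NULL
# Put the model column first, matching the CSV version
mtcars_data <- mtcars_data[, c("model", setdiff(names(mtcars_data), "model"))]
head(mtcars_data)
```

Everything after this step works identically whichever way the data was loaded.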
The output of the above code gives us a first glimpse of what the dataset looks like:
📊 Dataset loaded! Here's what we have:
[1] "Number of cars: 32"
[1] "Number of features: 12"
Also Read: Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!
Before starting the analysis, we need to understand what the data in each column means. In this step, we’ll identify the column names and build a simple data dictionary to describe them. The code for this section is given below:
# Step 4: Get to know your data – like meeting new friends!
# What columns do we have?
cat("Column names in our dataset:\n")
colnames(mtcars_data) # Prints all column names
# What do these columns mean? Let's create a data dictionary
data_dictionary <- data.frame(
Column = c("model", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"),
Description = c("Car model name",
"Miles per gallon (fuel efficiency)",
"Number of cylinders",
"Engine displacement (cubic inches)",
"Horsepower",
"Rear axle ratio",
"Weight (1000 lbs)",
"Quarter mile time (seconds)",
"Engine shape (0=V-shaped, 1=straight)",
"Transmission (0=automatic, 1=manual)",
"Number of gears",
"Number of carburetors")
)
# Display our data dictionary
knitr::kable(data_dictionary, caption = "📖 What Each Column Means") # Nicely formats and displays the dictionary as a table
This step outputs the column names along with a table describing each one.
Column names in our dataset:
'model' 'mpg' 'cyl' 'disp' 'hp' 'drat' 'wt' 'qsec' 'vs' 'am' 'gear' 'carb'
In this step, we’ll organize and format the data before we begin analysis. We need to check the data for missing values and also ensure that the data types are correct. The code for cleaning the data is given below:
# Step 5: Clean our data – like organizing your room!
# Check for missing values (empty cells)
cat("Checking for missing data:\n")
missing_data <- sum(is.na(mtcars_data)) # Count total missing values
print(paste("Total missing values:", missing_data)) # Print the count
# Look at the structure of our data
str(mtcars_data) # Shows data types and column structure
# Convert categorical variables to factors (R's way of handling categories)
mtcars_data$vs <- factor(mtcars_data$vs, labels = c("V-shaped", "Straight")) # Engine shape as labels
mtcars_data$am <- factor(mtcars_data$am, labels = c("Automatic", "Manual")) # Transmission type as labels
mtcars_data$cyl <- factor(mtcars_data$cyl) # Convert cylinder count to category
mtcars_data$gear <- factor(mtcars_data$gear) # Convert number of gears to category
mtcars_data$carb <- factor(mtcars_data$carb) # Convert carburetors to category
cat("Data cleaning complete! Variables are properly formatted.\n") # Confirmation message
After running this code and checking and cleaning the data, we get the output as follows:
Checking for missing data:
[1] "Total missing values: 0"
'data.frame':	32 obs. of  12 variables:
 $ model: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ mpg  : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl  : int  6 6 4 6 8 6 8 4 4 6 ...
 $ disp : num  160 160 108 258 360 ...
 $ hp   : int  110 110 93 110 175 105 245 62 95 123 ...
 $ drat : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt   : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec : num  16.5 17 18.6 19.4 17 ...
 $ vs   : int  0 0 1 1 0 1 0 1 1 1 ...
 $ am   : int  1 1 1 0 0 0 0 0 0 0 ...
 $ gear : int  4 4 4 3 3 3 3 4 4 4 ...
 $ carb : int  4 4 1 1 2 1 4 2 2 4 ...
Data cleaning complete! Variables are properly formatted.
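mtcars has no missing values, so nothing needs to be imputed here. For datasets that do have gaps, a common pattern is to fill numeric columns with their median. The snippet below is a hedged sketch in base R (so it works even before the libraries load); it only changes the data if missing values are actually present:

```r
# Only needed if missing values are present; mtcars has none, so this is a no-op
if (any(is.na(mtcars_data))) {
  for (col in names(mtcars_data)) {
    if (is.numeric(mtcars_data[[col]])) {
      med <- median(mtcars_data[[col]], na.rm = TRUE)       # column median, ignoring NAs
      mtcars_data[[col]][is.na(mtcars_data[[col]])] <- med  # fill the gaps with the median
    }
  }
}
```

Median imputation is a simple baseline; for serious modeling work you'd weigh alternatives such as dropping rows or model-based imputation.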
Must Read: What’s Special About Machine Learning?
This step helps you have a better view of the dataset using summary statistics. You can check key metrics like fuel efficiency and horsepower to understand important insights. The code for this step is:
# Step 6: Explore our data – like being a detective!
# Basic statistics summary
cat("Basic Statistics Summary:\n")
summary(mtcars_data) # Provides min, max, mean, and quartiles for each numeric column
# Let's look at fuel efficiency (mpg) – most important for car buyers!
cat("\n Fuel Efficiency Analysis:\n")
print(paste("Most fuel-efficient car:", mtcars_data$model[which.max(mtcars_data$mpg)],
"with", max(mtcars_data$mpg), "mpg")) # Finds car with highest mpg
print(paste("Least fuel-efficient car:", mtcars_data$model[which.min(mtcars_data$mpg)],
"with", min(mtcars_data$mpg), "mpg")) # Finds car with lowest mpg
# Let's see which cars have the most horsepower
cat("\n Power Analysis:\n")
print(paste("Most powerful car:", mtcars_data$model[which.max(mtcars_data$hp)],
"with", max(mtcars_data$hp), "horsepower")) # Finds car with highest horsepower
The output for this section will give us a basic summary of the dataset and also give the fuel efficiency and power analysis of the cars in the dataset.
Basic Statistics Summary:
    model                mpg        cyl         disp             hp
 Length:32          Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0
 Class :character   1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5
 Mode  :character   Median :19.20   8:14   Median :196.3   Median :123.0
                    Mean   :20.09          Mean   :230.7   Mean   :146.7
                    3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0
                    Max.   :33.90          Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs            am
 Min.   :2.760   Min.   :1.513   Min.   :14.50   V-shaped:18   Automatic:19
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   Straight:14   Manual   :13
 Median :3.695   Median :3.325   Median :17.71
 Mean   :3.597   Mean   :3.217   Mean   :17.85
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
 Max.   :4.930   Max.   :5.424   Max.   :22.90
 gear   carb
 3:15   1: 7
 4:12   2:10
 5: 5   3: 3
        4:10
        6: 1
        8: 1

Fuel Efficiency Analysis:
[1] "Most fuel-efficient car: Toyota Corolla with 33.9 mpg"
[1] "Least fuel-efficient car: Cadillac Fleetwood with 10.4 mpg"

Power Analysis:
[1] "Most powerful car: Maserati Bora with 335 horsepower"
In this step, we’ll use the relevant data and create graphs and charts to understand the data better. The code for this step is given in the code block below:
# Step 7: Create beautiful charts – turn numbers into pictures!
# Chart 1: Fuel efficiency distribution
ggplot(mtcars_data, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black", alpha = 0.7) + # Histogram of MPG
labs(title = "Distribution of Fuel Efficiency (MPG)", # Main chart title
subtitle = "How fuel-efficient are these cars?", # Subheading
x = "Miles per Gallon (MPG)", # X-axis label
y = "Number of Cars") + # Y-axis label
theme_minimal() + # Clean theme
theme(plot.title = element_text(size = 16, face = "bold")) # Bold title styling
# Chart 2: Horsepower vs Fuel Efficiency
ggplot(mtcars_data, aes(x = hp, y = mpg)) +
geom_point(size = 3, alpha = 0.7, color = "red") + # Scatterplot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear trend line
labs(title = "Power vs Efficiency: The Trade-off", # Chart title
subtitle = "Do more powerful cars use more fuel?", # Subheading
x = "Horsepower", # X-axis label
y = "Miles per Gallon (MPG)") + # Y-axis label
theme_minimal() # Clean look
# Chart 3: Transmission type comparison
ggplot(mtcars_data, aes(x = am, y = mpg, fill = am)) +
geom_boxplot(alpha = 0.7) + # Boxplot by transmission type
labs(title = "🔧 Manual vs Automatic: Fuel Efficiency Battle", # Chart title
x = "Transmission Type", # X-axis label
y = "Miles per Gallon (MPG)", # Y-axis label
fill = "Transmission") + # Legend label
theme_minimal() + # Clean theme
scale_fill_manual(values = c("orange", "green")) # Custom colors for boxes
This section produces three plots: the first shows the distribution of fuel efficiency, the second compares horsepower against fuel efficiency, and the third compares fuel efficiency across transmission types.
`geom_smooth()` using formula = 'y ~ x'
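The third chart suggests manual cars get better mileage, but a boxplot alone doesn't tell us whether that gap is statistically meaningful. As an optional extension (not part of the original steps), a Welch two-sample t-test can check this, since am was converted to a factor in the cleaning step:

```r
# Does transmission type make a significant difference in MPG?
t_result <- t.test(mpg ~ am, data = mtcars_data)  # Welch t-test by default
print(t_result)                                    # group means, confidence interval, p-value
cat("p-value:", round(t_result$p.value, 4), "\n")
```

A p-value below 0.05 would indicate the difference in mean MPG between automatic and manual cars is unlikely to be due to chance.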
Must Know: What is Data Wrangling? Exploring Its Role in Data Analysis
In this step, we will analyze how different variables in the dataset relate to each other using a correlation heatmap. This will help us identify strong patterns and dependencies, which can guide further analysis or modeling if necessary. The code for this step is given below:
# Step 8: Find relationships between variables – like finding patterns!
# Select only numeric columns for correlation
numeric_data <- mtcars_data %>%
select_if(is.numeric) %>% # Keep only numeric columns
select(-matches("model")) # Remove the model column if it's present
# Create correlation matrix
correlation_matrix <- cor(numeric_data) # Compute pairwise correlations
# Visualize correlations with a heatmap
corrplot(correlation_matrix,
method = "color", # Use color blocks to show strength
type = "upper", # Show only the upper triangle
order = "hclust", # Cluster similar variables together
tl.cex = 0.8, # Size of variable labels
tl.col = "black", # Label color
title = "Correlation Heatmap: How Variables Relate") # Chart title
# Find strongest correlations
cat("🔗 Strongest Relationships:\n")
# Convert correlation matrix to find top correlations
cor_pairs <- which(abs(correlation_matrix) > 0.7 & upper.tri(correlation_matrix), arr.ind = TRUE) # Keep strong pairs from the upper triangle only, so each pair prints once
# Loop through and print variable pairs with high correlation
for(i in 1:nrow(cor_pairs)) {
row_var <- rownames(correlation_matrix)[cor_pairs[i,1]]
col_var <- colnames(correlation_matrix)[cor_pairs[i,2]]
cor_value <- round(correlation_matrix[cor_pairs[i,1], cor_pairs[i,2]], 3)
print(paste(row_var, "and", col_var, "correlation:", cor_value)) # Print each strong correlation pair
}
The output of this section is a heatmap of the correlations between the numeric variables.
The graph can be interpreted as follows: weight (wt), displacement (disp), and horsepower (hp) all have strong negative correlations with mpg, meaning heavier and more powerful cars tend to be less fuel-efficient, while disp, hp, and wt are strongly positively correlated with one another.
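To see at a glance which variables matter most for fuel efficiency, you can pull the mpg row out of the correlation matrix computed above and sort it. This is a small follow-up sketch reusing correlation_matrix from the previous code block:

```r
# Rank every numeric variable by its correlation with mpg
mpg_cors <- correlation_matrix["mpg", ]
mpg_cors <- sort(mpg_cors[names(mpg_cors) != "mpg"])  # drop mpg's correlation with itself
print(round(mpg_cors, 2))  # most negative (wt, disp, hp) first, positives last
```

The ordering makes the heatmap's story explicit: weight is the single strongest (negative) predictor of fuel efficiency in this dataset.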
Read This: Machine Learning with R: Everything You Need to Know
In this step, we’ll group cars based on the number of cylinders. This helps understand how engine size affects fuel efficiency, horsepower, and weight. We’ll also plot the MPG distribution across these groups. The code for this step is:
# Step 9: Dig deeper – advanced insights!
# Group analysis by number of cylinders
cylinder_analysis <- mtcars_data %>%
group_by(cyl) %>% # Group data by cylinder count
summarise(
count = n(), # Number of cars in each group
avg_mpg = round(mean(mpg), 2), # Average miles per gallon
avg_hp = round(mean(hp), 2), # Average horsepower
avg_weight = round(mean(wt), 2), # Average weight
.groups = 'drop' # Drop grouping structure
)
# Display summary table
cat("Analysis by Number of Cylinders:\n")
knitr::kable(cylinder_analysis, caption = "Performance by Engine Size") # Nicely format the table
# Create a comprehensive comparison chart
ggplot(mtcars_data, aes(x = cyl, y = mpg, fill = cyl)) +
geom_violin(alpha = 0.7) + # Violin plot shows MPG distribution shape
geom_boxplot(width = 0.2, alpha = 0.8) + # Boxplot adds summary stats (median, quartiles)
labs(title = "Fuel Efficiency by Engine Size", # Chart title
subtitle = "Distribution of MPG across different cylinder counts", # Subheading
x = "Number of Cylinders", # X-axis label
y = "Miles per Gallon (MPG)") + # Y-axis label
theme_minimal() # Clean layout
The output for this step is as follows:
Analysis by Number of Cylinders:

Table: Performance by Engine Size

|cyl | count| avg_mpg| avg_hp| avg_weight|
|:---|-----:|-------:|------:|----------:|
|4   |    11|   26.66|  82.64|       2.29|
|6   |     7|   19.74| 122.29|       3.12|
|8   |    14|   15.10| 209.21|       4.00|
The above plot shows that the bigger the engine size, the lower the fuel efficiency.
Here we’ll build a linear regression model to predict MPG using weight (wt) and horsepower (hp), inspect how well it fits, generate predictions, and compare them to the actual values. The code for this step is as follows:
# Step 10: Predict fuel efficiency - become a fortune teller!
# Create a simple linear model to predict MPG based on weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars_data) # Fit a linear regression model
# Model summary
cat("Fuel Efficiency Prediction Model:\n")
summary(model) # View coefficients, p-values, R², etc.
# Make predictions for our existing cars
mtcars_data$predicted_mpg <- predict(model) # Add predicted MPG as a new column
# Compare actual vs predicted
comparison <- mtcars_data %>%
select(model, mpg, predicted_mpg) %>% # Keep only relevant columns
mutate(
predicted_mpg = round(predicted_mpg, 2), # Round predictions for readability
difference = round(mpg - predicted_mpg, 2) # Positive means model underestimated MPG
)
cat("\n Actual vs Predicted MPG (first 10 cars):\n")
head(comparison, 10) %>% knitr::kable() # Display first 10 comparisons as a neat table
Running this code prints the model summary and a table comparing actual versus predicted MPG for the first 10 cars. The output is:
Fuel Efficiency Prediction Model:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars_data)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8148
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Actual vs Predicted MPG (first 10 cars):
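The same fitted model can also score cars that aren't in the dataset. As an illustration with made-up specs (a hypothetical 3,000 lb, 150 hp car), predict() accepts a new data frame containing the model's predictor columns:

```r
# Predict MPG for a hypothetical new car: 3,000 lbs (wt is in 1000s of lbs) and 150 hp
new_car <- data.frame(wt = 3.0, hp = 150)
predicted <- predict(model, newdata = new_car)
cat("Predicted MPG for the new car:", round(predicted, 2), "\n")
```

Plugging the coefficients from the summary above into 37.23 − 3.88 × 3.0 − 0.032 × 150 gives roughly 20.8 MPG, which is what predict() returns.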
In this Car Data Analysis project, we built a simple linear regression model using R in Google Colab to predict fuel efficiency (MPG) based on car weight and horsepower.
After uploading and cleaning the classic mtcars dataset, we created statistical summaries, visualized trends, and examined correlations between variables. The final model explained approximately 83% of the variation in MPG, with a residual standard error of 2.59, indicating a strong fit for such a simple model.
This car data analysis project involved key concepts like data wrangling, visualization, correlation analysis, and predictive modeling using automotive data.
Colab Link:
https://colab.research.google.com/drive/18HvmvopmAZOMC4O8cO3j3pPuSNiSZCrA#scrollTo=0A1dQbc1m-2f
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi.