Predicting Travel Costs Using the Traveler Trip Dataset

By Rohit Sharma


Have you ever pondered how much a trip might cost depending on your age group, the destination, and the length of the trip? In this project, we'll examine the Kaggle traveler trip dataset to find trends in traveler behavior and use machine learning to forecast the overall cost of travel.

We will look at both demographic and trip-related features, such as age, nationality, destination, accommodation type, and transportation type. The objective is to build a regression model to predict the total cost of a trip, which travel agencies can use to design better travel packages.

For more project ideas like this one, check out our blog post on the Top 25+ Essential Data Science Projects GitHub to Explore in 2025.  

What You'll Need to Begin

Following along with this project requires no machine learning expertise, but a little background helps. You will want to know about the following:

  • Basic Python programming (variables, functions, importing libraries)
  • Working with pandas DataFrames to load and manipulate tabular data
  • A rough idea of what regression is: predicting a numeric value from input features

Technologies and Libraries That We Will Use

The project will use the following tools:

  • Python as the programming language, run in Google Colab (or any Jupyter environment)
  • kagglehub to download the dataset
  • pandas for loading and cleaning the data
  • matplotlib and seaborn for visualization
  • scikit-learn for the train-test split, regression models, and evaluation metrics
  • XGBoost for gradient-boosted regression
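Google Colab ships with most of these pre-installed. If you are running elsewhere, a quick check like the sketch below (not part of the original notebook; the package names assume PyPI defaults) flags anything missing:

import importlib.util

# Check that each required library is importable; anything reported as
# "missing" can be installed with pip (e.g., "pip install kagglehub").
for pkg in ["kagglehub", "pandas", "matplotlib", "seaborn", "sklearn", "xgboost"]:
    status = "OK" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")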

Models That Will Be Utilized

Here are the machine learning models that we will be utilizing:

Model Name | Description
Linear Regression | A straightforward model for understanding the linear relationships between cost and features.
Decision Tree Regressor | A tree-based model that predicts cost by splitting the data according to decision rules.
Random Forest Regressor | An ensemble of decision trees that reduces overfitting and improves prediction accuracy.
XGBoost Regressor | A strong gradient boosting model that handles big datasets and intricate relationships efficiently.

How to Build a Trip Cost Prediction Model with the Traveler Trip Dataset

Let’s start building the project from scratch. So, without wasting any more time, let’s begin!

Step 1: Downloading and Loading the Dataset

We will download the traveler trip dataset using the kagglehub library. The dataset contains information about travelers, their destinations, accommodation, transportation, and trip duration.

This is how you would do it in code:

import kagglehub

# Download the latest version of the Traveler Trip Dataset
path = kagglehub.dataset_download("rkiattisak/traveler-trip-data")
print("Path to dataset files:", path)

Output:

Downloading from https://www.kaggle.com/api/v1/datasets/download/rkiattisak/traveler-trip-data?dataset_version_number=1...

100%|██████████| 4.21k/4.21k [00:00<00:00, 4.47MB/s]Extracting files...

Path to dataset files: /root/.cache/kagglehub/datasets/rkiattisak/traveler-trip-data/versions/1
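The download path points to a folder, and Step 2 needs the exact CSV filename inside it. A quick directory listing reveals it (a small addition, not in the original notebook):

import os

# List the files kagglehub extracted so we know the CSV's exact filename
print(os.listdir(path))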

Step 2: Load the Traveler Trip Dataset

We will now load the dataset into a pandas DataFrame. Use the code below to do so:

import pandas as pd
import os

# Build the path to the CSV file inside the folder downloaded in Step 1
df = pd.read_csv(os.path.join(path, "Travel details dataset.csv"))

# Display the first 5 rows
df.head()

Output:

 

Trip ID | Destination | Start date | End date | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost
1 | London, UK | 5/1/2023 | 5/8/2023 | 7.0 | John Smith | 35.0 | Male | American | Hotel | 1200 | Flight | 600
2 | Phuket, Thailand | 6/15/2023 | 6/20/2023 | 5.0 | Jane Doe | 28.0 | Female | Canadian | Resort | 800 | Flight | 500
3 | Bali, Indonesia | 7/1/2023 | 7/8/2023 | 7.0 | David Lee | 45.0 | Male | Korean | Villa | 1000 | Flight | 700
4 | New York, USA | 8/15/2023 | 8/29/2023 | 14.0 | Sarah Johnson | 29.0 | Female | British | Hotel | 2000 | Flight | 1000
5 | Tokyo, Japan | 9/10/2023 | 9/17/2023 | 7.0 | Kim Nguyen | 26.0 | Female | Vietnamese | Airbnb | 700 | Train | 200

Step 3: Explore the Dataset

In this step, we will examine the traveler trip dataset to learn about its composition. This lets us verify data types, null values, and basic statistics before preprocessing.

Use the code given below to understand the dataset:

# Shape of the dataset (rows, columns)
print("Dataset shape:", df.shape)

# Data types and non-null info
print("\nInfo:")
df.info()

# Summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

Output:

Dataset shape: (139, 13)

Info:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 139 entries, 0 to 138

Data columns (total 13 columns):

 #   Column                Non-Null Count  Dtype  

---  ------                --------------  -----  

 0   Trip ID               139 non-null    int64  

 1   Destination           137 non-null    object 

 2   Start date            137 non-null    object 

 3   End date              137 non-null    object 

 4   Duration (days)       137 non-null    float64

 5   Traveler name         137 non-null    object 

 6   Traveler age          137 non-null    float64

 7   Traveler gender       137 non-null    object 

 8   Traveler nationality  137 non-null    object 

 9   Accommodation type    137 non-null    object 

 10  Accommodation cost    137 non-null    object 

 11  Transportation type   136 non-null    object 

 12  Transportation cost   136 non-null    object 

dtypes: float64(2), int64(1), object(10)

memory usage: 14.2+ KB

Summary statistics:

Missing values:

Trip ID                 0

Destination             2

Start date              2

End date                2

Duration (days)         2

Traveler name           2

Traveler age            2

Traveler gender         2

Traveler nationality    2

Accommodation type      2

Accommodation cost      2

Transportation type     3

Transportation cost     3

dtype: int64

What does the output tell us?

The output tells us that:

  • There are 139 rows and 13 columns
  • Most columns have 2 missing values
  • Transportation type and Transportation cost have 3 missing values
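Before deciding how to treat these gaps, it can help to look at the affected rows themselves. A minimal sketch (not part of the original notebook):

# Show every row that contains at least one missing value
print(df[df.isnull().any(axis=1)])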

Step 4: Data Cleaning

In this step, we will clean the cost columns. Accommodation cost and Transportation cost are stored as text (dtype object in the info output above), so we strip all non-numeric characters and convert the columns to floats. Use the code below to do so:

# Use regex to remove all non-numeric characters except dot (.)
df['Accommodation cost'] = df['Accommodation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)
df['Transportation cost'] = df['Transportation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)

# Confirm the conversion
print("Accommodation cost type:", df['Accommodation cost'].dtype)
print("Transportation cost type:", df['Transportation cost'].dtype)

Output:

Accommodation cost type: float64

Transportation cost type: float64

Step 5: Handle Missing Values

Now we have clean columns. Let's handle missing values. To do so, we will: 

  • Check the number of missing values per column.
  • Impute missing numeric fields, such as age and cost, using the median.
  • Impute missing categorical fields, such as gender and transportation type, using the mode.
  • Drop rows with any remaining missing values.

Here is the code to achieve all this:

# Check missing values
print(df.isnull().sum())

# Impute numerical columns using median
df['Accommodation cost'] = df['Accommodation cost'].fillna(df['Accommodation cost'].median())
df['Transportation cost'] = df['Transportation cost'].fillna(df['Transportation cost'].median())
df['Traveler age'] = df['Traveler age'].fillna(df['Traveler age'].median())

# Impute categorical columns using mode
df['Traveler gender'] = df['Traveler gender'].fillna(df['Traveler gender'].mode()[0])
df['Transportation type'] = df['Transportation type'].fillna(df['Transportation type'].mode()[0])

# Drop remaining rows with missing values
df.dropna(inplace=True)

# Final check
print("\nMissing values after cleaning:")
print(df.isnull().sum())

Output:

Trip ID                 0

Destination             2

Start date              2

End date                2

Duration (days)         2

Traveler name           2

Traveler age            2

Traveler gender         2

Traveler nationality    2

Accommodation type      2

Accommodation cost      0

Transportation type     3

Transportation cost     0

dtype: int64

 

Missing values after cleaning:

Trip ID                 0

Destination             0

Start date              0

End date                0

Duration (days)         0

Traveler name           0

Traveler age            0

Traveler gender         0

Traveler nationality    0

Accommodation type      0

Accommodation cost      0

Transportation type     0

Transportation cost     0

dtype: int64

Step 6: Feature Engineering

Now that the data is clean, we will improve model performance and analysis by extracting valuable features from existing columns. Our objective in this step is to:

  • Create a new column for trip duration. To do so, we will use start and end dates.
  • Extract day, month, and year from the start date. This is needed for time-based insights.

Note: We will handle encoding categorical variables later during the model preparation phase.

Use the code mentioned below to accomplish all this:

# Calculate duration (in days) from start and end dates
df['Start date'] = pd.to_datetime(df['Start date'])
df['End date'] = pd.to_datetime(df['End date'])
df['Duration (days)'] = (df['End date'] - df['Start date']).dt.days

# Extract features from Start date
df['Travel Month'] = df['Start date'].dt.month
df['Travel Year'] = df['Start date'].dt.year
df['Travel Day'] = df['Start date'].dt.day

# Optional: encode gender (if required for modeling)
# df['Traveler gender'] = df['Traveler gender'].map({'Male': 0, 'Female': 1})

# Preview the engineered features
print(df[['Start date', 'End date', 'Duration (days)', 'Travel Day', 'Travel Month', 'Travel Year']].head())

Output:

Start date   End date  Duration (days)  Travel Day  Travel Month  \

0 2023-05-01 2023-05-08                7           1             5   

1 2023-06-15 2023-06-20                5          15             6   

2 2023-07-01 2023-07-08                7           1             7   

3 2023-08-15 2023-08-29               14          15             8   

4 2023-09-10 2023-09-17                7          10             9   

 

   Travel Year  

0         2023  

1         2023  

2         2023  

3         2023  

4         2023  

Step 7: Exploratory Data Analysis (EDA)

In this step, we will use visualizations to identify trends in trip duration, monthly travel frequency, and cost relationships. These plots help us understand seasonality, preferences, and spending behavior.

Use the below-mentioned code:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 12))

# Plot 1: Distribution of Trip Durations
plt.subplot(3, 1, 1)
sns.histplot(df['Duration (days)'], bins=10, kde=True, color='skyblue')
plt.title("Distribution of Trip Durations")
plt.xlabel("Duration (days)")
plt.ylabel("Number of Trips")

# Plot 2: Number of Trips Per Month
plt.subplot(3, 1, 2)
sns.countplot(x='Travel Month', data=df, palette='Set2')
plt.title("Trips per Month")
plt.xlabel("Month")
plt.ylabel("Number of Trips")

# Plot 3: Accommodation vs Transportation Cost
plt.subplot(3, 1, 3)
sns.scatterplot(data=df, x='Accommodation cost', y='Transportation cost', hue='Travel Month', palette='viridis')
plt.title("Accommodation vs Transportation Cost")
plt.xlabel("Accommodation Cost")
plt.ylabel("Transportation Cost")
plt.legend(title='Month')

plt.tight_layout()
plt.show()

Output:

  • According to the first plot, most trips last six to eight days. Longer trips are rarer, and the distribution is right-skewed.
  • The middle plot shows that May through September (months 5–9) see the most trips, while March (3) and December (12) have moderate to low activity.
  • The bottom plot shows a positive trend: transportation costs tend to rise with accommodation costs. Some trips reach accommodation costs of up to $8,000 alongside transportation costs of around $3,000. Points are color-coded by month, and the months overlap across the whole cost range, so no single month stands out as a cost outlier.

Step 8: Examine the Numerical Features' Correlation

Having visualized travel trends, let's now examine the relationships between the numerical features. We will use a correlation matrix to study fields such as trip duration, traveler age, and the two cost columns. This helps locate highly correlated fields that could cause redundancy or influence the prediction model.

The code to accomplish this is as follows:

# Select only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# Generate correlation matrix
corr_matrix = numeric_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.show()

Output:

  • According to the heatmap, there is a moderate positive correlation (0.45) between Trip ID and transportation cost. Since Trip ID is just a row identifier, this likely reflects how the rows happen to be ordered rather than a meaningful relationship.
  • Accommodation cost and transportation cost are strongly positively correlated (0.79). Spending more on one typically goes with spending more on the other.
  • There is a weak positive correlation (0.35) between Trip ID and accommodation cost.
  • There is little to no correlation between traveler age and any other feature.

In summary, the two spending columns (transportation and accommodation) are more closely linked to each other than to age or length of stay.
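To read the exact coefficients behind the heatmap rather than eyeballing the colors, you can sort one column of the matrix directly (a small addition, not in the original notebook):

# Print correlations with accommodation cost, strongest first
print(corr_matrix['Accommodation cost'].sort_values(ascending=False))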

Step 9: One-Hot Encode Categorical Variables

Now we apply one-hot encoding to transform all the categorical variables in the dataset into numerical values so that they are model-ready. This creates a new binary column for every category. We then convert the resulting boolean values to integers (1 or 0) so they work smoothly with our machine learning models.

Use the code listed below:

# List of categorical columns to encode
categorical_cols = ['Destination', 'Traveler gender', 'Traveler nationality',
                    'Accommodation type', 'Transportation type']

# Apply one-hot encoding and convert booleans to integers
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert all boolean dummy columns to integers (ML models expect numeric input).
# Note: astype(int, errors='ignore') also converts the two datetime columns to
# integer nanosecond timestamps, which is why they appear as very large numbers
# in the preview below.
df_encoded = df_encoded.astype(int, errors='ignore')

# Preview the encoded DataFrame
print("Final Encoded DataFrame shape:", df_encoded.shape)
df_encoded.head()

Output:

Final Encoded DataFrame shape: (137, 125)

 

Trip ID | Start date | End date | Duration (days) | Traveler name | Traveler age | Accommodation cost | Transportation cost | Travel Month | Travel Year | ... | Accommodation type_Vacation rental | Accommodation type_Villa | Transportation type_Bus | Transportation type_Car | Transportation type_Car rental | Transportation type_Ferry | Transportation type_Flight | Transportation type_Plane | Transportation type_Subway | Transportation type_Train
1 | 1682899200000000000 | 1683504000000000000 | 7 | John Smith | 35 | 1200 | 600 | 5 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
2 | 1686787200000000000 | 1687219200000000000 | 5 | Jane Doe | 28 | 800 | 500 | 6 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
3 | 1688169600000000000 | 1688774400000000000 | 7 | David Lee | 45 | 1000 | 700 | 7 | 2023 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
4 | 1692057600000000000 | 1693267200000000000 | 14 | Sarah Johnson | 29 | 2000 | 1000 | 8 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
5 | 1694304000000000000 | 1694908800000000000 | 7 | Kim Nguyen | 26 | 700 | 200 | 9 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1

5 rows × 125 columns

Step 10: Train-Test Split

In this step, we will split the dataset into training and testing subsets, with Duration (days) as the target variable. Here is the code to do so:

from sklearn.model_selection import train_test_split

# Define input features (X) and target variable (y)
X = df_encoded.drop(['Duration (days)'], axis=1)
y = df_encoded['Duration (days)']

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Output:

Training set shape: (109, 124)

Test set shape: (28, 124)

Step 11: Train and Compare Regression Models

In this step, we will train the following regression models to predict trip duration based on customer and trip features:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • XGBoost Regressor

We will evaluate them using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R² score. R² compares a model's squared error with that of always predicting the target's mean, so it can even turn negative for a model that does worse than that baseline.

Here is the code:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Drop non-numeric columns like 'Traveler name' or IDs
X = df_encoded.drop(["Duration (days)", "Traveler name"], axis=1)
y = df_encoded["Duration (days)"]

# Train-test split (already done earlier, repeating here for safety)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    print(f"\nModel: {name}")
    print("Mean Absolute Error (MAE):", round(mean_absolute_error(y_test, predictions), 2))
    print("Mean Squared Error (MSE):", round(mean_squared_error(y_test, predictions), 2))
    print("R² Score:", round(r2_score(y_test, predictions), 2))

Output:

Training set shape: (109, 123)

Test set shape: (28, 123)

Model: Linear Regression

Mean Absolute Error (MAE): 0.0

Mean Squared Error (MSE): 0.0

R² Score: 1.0

 

Model: Decision Tree

Mean Absolute Error (MAE): 1.36

Mean Squared Error (MSE): 4.43

R² Score: -1.05

 

Model: Random Forest

Mean Absolute Error (MAE): 1.13

Mean Squared Error (MSE): 1.95

R² Score: 0.1

 

Model: XGBoost

Mean Absolute Error (MAE): 1.35

Mean Squared Error (MSE): 3.27

R² Score: -0.52

Conclusion

To predict trip duration, this project trained and evaluated four regression models: Linear Regression, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor.

  • Linear Regression beat all other models with perfect scores (MAE = 0.0, MSE = 0.0, R² = 1.0). A perfect score deserves scrutiny rather than celebration: the feature set still contains the Start date and End date columns as integer timestamps, and Duration (days) is exactly their difference, so a linear model can reconstruct the target outright.
  • The negative R² scores of the Decision Tree Regressor and XGBoost Regressor show predictions worse than simply using the target variable's mean. Their higher MAE and MSE values also point to poor generalization.
  • The Random Forest Regressor did better than the other tree-based models, yet with an R² score of 0.1 it still fell short of reliable accuracy.

Overall, Linear Regression is the strongest performer on this dataset, but its perfect score warrants a leakage check before relying on it.
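One way to run that check is to drop the raw date columns from the features and retrain; if the perfect score disappears, the dates were doing all the work. A minimal sketch, assuming X and y as defined in Step 11 (this check is not part of the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Remove the timestamp columns that together determine the target exactly
# (duration in days = end date minus start date)
X_no_dates = X.drop(["Start date", "End date"], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_no_dates, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_tr, y_tr)
print("R² without raw date columns:", round(r2_score(y_te, lr.predict(X_te)), 2))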


Reference:
https://colab.research.google.com/drive/1YIxiVyxdWY2GYXewHa6O41Vjrk0rvYM4?usp=sharing

Frequently Asked Questions (FAQs)

1. Why did Linear Regression perform better than other models?

Because the target, Duration (days), is an exact linear function of features the model can see: the start and end dates are encoded as numeric timestamps, and the duration is simply their difference. Linear Regression recovers that relationship exactly, which is why its scores are perfect.

2. What does a negative R² score indicate, such as in the case of Decision Tree and XGBoost?

A negative R² means the model's predictions carry a larger squared error than simply predicting the mean of the target for every trip. In other words, the model generalizes worse than a constant baseline.

3. Why did complex models like Random Forest and XGBoost perform poorly?

With only 137 rows and over 120 one-hot encoded features, tree ensembles have too little data to learn stable splits. They overfit the training set and generalize poorly to the 28 test trips.

4. Should we always choose a model with a higher R²?

No. A suspiciously high R², like the perfect score here, can signal data leakage rather than genuine predictive power. Always check that the features would be available at prediction time and validate on held-out data before trusting the score.

