Predicting Travel Costs Using the Traveler Trip Dataset

By Rohit Sharma


Have you ever pondered how much a trip might cost depending on your age group, the destination, and the length of the trip? In this project, we'll examine the Kaggle traveler trip dataset to find trends in traveler behavior and use machine learning to forecast the overall cost of travel.

We will look at both demographic and trip-related features, such as age, nationality, destination, accommodation type, and transportation type. The objective is to build a regression model to predict the total cost of a trip, which travel agencies can use to design better travel packages.

For more project ideas like this one, check out our blog post on the Top 25+ Essential Data Science Projects GitHub to Explore in 2025.  

What You'll Need to Begin

Following along with this project requires no machine learning expertise, but a little background helps. You will want to know about the following:

  • Basic Python programming (variables, functions, importing libraries)
  • Working with pandas DataFrames to load and manipulate tabular data
  • A rough idea of what regression is: predicting a numeric value from input features

Technologies and Libraries That We Will Use

The project will use the following tools:

  • Python as the programming language, run in Google Colab (or any Jupyter environment)
  • kagglehub to download the dataset
  • pandas for loading and cleaning the data
  • matplotlib and seaborn for visualization
  • scikit-learn for the train-test split, regression models, and evaluation metrics
  • XGBoost for gradient-boosted regression
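Google Colab ships with most of these pre-installed. If you are running elsewhere, a quick check like the sketch below (not part of the original notebook; the package names assume PyPI defaults) flags anything missing:

import importlib.util

# Check that each required library is importable; anything reported as
# "missing" can be installed with pip (e.g., "pip install kagglehub").
for pkg in ["kagglehub", "pandas", "matplotlib", "seaborn", "sklearn", "xgboost"]:
    status = "OK" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")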

Models That Will Be Utilized

Here are the machine learning models that we will be utilizing:

Model Name | Description
Linear Regression | A straightforward model for understanding the linear relationships between cost and features.
Decision Tree Regressor | A tree-based model that predicts cost by splitting the data according to decision rules.
Random Forest Regressor | An ensemble of decision trees that reduces overfitting and improves prediction accuracy.
XGBoost Regressor | A strong gradient boosting model that handles big datasets and intricate relationships efficiently.

How to Build a Trip Cost Prediction Model with the Traveler Trip Dataset

Let’s start building the project from scratch. So, without wasting any more time, let’s begin!

Step 1: Downloading and Loading the Dataset

We will download the traveler trip dataset using the kagglehub library. The dataset contains information about travelers, their destinations, accommodation, transportation, and trip duration.

This is how you would do it in code:

import kagglehub

# Download the latest version of the Traveler Trip Dataset
path = kagglehub.dataset_download("rkiattisak/traveler-trip-data")
print("Path to dataset files:", path)

Output:

Downloading from https://www.kaggle.com/api/v1/datasets/download/rkiattisak/traveler-trip-data?dataset_version_number=1...

100%|██████████| 4.21k/4.21k [00:00<00:00, 4.47MB/s]Extracting files...

Path to dataset files: /root/.cache/kagglehub/datasets/rkiattisak/traveler-trip-data/versions/1
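The download path points to a folder, and Step 2 needs the exact CSV filename inside it. A quick directory listing reveals it (a small addition, not in the original notebook):

import os

# List the files kagglehub extracted so we know the CSV's exact filename
print(os.listdir(path))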

Step 2: Load the Traveler Trip Dataset

We will now load the dataset into a pandas DataFrame. Use the code below to do so:

import pandas as pd
import os

# Build the path to the CSV file inside the folder downloaded in Step 1
df = pd.read_csv(os.path.join(path, "Travel details dataset.csv"))

# Display the first 5 rows
df.head()

Output:

 

Trip ID | Destination | Start date | End date | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost
1 | London, UK | 5/1/2023 | 5/8/2023 | 7.0 | John Smith | 35.0 | Male | American | Hotel | 1200 | Flight | 600
2 | Phuket, Thailand | 6/15/2023 | 6/20/2023 | 5.0 | Jane Doe | 28.0 | Female | Canadian | Resort | 800 | Flight | 500
3 | Bali, Indonesia | 7/1/2023 | 7/8/2023 | 7.0 | David Lee | 45.0 | Male | Korean | Villa | 1000 | Flight | 700
4 | New York, USA | 8/15/2023 | 8/29/2023 | 14.0 | Sarah Johnson | 29.0 | Female | British | Hotel | 2000 | Flight | 1000
5 | Tokyo, Japan | 9/10/2023 | 9/17/2023 | 7.0 | Kim Nguyen | 26.0 | Female | Vietnamese | Airbnb | 700 | Train | 200

Step 3: Explore the Dataset

In this step, we will examine the traveler trip dataset to learn about its composition. This lets us verify data types, null values, and basic statistics before preprocessing.

Use the code given below to understand the dataset:

# Shape of the dataset (rows, columns)
print("Dataset shape:", df.shape)

# Data types and non-null info
print("\nInfo:")
df.info()

# Summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

Output:

Dataset shape: (139, 13)

Info:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 139 entries, 0 to 138

Data columns (total 13 columns):

 #   Column                Non-Null Count  Dtype  

---  ------                --------------  -----  

 0   Trip ID               139 non-null    int64  

 1   Destination           137 non-null    object 

 2   Start date            137 non-null    object 

 3   End date              137 non-null    object 

 4   Duration (days)       137 non-null    float64

 5   Traveler name         137 non-null    object 

 6   Traveler age          137 non-null    float64

 7   Traveler gender       137 non-null    object 

 8   Traveler nationality  137 non-null    object 

 9   Accommodation type    137 non-null    object 

 10  Accommodation cost    137 non-null    object 

 11  Transportation type   136 non-null    object 

 12  Transportation cost   136 non-null    object 

dtypes: float64(2), int64(1), object(10)

memory usage: 14.2+ KB

Summary statistics:

Missing values:

Trip ID                 0

Destination             2

Start date              2

End date                2

Duration (days)         2

Traveler name           2

Traveler age            2

Traveler gender         2

Traveler nationality    2

Accommodation type      2

Accommodation cost      2

Transportation type     3

Transportation cost     3

dtype: int64

What does the output tell us?

The output tells us that:

  • There are 139 rows and 13 columns
  • Most columns have 2 missing values
  • Transportation type and Transportation cost have 3 missing values
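Before deciding how to treat these gaps, it can help to look at the affected rows themselves. A minimal sketch (not part of the original notebook):

# Show every row that contains at least one missing value
print(df[df.isnull().any(axis=1)])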

Step 4: Data Cleaning

In this step, we will clean the cost columns. Accommodation cost and Transportation cost are stored as text (dtype object in the info output above), so we strip all non-numeric characters and convert the columns to floats. Use the code below to do so:

# Use regex to remove all non-numeric characters except dot (.)
df['Accommodation cost'] = df['Accommodation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)
df['Transportation cost'] = df['Transportation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)

# Confirm the conversion
print("Accommodation cost type:", df['Accommodation cost'].dtype)
print("Transportation cost type:", df['Transportation cost'].dtype)

Output:

Accommodation cost type: float64

Transportation cost type: float64

Step 5: Handle Missing Values

Now we have clean columns. Let's handle missing values. To do so, we will: 

  • Check the number of missing values per column.
  • Impute missing numeric fields, such as age and cost, using the median.
  • Impute missing categorical fields, such as gender and transportation type, using the mode.
  • Drop rows with any remaining missing values.

Here is the code to achieve all this:

# Check missing values
print(df.isnull().sum())

# Impute numerical columns using median
df['Accommodation cost'] = df['Accommodation cost'].fillna(df['Accommodation cost'].median())
df['Transportation cost'] = df['Transportation cost'].fillna(df['Transportation cost'].median())
df['Traveler age'] = df['Traveler age'].fillna(df['Traveler age'].median())

# Impute categorical columns using mode
df['Traveler gender'] = df['Traveler gender'].fillna(df['Traveler gender'].mode()[0])
df['Transportation type'] = df['Transportation type'].fillna(df['Transportation type'].mode()[0])

# Drop remaining rows with missing values
df.dropna(inplace=True)

# Final check
print("\nMissing values after cleaning:")
print(df.isnull().sum())

Output:

Trip ID                 0

Destination             2

Start date              2

End date                2

Duration (days)         2

Traveler name           2

Traveler age            2

Traveler gender         2

Traveler nationality    2

Accommodation type      2

Accommodation cost      0

Transportation type     3

Transportation cost     0

dtype: int64

 

Missing values after cleaning:

Trip ID                 0

Destination             0

Start date              0

End date                0

Duration (days)         0

Traveler name           0

Traveler age            0

Traveler gender         0

Traveler nationality    0

Accommodation type      0

Accommodation cost      0

Transportation type     0

Transportation cost     0

dtype: int64

Step 6: Feature Engineering

Now that the data is clean, we will improve model performance and analysis by extracting valuable features from existing columns. Our objective in this step is to:

  • Create a new column for trip duration. To do so, we will use start and end dates.
  • Extract day, month, and year from the start date. This is needed for time-based insights.

Note: We will handle encoding categorical variables later during the model preparation phase.

Use the code mentioned below to accomplish all this:

# Calculate duration (in days) from start and end dates
df['Start date'] = pd.to_datetime(df['Start date'])
df['End date'] = pd.to_datetime(df['End date'])
df['Duration (days)'] = (df['End date'] - df['Start date']).dt.days

# Extract features from Start date
df['Travel Month'] = df['Start date'].dt.month
df['Travel Year'] = df['Start date'].dt.year
df['Travel Day'] = df['Start date'].dt.day

# Optional: encode gender (if required for modeling)
# df['Traveler gender'] = df['Traveler gender'].map({'Male': 0, 'Female': 1})

# Preview the engineered features
print(df[['Start date', 'End date', 'Duration (days)', 'Travel Day', 'Travel Month', 'Travel Year']].head())

Output:

Start date   End date  Duration (days)  Travel Day  Travel Month  \

0 2023-05-01 2023-05-08                7           1             5   

1 2023-06-15 2023-06-20                5          15             6   

2 2023-07-01 2023-07-08                7           1             7   

3 2023-08-15 2023-08-29               14          15             8   

4 2023-09-10 2023-09-17                7          10             9   

 

   Travel Year  

0         2023  

1         2023  

2         2023  

3         2023  

4         2023  

Step 7: Exploratory Data Analysis (EDA)

In this step, we will use visualizations to identify trends in trip duration, monthly travel frequency, and cost relationships. These plots help us understand seasonality, preferences, and spending behavior.

Use the below-mentioned code:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 12))

# Plot 1: Distribution of Trip Durations
plt.subplot(3, 1, 1)
sns.histplot(df['Duration (days)'], bins=10, kde=True, color='skyblue')
plt.title("Distribution of Trip Durations")
plt.xlabel("Duration (days)")
plt.ylabel("Number of Trips")

# Plot 2: Number of Trips Per Month
plt.subplot(3, 1, 2)
sns.countplot(x='Travel Month', data=df, palette='Set2')
plt.title("Trips per Month")
plt.xlabel("Month")
plt.ylabel("Number of Trips")

# Plot 3: Accommodation vs Transportation Cost
plt.subplot(3, 1, 3)
sns.scatterplot(data=df, x='Accommodation cost', y='Transportation cost', hue='Travel Month', palette='viridis')
plt.title("Accommodation vs Transportation Cost")
plt.xlabel("Accommodation Cost")
plt.ylabel("Transportation Cost")
plt.legend(title='Month')

plt.tight_layout()
plt.show()

Output:

  • According to the first plot, most trips last six to eight days. Longer trips are rarer, and the distribution is right-skewed.
  • The middle plot shows that May through September (months 5–9) see the most trips, while March (3) and December (12) have moderate to low activity.
  • The bottom plot shows a positive trend: transportation costs tend to rise with accommodation costs. Some trips reach accommodation costs of up to $8,000 alongside transportation costs of around $3,000. Points are color-coded by month, and the months overlap across the whole cost range, so no single month stands out as a cost outlier.

Step 8: Examine the Numerical Features' Correlation

Having visualized travel trends, let's now examine the relationships between the numerical features. We will use a correlation matrix to study fields such as trip duration, traveler age, and the two cost columns. This helps locate highly correlated fields that could cause redundancy or influence the prediction model.

The code to accomplish this is as follows:

# Select only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# Generate correlation matrix
corr_matrix = numeric_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.show()

Output:

  • According to the heatmap, there is a moderate positive correlation (0.45) between Trip ID and transportation cost. Since Trip ID is just a row identifier, this likely reflects how the rows happen to be ordered rather than a meaningful relationship.
  • Accommodation cost and transportation cost are strongly positively correlated (0.79). Spending more on one typically goes with spending more on the other.
  • There is a weak positive correlation (0.35) between Trip ID and accommodation cost.
  • There is little to no correlation between traveler age and any other feature.

In summary, the two spending columns (transportation and accommodation) are more closely linked to each other than to age or length of stay.
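To read the exact coefficients behind the heatmap rather than eyeballing the colors, you can sort one column of the matrix directly (a small addition, not in the original notebook):

# Print correlations with accommodation cost, strongest first
print(corr_matrix['Accommodation cost'].sort_values(ascending=False))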

Step 9: One-Hot Encode Categorical Variables

Now we apply one-hot encoding to transform all the categorical variables in the dataset into numerical values so that they are model-ready. This creates a new binary column for every category. We then convert the resulting boolean values to integers (1 or 0) so they work smoothly with our machine learning models.

Use the code listed below:

# List of categorical columns to encode
categorical_cols = ['Destination', 'Traveler gender', 'Traveler nationality',
                    'Accommodation type', 'Transportation type']

# Apply one-hot encoding and convert booleans to integers
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert all boolean dummy columns to integers (ML models expect numeric input).
# Note: astype(int, errors='ignore') also converts the two datetime columns to
# integer nanosecond timestamps, which is why they appear as very large numbers
# in the preview below.
df_encoded = df_encoded.astype(int, errors='ignore')

# Preview the encoded DataFrame
print("Final Encoded DataFrame shape:", df_encoded.shape)
df_encoded.head()

Output:

Final Encoded DataFrame shape: (137, 125)

 

Trip ID | Start date | End date | Duration (days) | Traveler name | Traveler age | Accommodation cost | Transportation cost | Travel Month | Travel Year | ... | Accommodation type_Vacation rental | Accommodation type_Villa | Transportation type_Bus | Transportation type_Car | Transportation type_Car rental | Transportation type_Ferry | Transportation type_Flight | Transportation type_Plane | Transportation type_Subway | Transportation type_Train
1 | 1682899200000000000 | 1683504000000000000 | 7 | John Smith | 35 | 1200 | 600 | 5 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
2 | 1686787200000000000 | 1687219200000000000 | 5 | Jane Doe | 28 | 800 | 500 | 6 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
3 | 1688169600000000000 | 1688774400000000000 | 7 | David Lee | 45 | 1000 | 700 | 7 | 2023 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
4 | 1692057600000000000 | 1693267200000000000 | 14 | Sarah Johnson | 29 | 2000 | 1000 | 8 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
5 | 1694304000000000000 | 1694908800000000000 | 7 | Kim Nguyen | 26 | 700 | 200 | 9 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1

5 rows × 125 columns

Step 10: Train-Test Split

In this step, we will split the dataset into training and testing subsets, with Duration (days) as the target variable. Here is the code to do so:

from sklearn.model_selection import train_test_split

# Define input features (X) and target variable (y)
X = df_encoded.drop(['Duration (days)'], axis=1)
y = df_encoded['Duration (days)']

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Output:

Training set shape: (109, 124)

Test set shape: (28, 124)

Step 11: Train and Compare Regression Models

In this step, we will train the following regression models to predict trip duration based on customer and trip features:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • XGBoost Regressor

We will evaluate them using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R² score. R² compares a model's squared error with that of always predicting the target's mean, so it can even turn negative for a model that does worse than that baseline.

Here is the code:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Drop non-numeric columns like 'Traveler name' or IDs
X = df_encoded.drop(["Duration (days)", "Traveler name"], axis=1)
y = df_encoded["Duration (days)"]

# Train-test split (already done earlier, repeating here for safety)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    print(f"\nModel: {name}")
    print("Mean Absolute Error (MAE):", round(mean_absolute_error(y_test, predictions), 2))
    print("Mean Squared Error (MSE):", round(mean_squared_error(y_test, predictions), 2))
    print("R² Score:", round(r2_score(y_test, predictions), 2))

Output:

Training set shape: (109, 123)

Test set shape: (28, 123)

Model: Linear Regression

Mean Absolute Error (MAE): 0.0

Mean Squared Error (MSE): 0.0

R² Score: 1.0

 

Model: Decision Tree

Mean Absolute Error (MAE): 1.36

Mean Squared Error (MSE): 4.43

R² Score: -1.05

 

Model: Random Forest

Mean Absolute Error (MAE): 1.13

Mean Squared Error (MSE): 1.95

R² Score: 0.1

 

Model: XGBoost

Mean Absolute Error (MAE): 1.35

Mean Squared Error (MSE): 3.27

R² Score: -0.52

Conclusion

To predict trip duration, this project trained and evaluated four regression models: Linear Regression, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor.

  • Linear Regression beat all other models with perfect scores (MAE = 0.0, MSE = 0.0, R² = 1.0). A perfect score deserves scrutiny rather than celebration: the feature set still contains the Start date and End date columns as integer timestamps, and Duration (days) is exactly their difference, so a linear model can reconstruct the target outright.
  • The negative R² scores of the Decision Tree Regressor and XGBoost Regressor show predictions worse than simply using the target variable's mean. Their higher MAE and MSE values also point to poor generalization.
  • The Random Forest Regressor did better than the other tree-based models, yet with an R² score of 0.1 it still fell short of reliable accuracy.

Overall, Linear Regression is the strongest performer on this dataset, but its perfect score warrants a leakage check before relying on it.
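One way to run that check is to drop the raw date columns from the features and retrain; if the perfect score disappears, the dates were doing all the work. A minimal sketch, assuming X and y as defined in Step 11 (this check is not part of the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Remove the timestamp columns that together determine the target exactly
# (duration in days = end date minus start date)
X_no_dates = X.drop(["Start date", "End date"], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_no_dates, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_tr, y_tr)
print("R² without raw date columns:", round(r2_score(y_te, lr.predict(X_te)), 2))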


Reference:
https://colab.research.google.com/drive/1YIxiVyxdWY2GYXewHa6O41Vjrk0rvYM4?usp=sharing

Frequently Asked Questions (FAQs)

1. Why did Linear Regression perform better than other models?

Because the target, Duration (days), is an exact linear function of features the model can see: the start and end dates are encoded as numeric timestamps, and the duration is simply their difference. Linear Regression recovers that relationship exactly, which is why its scores are perfect.

2. What does a negative R² score indicate, such as in the case of Decision Tree and XGBoost?

A negative R² means the model's predictions carry a larger squared error than simply predicting the mean of the target for every trip. In other words, the model generalizes worse than a constant baseline.

3. Why did complex models like Random Forest and XGBoost perform poorly?

With only 137 rows and over 120 one-hot encoded features, tree ensembles have too little data to learn stable splits. They overfit the training set and generalize poorly to the 28 test trips.

4. Should we always choose a model with a higher R²?

No. A suspiciously high R², like the perfect score here, can signal data leakage rather than genuine predictive power. Always check that the features would be available at prediction time and validate on held-out data before trusting the score.

