Predicting Travel Costs Using the Traveler Trip Dataset
By Rohit Sharma
Updated on Jul 31, 2025 | 12 min read | 1.35K+ views
Have you ever pondered how much a trip might cost depending on your age group, the destination, and the length of the trip? In this project, we'll examine the Kaggle traveler trip dataset to find trends in traveler behavior and use machine learning to forecast the overall cost of travel.
We will look at both demographic and trip-related features, such as traveler age, nationality, destination, accommodation, and transportation. The objective is to build a regression model that predicts the total cost of a trip, which travel agencies can use to design better travel packages.
Following along with the project requires no machine learning expertise, but a little background helps.
The program will use the following tools: kagglehub, pandas, Matplotlib, seaborn, scikit-learn, and XGBoost.
Here are the machine learning models that we will be utilizing:
Model Name | Description |
Linear Regression | A straightforward model for comprehending the linear relationships between cost and features. |
Decision Tree Regressor | A tree-based model that predicts cost by dividing data according to decision rules. |
Random Forest Regressor | A group of decision trees that decreases overfitting and increases prediction accuracy. |
XGBoost Regressor | A strong gradient boosting model that effectively manages big datasets and intricate relationships. |
Let’s build the project from scratch.
We will download the traveler trip dataset with the kagglehub library. This dataset contains information about the travelers, their destinations, accommodation, transport, and length of stay.
This is how you would do it in code:
import kagglehub
# Download the latest version of the Traveler Trip Dataset
path = kagglehub.dataset_download("rkiattisak/traveler-trip-data")
print("Path to dataset files:", path)
Output:
Downloading from https://www.kaggle.com/api/v1/datasets/download/rkiattisak/traveler-trip-data?dataset_version_number=1...
100%|██████████| 4.21k/4.21k [00:00<00:00, 4.47MB/s]Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/rkiattisak/traveler-trip-data/versions/1
We will now load the traveler trip dataset into a pandas DataFrame. Use the code below to do so:
import pandas as pd
# Correct path to the actual CSV file
df = pd.read_csv("/root/.cache/kagglehub/datasets/rkiattisak/traveler-trip-data/versions/1/Travel details dataset.csv")
# Display the first 5 rows
df.head()
Output:
| Trip ID | Destination | Start date | End date | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost |
0 | 1 | London, UK | 5/1/2023 | 5/8/2023 | 7.0 | John Smith | 35.0 | Male | American | Hotel | 1200 | Flight | 600 |
1 | 2 | Phuket, Thailand | 6/15/2023 | 6/20/2023 | 5.0 | Jane Doe | 28.0 | Female | Canadian | Resort | 800 | Flight | 500 |
2 | 3 | Bali, Indonesia | 7/1/2023 | 7/8/2023 | 7.0 | David Lee | 45.0 | Male | Korean | Villa | 1000 | Flight | 700 |
3 | 4 | New York, USA | 8/15/2023 | 8/29/2023 | 14.0 | Sarah Johnson | 29.0 | Female | British | Hotel | 2000 | Flight | 1000 |
4 | 5 | Tokyo, Japan | 9/10/2023 | 9/17/2023 | 7.0 | Kim Nguyen | 26.0 | Female | Vietnamese | Airbnb | 700 | Train | 200 |
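The loading step above hard-codes the cache directory, which only works on the machine that produced it. As a minimal sketch, assuming the download directory contains at least one CSV file, the file can be located from the directory that kagglehub returns. The helper names `find_csv_files` and `load_first_csv` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def find_csv_files(dataset_dir):
    """Return every CSV file found under dataset_dir, sorted by path."""
    return sorted(Path(dataset_dir).rglob("*.csv"))


def load_first_csv(dataset_dir):
    """Load the first CSV found under dataset_dir into a DataFrame."""
    csv_files = find_csv_files(dataset_dir)
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found under {dataset_dir}")
    return pd.read_csv(csv_files[0])


# Usage, assuming `path` is the directory returned by kagglehub.dataset_download:
# df = load_first_csv(path)
```

If a dataset ships several CSV files, picking the first one is arbitrary, so it is safer to inspect the result of `find_csv_files` before loading.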
In this step, we will examine the dataset's structure so that we can check data types, missing values, and basic statistics before preprocessing.
Use the code given below to understand the dataset:
# Shape of the dataset (rows, columns)
print("Dataset shape:", df.shape)
# Data types and non-null info
print("\nInfo:")
df.info()
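# Summary statistics for numerical columns
# Note: a bare df.describe() is only displayed when it is the last line of a
# notebook cell; wrap it in print() to see the table here
print("\nSummary statistics:")
df.describe()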
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
Output:
Dataset shape: (139, 13)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Trip ID 139 non-null int64
1 Destination 137 non-null object
2 Start date 137 non-null object
3 End date 137 non-null object
4 Duration (days) 137 non-null float64
5 Traveler name 137 non-null object
6 Traveler age 137 non-null float64
7 Traveler gender 137 non-null object
8 Traveler nationality 137 non-null object
9 Accommodation type 137 non-null object
10 Accommodation cost 137 non-null object
11 Transportation type 136 non-null object
12 Transportation cost 136 non-null object
dtypes: float64(2), int64(1), object(10)
memory usage: 14.2+ KB
Summary statistics:
Missing values:
Trip ID 0
Destination 2
Start date 2
End date 2
Duration (days) 2
Traveler name 2
Traveler age 2
Traveler gender 2
Traveler nationality 2
Accommodation type 2
Accommodation cost 2
Transportation type 3
Transportation cost 3
dtype: int64
What does the output tell us?
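The output tells us that the dataset has 139 rows and 13 columns, that most columns have 2 missing values (the two transportation columns have 3), and that both cost columns are stored as text (object) rather than numbers, so they must be converted before modeling.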
In this step, we will clean the data. Use the below-mentioned code to do so:
# Use regex to remove all non-numeric characters except dot (.)
df['Accommodation cost'] = df['Accommodation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)
df['Transportation cost'] = df['Transportation cost'].str.replace(r'[^\d.]', '', regex=True).astype(float)
# Confirm the conversion
print("Accommodation cost type:", df['Accommodation cost'].dtype)
print("Transportation cost type:", df['Transportation cost'].dtype)
Output:
Accommodation cost type: float64
Transportation cost type: float64
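To see what the regular expression in the cleaning step does, here is a small sketch on made-up values. The sample strings are assumptions for illustration, and the real columns may use other formats. It also swaps `astype(float)` for `pd.to_numeric(..., errors="coerce")`, which turns an entry with no digits at all into a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values; the real column may contain other formats.
raw = pd.Series(["$1,200", "1200 USD", "800", "N/A", None])

# Remove everything except digits and the decimal point.
stripped = raw.str.replace(r"[^\d.]", "", regex=True)

# Convert to numbers; strings left empty by the stripping become NaN.
cleaned = pd.to_numeric(stripped, errors="coerce")

print(cleaned)
```

Note that removing the comma from "$1,200" treats it as a thousands separator, which would be wrong for data that uses a comma as the decimal mark.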
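Now that the cost columns are numeric, let's handle missing values. To do so, we will fill numerical columns with their median, fill categorical columns with their mode, and drop any rows that still contain missing values.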
Here is the code to achieve all this:
# Check missing values
print(df.isnull().sum())
# Impute numerical columns using median
df['Accommodation cost'] = df['Accommodation cost'].fillna(df['Accommodation cost'].median())
df['Transportation cost'] = df['Transportation cost'].fillna(df['Transportation cost'].median())
df['Traveler age'] = df['Traveler age'].fillna(df['Traveler age'].median())
# Impute categorical columns using mode
df['Traveler gender'] = df['Traveler gender'].fillna(df['Traveler gender'].mode()[0])
df['Transportation type'] = df['Transportation type'].fillna(df['Transportation type'].mode()[0])
# Drop remaining rows with missing values
df.dropna(inplace=True)
# Final check
print("\nMissing values after cleaning:")
print(df.isnull().sum())
Output:
Trip ID 0
Destination 2
Start date 2
End date 2
Duration (days) 2
Traveler name 2
Traveler age 2
Traveler gender 2
Traveler nationality 2
Accommodation type 2
Accommodation cost 0
Transportation type 3
Transportation cost 0
dtype: int64
Missing values after cleaning:
Trip ID 0
Destination 0
Start date 0
End date 0
Duration (days) 0
Traveler name 0
Traveler age 0
Traveler gender 0
Traveler nationality 0
Accommodation type 0
Accommodation cost 0
Transportation type 0
Transportation cost 0
dtype: int64
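The imputation above names each column by hand. As a sketch of the same idea in a more general form, the hypothetical helper below (`impute_median_mode` is not part of the original notebook) fills numeric columns with the median and all other columns with the mode, shown here on a toy frame.

```python
import numpy as np
import pandas as pd


def impute_median_mode(frame):
    """Return a copy with numeric NaNs set to the median and others to the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to learn a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy data for illustration only.
toy = pd.DataFrame({
    "cost": [100.0, 300.0, np.nan, 200.0],
    "transport": ["Flight", None, "Flight", "Train"],
})
filled = impute_median_mode(toy)
print(filled)
```

The median is less sensitive to a few very expensive trips than the mean, which is one reason it is a common default for skewed cost data.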
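Now that the data is clean, we will improve model performance and analysis by extracting useful features from existing columns. Our objective in this step is to recompute the trip duration from the start and end dates and to extract the month, year, and day of travel from the start date.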
Note: We will handle encoding categorical variables later during the model preparation phase.
Use the code mentioned below to accomplish all this:
# Calculate duration (in days) from start and end dates
df['Start date'] = pd.to_datetime(df['Start date'])
df['End date'] = pd.to_datetime(df['End date'])
df['Duration (days)'] = (df['End date'] - df['Start date']).dt.days
# Extract features from Start date
df['Travel Month'] = df['Start date'].dt.month
df['Travel Year'] = df['Start date'].dt.year
df['Travel Day'] = df['Start date'].dt.day
# Optional: encode gender (if required for modeling)
# df['Traveler gender'] = df['Traveler gender'].map({'Male': 0, 'Female': 1})
# Preview the engineered features
print(df[['Start date', 'End date', 'Duration (days)', 'Travel Day', 'Travel Month', 'Travel Year']].head())
Output:
Start date End date Duration (days) Travel Day Travel Month \
0 2023-05-01 2023-05-08 7 1 5
1 2023-06-15 2023-06-20 5 15 6
2 2023-07-01 2023-07-08 7 1 7
3 2023-08-15 2023-08-29 14 15 8
4 2023-09-10 2023-09-17 7 10 9
Travel Year
0 2023
1 2023
2 2023
3 2023
4 2023
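Recomputing the duration overwrites the value that shipped with the dataset, so it can be worth checking first whether the two agree. The sketch below does this on toy rows, where the second recorded duration is deliberately wrong. It also passes an explicit month-first format, which is an assumption based on the sample rows shown earlier (for example, 6/15/2023).

```python
import pandas as pd

# Toy rows for illustration; the second recorded duration is deliberately wrong.
trips = pd.DataFrame({
    "Start date": ["5/1/2023", "6/15/2023"],
    "End date": ["5/8/2023", "6/20/2023"],
    "Duration (days)": [7.0, 6.0],
})

# An explicit format avoids any day-first versus month-first ambiguity.
start = pd.to_datetime(trips["Start date"], format="%m/%d/%Y")
end = pd.to_datetime(trips["End date"], format="%m/%d/%Y")
computed = (end - start).dt.days

# Rows where the recorded duration disagrees with the dates.
mismatch = trips[computed != trips["Duration (days)"]]
print(mismatch)
```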
In order to identify trends in trip length, monthly travel frequency, and cost relationships, we will employ visualizations in this step. We can better comprehend seasonality, preferences, and budgetary behavior with the aid of these plots.
Use the below-mentioned code:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15, 12))
# Plot 1: Distribution of Trip Durations
plt.subplot(3, 1, 1)
sns.histplot(df['Duration (days)'], bins=10, kde=True, color='skyblue')
plt.title("Distribution of Trip Durations")
plt.xlabel("Duration (days)")
plt.ylabel("Number of Trips")
# Plot 2: Number of Trips Per Month
plt.subplot(3, 1, 2)
sns.countplot(x='Travel Month', data=df, palette='Set2')
plt.title("Trips per Month")
plt.xlabel("Month")
plt.ylabel("Number of Trips")
# Plot 3: Accommodation vs Transportation Cost
plt.subplot(3, 1, 3)
sns.scatterplot(data=df, x='Accommodation cost', y='Transportation cost', hue='Travel Month', palette='viridis')
plt.title("Accommodation vs Transportation Cost")
plt.xlabel("Accommodation Cost")
plt.ylabel("Transportation Cost")
plt.legend(title='Month')
plt.tight_layout()
plt.show()
Output:
Having visualized travel trends, let’s now examine the relationships between numerical features. We will use a correlation matrix to look at how features such as trip duration, traveler age, accommodation cost, and transportation cost relate to one another. This helps locate highly correlated fields that could be redundant or that could affect the prediction model.
The code to accomplish this is as follows:
# Select only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=['int64', 'float64'])
# Generate correlation matrix
corr_matrix = numeric_df.corr()
# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.show()
Output:
In summary, transportation and accommodation costs are more closely correlated with each other than with traveler age or length of stay.
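A heatmap is easy to scan but harder to quote from. As a minimal sketch, the hypothetical helper below (`top_correlations` is not part of the original notebook) lists the feature pairs with the largest absolute Pearson correlation, shown on toy data.

```python
import numpy as np
import pandas as pd


def top_correlations(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep the upper triangle only, so each pair appears once and the
    # diagonal (every feature against itself) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data for illustration: b is exactly twice a.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
top = top_correlations(toy, n=1)
print(top)
```

On the real data this could be called as `top_correlations(df)` after the feature engineering step.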
Now we apply one-hot encoding to transform the categorical variables in the dataset into numerical values so that they are model-ready. This creates a new binary column for every category. To make the resulting boolean values compatible with our machine learning models, we then convert them to integers (1 or 0).
Use the code listed below:
# List of categorical columns to encode
categorical_cols = ['Destination', 'Traveler gender', 'Traveler nationality',
'Accommodation type', 'Transportation type']
# Apply one-hot encoding and convert booleans to integers
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
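# Convert boolean columns to integers (ML models expect numeric input)
# Note: this cast also turns the datetime columns into integer nanoseconds
# since the epoch and the float columns into integers, as the preview shows
df_encoded = df_encoded.astype(int, errors='ignore')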
# Preview the encoded DataFrame
print("Final Encoded DataFrame shape:", df_encoded.shape)
df_encoded.head()
Output:
Final Encoded DataFrame shape: (137, 125)
| Trip ID | Start date | End date | Duration (days) | Traveler name | Traveler age | Accommodation cost | Transportation cost | Travel Month | Travel Year | ... | Accommodation type_Vacation rental | Accommodation type_Villa | Transportation type_Bus | Transportation type_Car | Transportation type_Car rental | Transportation type_Ferry | Transportation type_Flight | Transportation type_Plane | Transportation type_Subway | Transportation type_Train |
0 | 1 | 1682899200000000000 | 1683504000000000000 | 7 | John Smith | 35 | 1200 | 600 | 5 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 2 | 1686787200000000000 | 1687219200000000000 | 5 | Jane Doe | 28 | 800 | 500 | 6 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 3 | 1688169600000000000 | 1688774400000000000 | 7 | David Lee | 45 | 1000 | 700 | 7 | 2023 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 4 | 1692057600000000000 | 1693267200000000000 | 14 | Sarah Johnson | 29 | 2000 | 1000 | 8 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 5 | 1694304000000000000 | 1694908800000000000 | 7 | Kim Nguyen | 26 | 700 | 200 | 9 | 2023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 125 columns
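The preview shows the start and end dates as very large integers, a side effect of casting the whole frame to int. One alternative, sketched here on toy data, is to ask `pd.get_dummies` for integer dummies directly through its `dtype` parameter, which leaves every other column untouched. This is a possible variation, not the approach used in the original notebook.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Start date": pd.to_datetime(["2023-05-01", "2023-06-15"]),
    "Transportation type": ["Flight", "Train"],
})

# dtype=int makes the dummy columns 0/1 integers without a frame-wide cast.
encoded = pd.get_dummies(
    toy, columns=["Transportation type"], drop_first=True, dtype=int
)

print(encoded.dtypes)
```

With this variation the date columns stay as datetimes, so they would need to be dropped or converted explicitly before fitting a model.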
In this step, we will split the dataset into training and testing subsets. Here is the code to do so:
from sklearn.model_selection import train_test_split
# Define input features (X) and target variable (y)
X = df_encoded.drop(['Duration (days)'], axis=1)
y = df_encoded['Duration (days)']
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
Output:
Training set shape: (109, 124)
Test set shape: (28, 124)
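The split above uses Duration (days) as the target. If the goal is the total trip cost described in the introduction, one possible setup is sketched below. The dataset has no total cost column, so defining it as accommodation cost plus transportation cost is an assumption, and `make_cost_target` is a hypothetical helper.

```python
import pandas as pd


def make_cost_target(frame):
    """Split a frame into features and an assumed 'total cost' target."""
    cost_cols = ["Accommodation cost", "Transportation cost"]
    y = frame[cost_cols].sum(axis=1)
    # Drop the cost components so the target cannot be rebuilt from them.
    X = frame.drop(columns=cost_cols)
    return X, y


# Toy data for illustration only.
toy = pd.DataFrame({
    "Accommodation cost": [1200, 800],
    "Transportation cost": [600, 500],
    "Traveler age": [35, 28],
})
X_cost, y_cost = make_cost_target(toy)
print(y_cost.tolist())
```

On the real data this could be called as `make_cost_target(df_encoded)` before `train_test_split`.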
In this step, we will train four regression models (Linear Regression, Decision Tree, Random Forest, and XGBoost) to predict trip duration from traveler and trip features.
We will evaluate them using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R² score.
Here is the code:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Drop the target and the non-numeric 'Traveler name' column (Trip ID is kept)
X = df_encoded.drop(["Duration (days)", "Traveler name"], axis=1)
y = df_encoded["Duration (days)"]
# Redo the train-test split, because X no longer contains 'Traveler name'
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
# Initialize models
models = {
"Linear Regression": LinearRegression(),
"Decision Tree": DecisionTreeRegressor(random_state=42),
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}
# Train and evaluate each model
for name, model in models.items():
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"\nModel: {name}")
print("Mean Absolute Error (MAE):", round(mean_absolute_error(y_test, predictions), 2))
print("Mean Squared Error (MSE):", round(mean_squared_error(y_test, predictions), 2))
print("R² Score:", round(r2_score(y_test, predictions), 2))
Output:
Training set shape: (109, 123)
Test set shape: (28, 123)
Model: Linear Regression
Mean Absolute Error (MAE): 0.0
Mean Squared Error (MSE): 0.0
R² Score: 1.0
Model: Decision Tree
Mean Absolute Error (MAE): 1.36
Mean Squared Error (MSE): 4.43
R² Score: -1.05
Model: Random Forest
Mean Absolute Error (MAE): 1.13
Mean Squared Error (MSE): 1.95
R² Score: 0.1
Model: XGBoost
Mean Absolute Error (MAE): 1.35
Mean Squared Error (MSE): 3.27
R² Score: -0.52
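Some of the R² scores above are negative, which can look like a mistake. R² compares a model's squared error with that of always predicting the mean of the test targets: 0 means no better than the mean, and a negative value means worse. The sketch below computes the three metrics by hand with NumPy on toy numbers. The function name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                    # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean predictor's squared error
    r2 = 1.0 - ss_res / ss_tot                      # undefined if y_true is constant
    return mae, mse, r2


# Toy values for illustration only.
y_true = [5, 7, 7, 9]
baseline = regression_metrics(y_true, [7, 7, 7, 7])  # always predict the mean
worse = regression_metrics(y_true, [9, 7, 7, 5])     # errors twice as large

print("Baseline:", baseline)
print("Worse:", worse)
```

Here the mean predictor gets an R² of 0, while the second set of predictions gets a negative R² because its squared error is larger than the baseline's.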
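This project trained and evaluated four regression models for predicting trip duration: Linear Regression, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor.
Linear Regression scores best on this split, with an R² of 1.0, but the perfect fit should be read with caution. Duration (days) was computed as End date minus Start date, and both date columns remain in the feature set as integer timestamps, so a linear model can reconstruct the target exactly. The tree-based models, which cannot represent that subtraction as directly, score between -1.05 and 0.1 on a test set of only 28 trips.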
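A perfect Linear Regression score is consistent with target leakage: the duration is the end date minus the start date, and both dates are among the features. The sketch below reproduces the effect with a plain least-squares fit on toy dates. It measures dates in days from the earliest start rather than in raw nanoseconds, an assumption made only to keep the numbers readable.

```python
import numpy as np
import pandas as pd

# Toy dates for illustration only.
start = pd.to_datetime(pd.Series(["2023-05-01", "2023-06-15", "2023-07-01", "2023-08-15"]))
end = pd.to_datetime(pd.Series(["2023-05-08", "2023-06-20", "2023-07-08", "2023-08-29"]))
duration = (end - start).dt.days.to_numpy(dtype=float)

# Features: start and end as days since the earliest start, plus an intercept.
origin = start.min()
X = np.column_stack([
    (start - origin).dt.days,
    (end - origin).dt.days,
    np.ones(len(start)),
])

# Ordinary least squares, the same objective LinearRegression minimizes.
coef, *_ = np.linalg.lstsq(X, duration, rcond=None)
predicted = X @ coef

print("Coefficients (start, end, intercept):", np.round(coef, 6))
print("Max absolute error:", np.abs(predicted - duration).max())
```

The fit recovers weights of about -1 for the start date and +1 for the end date, which is the subtraction used to build the target. Dropping the date columns from the features before training would give a more honest comparison of the models.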
Reference:
https://colab.research.google.com/drive/1YIxiVyxdWY2GYXewHa6O41Vjrk0rvYM4?usp=sharing