Fraud Detection in Transactions with Python: A Machine Learning Project
By Rohit Sharma
Updated on Jul 28, 2025 | 10 min read | 1.24K+ views
With the rise of digital payments, fraudulent transactions have become more frequent and harder to detect. Traditional systems often fail to catch evolving fraud patterns.
In this Fraud Detection in Transactions project, you’ll solve that issue by using real-world credit card transaction data to train models that can identify suspicious behavior. You'll apply techniques such as anomaly detection, isolation forests, and deep learning to classify transactions as fraudulent or genuine.
Want to turn skills into a career? Learn Python, Machine Learning, and more with upGrad’s job-ready Data Science Courses, built to get you hired faster. Explore now.
Build confidence through code. Explore top Python data science projects and start creating work that stands out to recruiters.
Before starting this Fraud Detection in Transactions project, it is helpful to have a basic understanding of Python programming, core machine learning concepts, and working with tabular data in NumPy and Pandas.
Also Read: PyTorch vs TensorFlow: Making the Right Choice for 2025!
Level up your data science game with upGrad’s top-rated courses. Get mentored by industry pros, build real skills, and fast-track your path to a standout career.
This project is perfect if you’re comfortable with Python and want practical experience detecting fraud in real-world transaction data. You’ll learn how to identify fraud patterns, apply anomaly detection techniques, and build machine learning models.
Here are the key tools and Python libraries we’ll use to build and evaluate the Fraud Detection system:
Tool / Library | Purpose
Python | Core language for building the end-to-end fraud detection pipeline
NumPy | Efficient handling of arrays and numerical computations
Pandas | Loading, cleaning, and exploring transaction datasets
Scikit-learn | Implementing machine learning models and anomaly detection techniques
TensorFlow | Designing and training deep learning models for advanced fraud detection
Google Colab | Cloud-based environment to run, test, and visualize your project
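All of these come preinstalled in Google Colab. If you work in a fresh local environment instead, a minimal one-line setup (versions unpinned; adjust as needed) is:

!pip install numpy pandas scikit-learn tensorflow matplotlib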
Also Read: Step-by-Step Guide to Learning Python for Data Science
To identify fraudulent transactions effectively, this project uses machine learning and anomaly detection techniques tailored for financial data: an unsupervised Isolation Forest to flag outliers, and a supervised neural network trained on labeled transactions.
Also Read: Explaining 5 Layers of Convolutional Neural Network
This section guides you through each stage of building a Fraud Detection in Transactions model using machine learning and deep learning methods:
Now let’s get started with detecting fraud in transactions.
Download the transaction data from Kaggle by searching "Fraud Detection in Transactions," downloading the ZIP file, extracting it, and using the CSV file for analysis.
After downloading the dataset, upload the extracted CSV file to Google Colab using the code below:
from google.colab import files

# Opens a file picker in the notebook; select the extracted transaction CSV
uploaded = files.upload()
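Alternatively, you can upload the ZIP itself and extract it inside Colab. A minimal sketch (the ZIP filename below is an assumption; match it to whatever Kaggle names your download):

import zipfile

# Hypothetical filename: replace with the actual name of your Kaggle download
with zipfile.ZipFile('fraud-detection-in-transactions.zip') as zf:
    zf.extractall('.')  # extracts the CSV into the working directory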
We start by importing the required libraries and loading the dataset.
import pandas as pd
# Load the dataset from CSV file
df = pd.read_csv('transaction.csv')
# Check class distribution to see how balanced the dataset is
# 'Class' column: 0 = Normal Transaction, 1 = Fraud
print("\nClass distribution:\n", df['Class'].value_counts(normalize=True))
# Preview the first 5 rows of the dataset
print("\nFirst 5 rows:\n", df.head())
Output:
Class distribution:
Class
0 0.998273
1 0.001727
Name: proportion, dtype: float64
First 5 rows:
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 V25 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
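Before preprocessing, a quick optional check (not part of the original walkthrough) confirms the dataset is complete and shows the scale of the Amount column:

# Optional sanity checks before preprocessing
print("Missing values:", df.isnull().sum().sum())  # expected to be 0 for this dataset
print(df['Amount'].describe())                     # amounts vary widely, which is why we scale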
Also Read: Top 6 Python IDEs of 2025 That Will Change Your Workflow!
Before training the model, we need to separate the target label (Class) from the features, scale the values for consistency, and split the dataset into training and testing sets.
The code for this step is below:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Step 1: Separate features and target label
X = df.drop('Class', axis=1) # Features (all columns except 'Class')
y = df['Class'] # Target (0 for normal, 1 for fraud)
# Step 2: Scale the features
# Scaling standardizes the values to have mean 0 and standard deviation 1
# This helps models converge faster and perform better
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Split the data into training and testing sets
# Stratified split ensures the same proportion of fraud cases in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y,
    test_size=0.2,    # 20% of data goes to the test set
    stratify=y,       # Maintain class balance
    random_state=42   # Reproducible results
)
This setup prepares the data for feeding into machine learning models.
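As an optional sanity check (not shown in the original output), you can verify that the stratified split preserved the ~0.17% fraud rate in both sets:

# Fraud rate should match in train and test thanks to stratify=y
print("Train fraud rate:", y_train.mean())
print("Test fraud rate: ", y_test.mean())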
In this step, we apply the Isolation Forest algorithm, an unsupervised method for anomaly detection. It works well when fraudulent transactions are rare and behave differently from the majority of the data.
Here is the code:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
# Step 1: Initialize the Isolation Forest model
# n_estimators = number of trees
# contamination = expected proportion of frauds (here, ~1%)
iso_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
# Step 2: Train the model on the training data (unsupervised)
iso_forest.fit(X_train)
# Step 3: Predict on the test data
# Output: -1 indicates anomaly (possible fraud), 1 indicates normal
y_pred_iso = iso_forest.predict(X_test)
# Step 4: Convert predictions to match target format
# 1 = fraud, 0 = normal
y_pred_iso = [1 if p == -1 else 0 for p in y_pred_iso]
# Step 5: Evaluate performance using classification metrics
print("Classification Report (Isolation Forest):")
print(classification_report(y_test, y_pred_iso, digits=4)) # Precision, recall, f1-score
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_iso)) # Shows true/false positives and negatives
Output:
Classification Report (Isolation Forest):
precision recall f1-score support
0 0.9994 0.9903 0.9948 56864
1 0.1052 0.6633 0.1816 98
accuracy 0.9897 56962
macro avg 0.5523 0.8268 0.5882 56962
weighted avg 0.9979 0.9897 0.9934 56962
Confusion Matrix:
[[56311 553]
[ 33 65]]
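Beyond the hard -1/1 labels, Isolation Forest also exposes a continuous anomaly score through scikit-learn's decision_function, where lower means more anomalous. A minimal sketch for ranking the most suspicious test transactions:

import numpy as np

# Lower decision_function scores = more anomalous
scores = iso_forest.decision_function(X_test)

# Indices of the 10 most suspicious test transactions
top10 = np.argsort(scores)[:10]
print("Most anomalous rows:", top10)
print("True labels:        ", y_test.iloc[top10].values)

Ranking by score lets an analyst review the highest-risk transactions first instead of relying on a single contamination cutoff.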
Also Read: CNN vs. RNN: Key Differences and Applications Explained
Now we build a neural network using TensorFlow/Keras to classify transactions as fraudulent or not. This is a supervised binary classification model using a dense feedforward network.
Here is the code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
# Step 1: Define the architecture of the neural network
model = Sequential([
    Dense(32, activation='relu', input_shape=(X_train.shape[1],)),  # Input layer with 32 units
    Dropout(0.2),                   # Dropout to prevent overfitting
    Dense(64, activation='relu'),   # Hidden layer with 64 units
    Dropout(0.3),                   # More dropout
    Dense(32, activation='relu'),   # Another hidden layer
    Dense(1, activation='sigmoid')  # Output layer for binary classification (fraud or not)
])
# Step 2: Compile the model
# - Adam optimizer: adaptive learning rate
# - Binary crossentropy: loss for binary classification
# - Accuracy: to monitor model performance
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Step 3: Use EarlyStopping to avoid overfitting
# Stops training if validation loss doesn’t improve for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Step 4: Train the model on training data
# - validation_split: 20% of training data used for validation
# - batch_size: number of samples processed before model update
# - verbose=1: prints training progress
history = model.fit(
    X_train, y_train,
    epochs=15,
    batch_size=512,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=1
)
Output:
Epoch 1/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.9842 - loss: 0.1136 - val_accuracy: 0.9982 - val_loss: 0.0055
Epoch 2/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9989 - loss: 0.0050 - val_accuracy: 0.9994 - val_loss: 0.0039
Epoch 3/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9993 - loss: 0.0041 - val_accuracy: 0.9994 - val_loss: 0.0038
Epoch 4/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9993 - loss: 0.0034 - val_accuracy: 0.9994 - val_loss: 0.0036
Epoch 5/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9991 - loss: 0.0037 - val_accuracy: 0.9994 - val_loss: 0.0035
Epoch 6/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9993 - loss: 0.0035 - val_accuracy: 0.9994 - val_loss: 0.0035
Epoch 7/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.9991 - loss: 0.0035 - val_accuracy: 0.9995 - val_loss: 0.0035
Epoch 8/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9994 - loss: 0.0030 - val_accuracy: 0.9995 - val_loss: 0.0035
Epoch 9/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9993 - loss: 0.0032 - val_accuracy: 0.9994 - val_loss: 0.0036
Epoch 10/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.9993 - loss: 0.0035 - val_accuracy: 0.9994 - val_loss: 0.0035
Epoch 11/15
357/357 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9994 - loss: 0.0030 - val_accuracy: 0.9995 - val_loss: 0.0036
Conclusion: The model learns the patterns of normal and fraudulent transactions; early stopping halted training after epoch 11 of 15, once validation loss had plateaued.
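Note that with only ~0.17% fraud, the loss above is dominated by normal transactions. One common refinement, not used in the run above, is to weight the fraud class more heavily. A minimal sketch using scikit-learn's compute_class_weight with Keras's class_weight argument:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Balanced weights: the rare fraud class receives a much larger weight
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
class_weight = {0: weights[0], 1: weights[1]}

# Pass class_weight=class_weight to model.fit(...) to penalize missed frauds more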
Next, we evaluate the trained model on the held-out test set and convert its predicted probabilities into class labels.
Here is the code:
# Step 1: Evaluate the trained model on test data
# - Returns loss and accuracy
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {accuracy:.4f}")
# Step 2: Predict class probabilities on the test set
# - Model outputs probabilities between 0 and 1
y_pred_probs = model.predict(X_test)
# Step 3: Convert probabilities to binary class labels
# - Threshold of 0.5: if probability > 0.5, predict fraud (1); else, normal (0)
y_pred_dl = (y_pred_probs > 0.5).astype(int).ravel()
# Step 4: Evaluate using classification metrics
from sklearn.metrics import classification_report, confusion_matrix
print("Classification Report (Deep Learning):")
print(classification_report(y_test, y_pred_dl, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dl))
Output:
Classification Report (Deep Learning):
precision recall f1-score support
0 0.9997 0.9997 0.9997 56864
1 0.8247 0.8163 0.8205 98
accuracy 0.9994 56962
macro avg 0.9122 0.9080 0.9101 56962
weighted avg 0.9994 0.9994 0.9994 56962
Confusion Matrix:
[[56847 17]
[ 18 80]]
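Because fraud is so rare, the default 0.5 cutoff is not necessarily the best operating point. A sketch that scans thresholds with scikit-learn's precision_recall_curve and picks the one maximizing F1 (one reasonable heuristic, not the article's method):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Raw sigmoid probabilities for the test set
probs = model.predict(X_test).ravel()

precision, recall, thresholds = precision_recall_curve(y_test, probs)

# F1 at each candidate threshold (thresholds has one fewer entry than precision/recall)
f1 = 2 * precision * recall / (precision + recall + 1e-9)
best = np.argmax(f1[:-1])
print(f"Best threshold: {thresholds[best]:.3f}  F1: {f1[best]:.4f}")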
Let’s plot how the model performed during training in terms of accuracy and loss over epochs. This helps you see whether the model overfitted or underfitted.
import matplotlib.pyplot as plt
# Function to plot training and validation accuracy/loss
def plot_training_history(history):
    plt.figure(figsize=(12, 4))

    # Plot 1: Accuracy over epochs
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Train Accuracy')    # Training accuracy
    plt.plot(history.history['val_accuracy'], label='Val Accuracy')  # Validation accuracy
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Model Accuracy')
    plt.legend()

    # Plot 2: Loss over epochs
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Train Loss')    # Training loss
    plt.plot(history.history['val_loss'], label='Val Loss')  # Validation loss
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Model Loss')
    plt.legend()

    # Layout adjustment
    plt.tight_layout()
    plt.show()
# Call the function to display plots
plot_training_history(history)
Output: side-by-side plots of training vs. validation accuracy and loss over the epochs.
Conclusion: the training and validation curves stay close together throughout, indicating the model neither overfits nor underfits.
We built and evaluated two models, an Isolation Forest and a neural network, to detect fraudulent transactions. After preprocessing and scaling the data, both models were trained and tested.
Isolation Forest gave a quick unsupervised baseline, while the deep learning model achieved higher accuracy and handled class imbalance better. Overall, the project showed how combining preprocessing, anomaly detection, and neural networks can effectively flag fraud in transaction data.
Colab Link:
https://colab.research.google.com/drive/1PEQF-F3GZH7Y-90KyEY1B-GlcsJHptV4?usp=sharing