This project tackles a highly imbalanced credit card fraud detection problem, where the goal is to maximize fraud detection while controlling false positive alerts that negatively impact customer experience.
Fraud detection is treated as a cost-sensitive classification problem, where false negatives (missed fraud) typically incur significantly higher cost than false positives.
The dataset contains anonymized credit card transactions:
- Features V1–V28 are PCA-transformed
- Time and Amount represent transaction time and value
- Target variable Class:
  - 0 → Normal transaction
  - 1 → Fraudulent transaction
Fraud cases represent approximately 0.17% of the dataset, making traditional accuracy-based evaluation misleading.
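To see why accuracy is misleading at this imbalance level, consider a sketch with synthetic labels at roughly the same fraud rate: a classifier that never flags fraud still scores above 99.8% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels mirroring the ~0.17% fraud rate described above.
rng = np.random.default_rng(42)
y_true = (rng.random(100_000) < 0.0017).astype(int)

# A useless "model" that flags nothing as fraud...
y_pred = np.zeros_like(y_true)

# ...still achieves >99.8% accuracy, which is why PR-AUC is used instead.
acc = accuracy_score(y_true, y_pred)
print(f"accuracy of all-normal classifier: {acc:.4f}")
```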
The dataset is excluded from version control and should be placed locally under the data/ directory.
Dataset source: Kaggle — Credit Card Fraud Dataset.
Trained models and preprocessing artifacts are persisted locally for reproducibility but are excluded from version control.
fraud-detection-ml/
├── data/ (local only, excluded from version control)
│
├── notebooks/
│ ├── 01_eda.ipynb
│ ├── 02_preprocessing.ipynb
│ ├── 03_modeling.ipynb
│ ├── 04_evaluation.ipynb
│ ├── 05_model_comparison.ipynb
│ ├── 06_cost_evaluation.ipynb
│ └── 07_inference.ipynb
│
├── images/
│ └── pr_curve_comparison.png
│
├── requirements.txt
└── README.md
Raw Dataset (creditcard.csv)
↓
01_eda — Exploratory Data Analysis
↓
02_preprocessing
• Stratified train/test split
• Feature scaling (Time, Amount)
• Numeric type enforcement
• Artifact persistence
↓
03_modeling
• Logistic Regression (baseline)
• Random Forest
• XGBoost
↓
04_evaluation
• PR-AUC / ROC-AUC
• Precision–Recall analysis
• Baseline threshold selection
↓
05_model_comparison
• Cross-model comparison
• Model-specific threshold tuning
↓
06_cost_evaluation
• Expected financial loss analysis
• Cost-based threshold optimization
• Final model selection
↓
07_inference
• Production-ready inference
• Config-driven thresholding
• API-ready design
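The 02_preprocessing step in the pipeline above can be sketched as follows. This is a minimal illustration on synthetic data (column names match the dataset; the sample size and fraud count are invented): stratified splitting preserves the fraud ratio, and only Time and Amount are scaled because V1–V28 are already PCA outputs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for creditcard.csv.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, n),  # seconds over two days
    "Amount": rng.exponential(88.0, n),  # skewed transaction values
    "Class": 0,
})
df.loc[:8, "Class"] = 1  # ~0.18% fraud, mirroring the real imbalance

# Stratified split preserves the tiny fraud ratio in both partitions.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=42
)

# Fit the scaler on the training split only to avoid leakage; in the
# project the fitted scaler is persisted as a preprocessing artifact.
scaler = StandardScaler().fit(train[["Time", "Amount"]])
train[["Time", "Amount"]] = scaler.transform(train[["Time", "Amount"]])
test[["Time", "Amount"]] = scaler.transform(test[["Time", "Amount"]])
```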
- Severe class imbalance handled using class-weighted training
- PR-AUC prioritized over accuracy due to extreme imbalance
- Probability-based evaluation used instead of hard predictions
- Threshold tuning aligned with operational and business risk
- Preprocessing, modeling, and evaluation are fully decoupled to resemble real-world ML pipelines
- Different probability distributions across models required model-specific threshold selection
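The class-weighting and probability-based evaluation choices above can be illustrated on toy data (the data and feature shift here are invented for demonstration): `class_weight="balanced"` reweights the loss by inverse class frequency, and PR-AUC is computed from predicted probabilities via average precision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

# Toy imbalanced data; in the project this would be the preprocessed features.
rng = np.random.default_rng(1)
X = rng.normal(size=(4_000, 5))
y = np.zeros(4_000, dtype=int)
y[:40] = 1
X[:40] += 2.0  # shift the minority class so it is learnable

# class_weight="balanced" applies the class-weighted training listed above.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Evaluate on probabilities with PR-AUC (average precision), not accuracy.
scores = clf.predict_proba(X)[:, 1]
pr_auc = average_precision_score(y, scores)
```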
| Model | PR-AUC | Threshold | Precision | Recall | False Positives |
|---|---|---|---|---|---|
| Logistic Regression | 0.716 | 0.70 | 0.12 | 0.91 | 644 |
| Random Forest | 0.854 | 0.35 | 0.94 | 0.81 | 5 |
| XGBoost | 0.861 | 0.50 | 0.67 | 0.86 | 41 |
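Because the models produce differently shaped probability distributions, each cutoff in the table was tuned separately. Applying a tuned threshold is itself trivial, as this sketch shows; note how the same scores yield different alert volumes at the table's Logistic Regression (0.70) and Random Forest (0.35) cutoffs.

```python
import numpy as np

def apply_threshold(probabilities: np.ndarray, threshold: float) -> np.ndarray:
    """Convert fraud probabilities into hard alerts at a model-specific cutoff."""
    return (probabilities >= threshold).astype(int)

probs = np.array([0.2, 0.4, 0.6, 0.8])
print(apply_threshold(probs, 0.70))  # [0 0 0 1]
print(apply_threshold(probs, 0.35))  # [0 1 1 1]
```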
Beyond statistical performance, models were evaluated using a business-oriented cost framework, where:
- False negatives represent missed fraud losses
- False positives represent operational and customer experience costs
A cost-sensitive threshold analysis demonstrated that XGBoost achieves the lowest expected financial loss while maintaining strong fraud recall and manageable alert volume.
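A cost-based threshold search can be sketched as scanning candidate cutoffs and keeping the one with the lowest expected loss. The cost values and synthetic scores below are illustrative assumptions, not the project's real figures; only the structure (missed fraud costing far more than a false alert) reflects the framework above.

```python
import numpy as np

def expected_cost(y_true, scores, threshold, cost_fn=200.0, cost_fp=5.0):
    """Expected loss at a cutoff; cost_fn/cost_fp are illustrative values."""
    preds = scores >= threshold
    fn = np.sum((y_true == 1) & ~preds)  # missed fraud
    fp = np.sum((y_true == 0) & preds)   # false alerts
    return fn * cost_fn + fp * cost_fp

# Synthetic scores where fraud tends to score higher than normal traffic.
rng = np.random.default_rng(7)
y = (rng.random(2_000) < 0.05).astype(int)
scores = 0.6 * y + 0.2 * rng.random(2_000)

# Scan a threshold grid and keep the cheapest cutoff.
grid = np.linspace(0.05, 0.95, 19)
best = min(grid, key=lambda t: expected_cost(y, scores, t))
```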
➡️ XGBoost was selected as the final production candidate based on expected business impact, not metric maximization alone.
Key takeaway: In real-world fraud detection systems, the optimal model is defined by business trade-offs rather than accuracy or recall in isolation.
The project includes a standalone, production-oriented inference module
(inference_07.py) that demonstrates how the trained fraud detection model
would be used in a real-world system.
- Inference logic is fully separated from training and evaluation code
- Trained model and preprocessing artifacts are loaded explicitly
- Feature schema and ordering are strictly enforced to match training-time inputs
- Missing features are handled defensively to ensure robust inference behavior
- Decision threshold is externalized via a model configuration file
- Business decision logic is decoupled from model code
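The schema enforcement and defensive missing-feature handling described above might look like the sketch below. The feature order matches the dataset's columns, but the exact helper name and the zero-fill default are assumptions for illustration, not the module's actual code.

```python
import pandas as pd

# Training-time schema: column order matters because the model was fit on it.
FEATURE_ORDER = ["Time", "Amount"] + [f"V{i}" for i in range(1, 29)]

def enforce_schema(payload: dict) -> pd.DataFrame:
    """Build a single-row frame in training order; absent features default to 0.0."""
    row = {name: float(payload.get(name, 0.0)) for name in FEATURE_ORDER}
    return pd.DataFrame([row], columns=FEATURE_ORDER)

# Missing features (Time, V2–V28 here) are filled rather than raising.
frame = enforce_schema({"Amount": 42.0, "V1": -1.3})
```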
The fraud decision threshold is not hard-coded.
Instead, it is loaded from an external configuration file (model_config.json)
that represents business risk tolerance and cost considerations.
This allows decision policies to be updated safely without modifying inference code.
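A minimal sketch of this pattern is shown below. The file name model_config.json comes from the project, but the config schema (a `decision_threshold` key) and the function names are assumptions; the point is that the decision policy is read from configuration while the decision itself stays a pure function.

```python
import json
from pathlib import Path

def load_threshold(config_path: str = "model_config.json",
                   default: float = 0.5) -> float:
    """Read the decision threshold from config; fall back to a default if absent."""
    path = Path(config_path)
    if not path.exists():
        return default
    config = json.loads(path.read_text())
    return float(config.get("decision_threshold", default))

def decide(fraud_probability: float, threshold: float) -> str:
    """Business decision as a pure function of score and policy threshold."""
    return "FRAUD_ALERT" if fraud_probability >= threshold else "APPROVE"
```

Because `decide` has no model or I/O dependencies, updating the threshold in model_config.json changes behavior without touching inference code.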
The inference module is intentionally implemented as pure Python functions, making it easy to wrap with an API layer (e.g., FastAPI or Flask) without changing core business logic.
This design reflects common production patterns used in deployed ML systems.
- Python
- scikit-learn
- XGBoost
- NumPy / Pandas
- Matplotlib
Mohamed Saad
