Temporal Fraud Detection in Bitcoin Transaction Networks

A machine learning research project investigating the impact of observation windows on fraud detection in the Elliptic Bitcoin transaction dataset using graph neural networks and traditional ML baselines.

Overview

This project addresses temporal fraud detection in Bitcoin transaction networks using the Elliptic dataset (13.7 GB). The core research investigates whether delaying node classification by observing nodes for K additional timesteps after their first appearance improves fraud detection accuracy.

Key Innovation: Rigorous temporal methodology with non-overlapping cohorts and per-node observation windows to prevent information leakage while exploring the trade-off between observation delay and classification accuracy.

Research Question

Does delaying node classification (waiting K timesteps after first appearance) improve fraud detection accuracy?

We investigate this across multiple model families:

Baseline Models: Logistic Regression, Random Forest, XGBoost (feature aggregation only)
MLP + Graph Features: Neural network with structural graph features (centrality, PageRank, etc.)
Static GCN: Graph convolutional network on static graph snapshots
Temporal GCN: EvolveGCN-style model with LSTM to capture temporal dynamics

Observation Windows Tested: K ∈ {1, 3, 5, 7} timesteps

Dataset

Elliptic Bitcoin Transaction Dataset

Size: 13.7 GB (tracked via Git LFS)
Nodes: ~203,769 Bitcoin wallets
Edges: Transaction relationships between wallets
Timesteps: 49 discrete time periods
Labels: Illicit (fraud, scams, ransomware) vs. Licit wallets
Features: 166 transaction features per wallet (reduced to 36 after correlation analysis)
Class Distribution: ~5-8% illicit (highly imbalanced)

Temporal Data Split

Split	Timesteps	Nodes	Illicit %
Train	5-26	104,704	6.4%
Validation	27-31	11,230	7.2%
Test	32-40	45,963	8.0%

Critical: Gaps built between splits to prevent temporal information leakage. Each node is evaluated at exactly t_first(v) + K where t_first(v) is the node's first appearance timestep.

Methodology

Core Protocol

Per-Node Observation Windows:

For each node v appearing at timestep t_first(v), classify using data from timesteps t_first(v) through t_first(v) + K
Ensures equal observation windows across all nodes
Prevents temporal leakage (no future information)
Enables fair comparison across different K values

Temporal Edge Weighting

Edges are weighted using exponential temporal decay:

S_ji(t) = Σ A_ji^(s) × exp(-λ(t-s))

where:

A_ji^(s): Binary edge indicator at timestep s
λ: Decay rate parameter (controls memory)
Temperature-softmax normalization for final edge weights

Per-Cohort Training (Temporal Models)

For temporal GNNs:

Reset model state at start of each cohort
Feed graph sequence: t, t+1, ..., t+K
Compute loss only on nodes in current cohort
Prevents information leakage across training examples

Feature Reduction

Original: 166 features → 36 features after removing highly correlated features (Pearson correlation > 0.95)
Added: 1 temporal age feature (normalized timestep of first appearance)
Graph features (MLP baseline): 7 structural features including PageRank, degree centrality, betweenness centrality

For complete methodology details, see METHODOLOGY.md.

Project Structure

graph_ml/
├── code_lib/                          # Core library modules
│   ├── temporal_node_classification_builder.py  # Main graph builder (1008 lines)
│   ├── temporal_edge_builder.py       # Edge construction with decay weighting
│   ├── temporal_graph_builder.py      # PyG Temporal conversion utilities
│   └── utils.py                       # Data loading helpers
│
├── elliptic_dataset/                  # Bitcoin transaction dataset (13.7 GB)
│   ├── wallets_features_until_t.csv  # Temporal features (no leakage)
│   ├── wallets_features.csv          # Wallet features
│   └── AddrTxAddr_edgelist_*.csv     # Transaction edges (8 parts)
│
├── notebooks/
│   ├── experiments/                   # Main experiment notebooks
│   │   ├── evolve_gcn.ipynb          # Temporal GCN experiments
│   │   ├── static_gcn.ipynb          # Static GCN experiments
│   │   ├── baselines.ipynb           # Traditional ML baselines
│   │   ├── graph_features_baseline.ipynb  # MLP + graph features
│   │   └── model_comparison_visualization.ipynb  # Results visualization
│   └── other/                        # Exploratory analysis
│
├── results/                          # Experimental results
│   ├── evolve_gcn_multi_seed/       # Temporal GCN (seeds: 42, 123, 456)
│   ├── static_gcn_multi_seed/       # Static GCN (seeds: 42, 123, 456)
│   ├── baselines/                   # Logistic Regression, RF, XGBoost
│   │   ├── logistic_regression/
│   │   ├── random_forest/
│   │   └── xgboost/
│   ├── graph_features_baseline/     # MLP + graph features
│   └── comparison_formatted.csv     # Unified results comparison
│
├── graph_cache/                      # Cached graph snapshots
├── tests/                           # Unit tests
└── [Documentation files]

Models & Experiments

1. Traditional ML Baselines

Models: Logistic Regression, Random Forest, XGBoost Features: 36 reduced transaction features Training: Per-K retraining for proper calibration

2. MLP + Graph Features

Architecture: Multi-layer perceptron Features: 36 reduced features + 7 graph structural features

Total/in/out degree
PageRank
Betweenness centrality
Degree ratio
Normalized degree centrality

3. Static GCN

Architecture: 2-layer Graph Convolutional Network Training: Multi-seed (42, 123, 456) Features: 36 reduced + 1 temporal age feature Graph: Static snapshot at evaluation time

4. Temporal GCN (EvolveGCN-style)

Architecture: 2-layer GCN + LSTM + classifier Training: Multi-seed (42, 123, 456), per-cohort with state reset Features: 36 reduced + 1 temporal age feature Graph: Temporal sequence with weighted edges

Results

Summary Statistics (Test Set)

All results reported as Mean ± Std across 3 random seeds (for GNN models).

Graph Neural Networks

Model	K	F1 Score	AUC	Precision	Recall
Temporal GCN	1	0.312 ± 0.065	0.753 ± 0.043	0.229 ± 0.050	0.502 ± 0.127
Temporal GCN	3	0.338 ± 0.042	0.732 ± 0.062	0.263 ± 0.021	0.500 ± 0.160
Temporal GCN	5	0.301 ± 0.019	0.679 ± 0.015	0.231 ± 0.017	0.435 ± 0.025
Temporal GCN	7	0.332 ± 0.034	0.782 ± 0.003	0.233 ± 0.029	0.580 ± 0.036
Static GCN	1	0.168 ± 0.156	0.509 ± 0.154	0.133 ± 0.153	0.476 ± 0.477
Static GCN	3	0.123 ± 0.054	0.603 ± 0.013	0.460 ± 0.354	0.259 ± 0.361
Static GCN	5	0.179 ± 0.027	0.548 ± 0.042	0.185 ± 0.094	0.444 ± 0.480
Static GCN	7	0.166 ± 0.059	0.590 ± 0.050	0.151 ± 0.053	0.345 ± 0.380

Traditional ML Baselines

Model	K	F1 Score	AUC	Precision	Recall
Logistic Regression	1	0.249	0.875	0.143	0.958
Logistic Regression	3	0.251	0.874	0.145	0.958
Logistic Regression	5	0.247	0.869	0.142	0.958
Logistic Regression	7	0.245	0.865	0.140	0.960
Random Forest	1	0.824	0.930	0.986	0.707
Random Forest	3	0.819	0.915	0.988	0.699
Random Forest	5	0.824	0.925	0.989	0.707
Random Forest	7	0.821	0.924	0.988	0.702
XGBoost	1	0.788	0.943	0.814	0.763
XGBoost	3	0.803	0.942	0.837	0.771
XGBoost	5	0.806	0.948	0.846	0.770
XGBoost	7	0.783	0.949	0.825	0.745

MLP + Graph Features

Model	K	F1 Score	AUC	Precision	Recall
MLP + Graph Features	1	0.233	0.712	0.133	0.909
MLP + Graph Features	3	0.234	0.685	0.134	0.908
MLP + Graph Features	5	0.234	0.686	0.134	0.909
MLP + Graph Features	7	0.233	0.691	0.134	0.909

Performance Visualization

Figure: Comprehensive comparison of graph-based model performance across different observation windows (K) showing F1 scores, AUC, precision, and recall metrics for all tested models.

Key Findings

Best Overall Performance: Random Forest and XGBoost significantly outperform GNN models (F1 ~0.80-0.82 vs. 0.30-0.34), suggesting transaction features are more informative than graph structure for this task
Temporal GCN vs. Static GCN: Temporal models consistently outperform static models, validating the importance of temporal dynamics
Observation Window Effects:
- Non-monotonic relationship with K
- Temporal GCN shows best performance at K=7 (AUC 0.782)
- XGBoost peaks at K=5 (F1 0.806)
- Random Forest relatively stable across K values
Class Imbalance Challenge: High recall but low precision in many models reflects the severe class imbalance (~5-8% illicit)
Model Stability: Multi-seed experiments reveal variable stability:
- Temporal GCN: relatively stable (std ~0.02-0.06 for F1)
- Static GCN: high variance (std up to 0.16 for F1)

Setup & Installation

Prerequisites

Python 3.11+
CUDA-capable GPU (recommended) or CPU/MPS support
Git LFS for large dataset files

1. Dataset Setup

# Install git-lfs
brew install git-lfs  # macOS
# or
apt-get update && apt-get install git-lfs  # Linux

# Initialize LFS
git lfs install

# Pull large files
git lfs pull

2. Environment Setup

# Create conda environment
conda env create -f env.yml

# Activate environment
conda activate graph_ml

3. Dependencies

Key packages:

PyTorch (CPU/CUDA/MPS)
PyTorch Geometric
PyTorch Geometric Temporal
scikit-learn
XGBoost
pandas, numpy, matplotlib
JupyterLab

Usage

Running Experiments

Temporal GCN: Open notebooks/experiments/evolve_gcn.ipynb
Static GCN: Open notebooks/experiments/static_gcn.ipynb
Baselines: Open notebooks/experiments/baselines.ipynb
MLP + Graph Features: Open notebooks/experiments/graph_features_baseline.ipynb
Visualize Results: Open notebooks/experiments/model_comparison_visualization.ipynb

Graph Building API

from code_lib.temporal_node_classification_builder import TemporalNodeClassificationGraphBuilder

# Initialize builder
builder = TemporalNodeClassificationGraphBuilder(
    observation_windows=[1, 3, 5, 7],
    use_cache=True
)

# Build graphs for static models
train_data, val_data, test_data = builder.prepare_observation_window_graphs(K=3)

# Build sequences for temporal models
temporal_graphs = builder.prepare_temporal_model_graphs(K=3)

Caching System

Graph snapshots are automatically cached to graph_cache/ for faster repeated experiments. Cache keys include all relevant configuration parameters to ensure consistency.

RunPod Cluster Setup

For collaborative development on GPU cluster:

Go to RunPod and select "Franek's Team" from the dropdown
Navigate to "Pods" tab
Select the "GraphML" storage volume
Click "Change Template" and select "graph_ml_updated"
Click "Deploy On-Demand"
Once deployed, click "jupyterlab" to access JupyterLab in browser

Important:

Do NOT delete the storage volume
Remember to terminate the pod after use to avoid idle costs
Git credentials are pre-configured in the network volume

Citation

If you use this code or methodology in your research, please cite the Elliptic dataset:

Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., & Leiserson, C. E. (2019).
Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics.
KDD '19 Workshop on Anomaly Detection in Finance.

License

This project is for academic research purposes. Please ensure proper attribution when using this code or methodology.

Contact

For questions or issues, please open an issue in the repository.

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
code_lib		code_lib
elliptic_dataset		elliptic_dataset
misc		misc
notebooks		notebooks
results		results
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
env.yml		env.yml

License

flatala/graph_ml

Folders and files

Latest commit

History

Repository files navigation

Temporal Fraud Detection in Bitcoin Transaction Networks

Table of Contents

Overview

Research Question

Dataset

Elliptic Bitcoin Transaction Dataset

Temporal Data Split

Methodology

Core Protocol

Temporal Edge Weighting

Per-Cohort Training (Temporal Models)

Feature Reduction

Project Structure

Models & Experiments

1. Traditional ML Baselines

2. MLP + Graph Features

3. Static GCN

4. Temporal GCN (EvolveGCN-style)

Results

Summary Statistics (Test Set)

Graph Neural Networks

Traditional ML Baselines

MLP + Graph Features

Performance Visualization

Key Findings

Setup & Installation

Prerequisites

1. Dataset Setup

2. Environment Setup

3. Dependencies

Usage

Running Experiments

Graph Building API

Caching System

RunPod Cluster Setup

Citation

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Uh oh!

Languages

Packages