A machine learning research project investigating the impact of observation windows on fraud detection in the Elliptic Bitcoin transaction dataset using graph neural networks and traditional ML baselines.
- Overview
- Research Question
- Dataset
- Methodology
- Project Structure
- Models & Experiments
- Results
- Setup & Installation
- Usage
- Documentation
This project addresses temporal fraud detection in Bitcoin transaction networks using the Elliptic dataset (13.7 GB). The core research investigates whether delaying node classification by observing nodes for K additional timesteps after their first appearance improves fraud detection accuracy.
Key Innovation: Rigorous temporal methodology with non-overlapping cohorts and per-node observation windows to prevent information leakage while exploring the trade-off between observation delay and classification accuracy.
Does delaying node classification (waiting K timesteps after first appearance) improve fraud detection accuracy?
We investigate this across multiple model families:
- Baseline Models: Logistic Regression, Random Forest, XGBoost (feature aggregation only)
- MLP + Graph Features: Neural network with structural graph features (centrality, PageRank, etc.)
- Static GCN: Graph convolutional network on static graph snapshots
- Temporal GCN: EvolveGCN-style model with LSTM to capture temporal dynamics
Observation Windows Tested: K ∈ {1, 3, 5, 7} timesteps
- Size: 13.7 GB (tracked via Git LFS)
- Nodes: ~203,769 Bitcoin wallets
- Edges: Transaction relationships between wallets
- Timesteps: 49 discrete time periods
- Labels: Illicit (fraud, scams, ransomware) vs. Licit wallets
- Features: 166 transaction features per wallet (reduced to 36 after correlation analysis)
- Class Distribution: ~5-8% illicit (highly imbalanced)
| Split | Timesteps | Nodes | Illicit % |
|---|---|---|---|
| Train | 5-26 | 104,704 | 6.4% |
| Validation | 27-31 | 11,230 | 7.2% |
| Test | 32-40 | 45,963 | 8.0% |
Critical: Gaps are built between splits to prevent temporal information leakage. Each node is evaluated at exactly t_first(v) + K, where t_first(v) is the timestep at which the node first appears.
Per-Node Observation Windows:
- For each node v appearing at timestep t_first(v), classify using data from timesteps t_first(v) through t_first(v) + K
- Ensures equal observation windows across all nodes
- Prevents temporal leakage (no future information)
- Enables fair comparison across different K values
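The per-node window can be sketched as a simple timestep filter. This is a simplified illustration, not the builder's actual logic: it assumes edges are `(src, dst, timestep)` tuples and only looks at edges incident to one node; both function names are hypothetical.

```python
def observation_window(t_first: int, k: int) -> range:
    """Timesteps visible for a node: t_first(v) through t_first(v) + K, inclusive."""
    return range(t_first, t_first + k + 1)


def visible_edges(edges, t_first: int, k: int):
    """Keep only edges whose timestep falls inside the node's window."""
    window = observation_window(t_first, k)
    return [(src, dst, t) for (src, dst, t) in edges if t in window]


# Hypothetical edges touching node "v", which first appears at t=10.
edges = [("v", "a", 10), ("b", "v", 12), ("v", "c", 13), ("d", "v", 14)]
kept = visible_edges(edges, t_first=10, k=3)
```

With K=3 the edge at timestep 14 is dropped, so the classifier never sees information from after the evaluation time t_first(v) + K.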
Edges are weighted using exponential temporal decay:
S_ji(t) = Σ A_ji^(s) × exp(-λ(t-s))
where:
- A_ji^(s): Binary edge indicator at timestep s
- λ: Decay rate parameter (controls memory)
- Temperature-softmax normalization for final edge weights
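A minimal sketch of the decay formula follows, under stated assumptions: the sum runs over timesteps s ≤ t at which the edge exists, the softmax is taken over the incoming edges of a single target node, and the temperature divides the raw scores. None of these details are confirmed by the text above, and the function names are hypothetical.

```python
import math


def decayed_score(edge_times, t, lam):
    """S_ji(t): sum of exp(-lam * (t - s)) over timesteps s <= t where the edge exists."""
    return sum(math.exp(-lam * (t - s)) for s in edge_times if s <= t)


def temperature_softmax(scores, temperature=1.0):
    """Normalize raw scores into positive weights that sum to 1."""
    scaled = [x / temperature for x in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical incoming edges of node i at t=5:
# neighbour j1 transacted at s=1 and s=5, neighbour j2 only at s=2.
raw = [decayed_score([1, 5], t=5, lam=0.5), decayed_score([2], t=5, lam=0.5)]
weights = temperature_softmax(raw, temperature=1.0)
```

A larger λ makes old edges fade faster, and a lower temperature concentrates more of the weight on the highest-scoring edge.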
For temporal GNNs:
- Reset model state at start of each cohort
- Feed graph sequence: t, t+1, ..., t+K
- Compute loss only on nodes in current cohort
- Prevents information leakage across training examples
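The cohort procedure above can be sketched without any deep learning framework. `ToyRecurrentModel` is a deliberately trivial stand-in for the GCN + LSTM model, and snapshots are plain dicts rather than graphs; only the control flow (reset, feed the sequence, score cohort nodes only) is meant to carry over.

```python
class ToyRecurrentModel:
    """Stand-in for a recurrent graph model: keeps a running state across snapshots."""

    def __init__(self):
        self.state = 0.0

    def reset_state(self):
        self.state = 0.0

    def step(self, snapshot):
        # A real model would run graph convolutions and a recurrent update here.
        self.state += sum(snapshot.values())
        return {node: self.state + value for node, value in snapshot.items()}


def cohort_loss(model, snapshots, cohort_nodes, labels):
    """Run one cohort: reset state, feed t..t+K, score only the cohort's nodes."""
    model.reset_state()              # nothing carries over from other cohorts
    outputs = {}
    for snapshot in snapshots:       # graph sequence t, t+1, ..., t+K
        outputs = model.step(snapshot)
    # Squared error on cohort nodes only; all other nodes are ignored.
    errors = [(outputs[n] - labels[n]) ** 2 for n in cohort_nodes]
    return sum(errors) / len(errors)


# Hypothetical two-step sequence; only node "v" belongs to the current cohort.
snapshots = [{"u": 1.0, "v": 2.0}, {"u": 0.5, "v": 0.5}]
model = ToyRecurrentModel()
first = cohort_loss(model, snapshots, cohort_nodes=["v"], labels={"v": 4.0})
second = cohort_loss(model, snapshots, cohort_nodes=["v"], labels={"v": 4.0})
```

Because the state is reset at the start of each call, running the same cohort twice yields the same loss; without the reset, the second run would start from leftover state.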
- Original: 166 features → 36 features after removing highly correlated features (Pearson correlation > 0.95)
- Added: 1 temporal age feature (normalized timestep of first appearance)
- Graph features (MLP baseline): 7 structural features including PageRank, degree centrality, betweenness centrality
For complete methodology details, see METHODOLOGY.md.
graph_ml/
├── code_lib/ # Core library modules
│ ├── temporal_node_classification_builder.py # Main graph builder (1008 lines)
│ ├── temporal_edge_builder.py # Edge construction with decay weighting
│ ├── temporal_graph_builder.py # PyG Temporal conversion utilities
│ └── utils.py # Data loading helpers
│
├── elliptic_dataset/ # Bitcoin transaction dataset (13.7 GB)
│ ├── wallets_features_until_t.csv # Temporal features (no leakage)
│ ├── wallets_features.csv # Wallet features
│ └── AddrTxAddr_edgelist_*.csv # Transaction edges (8 parts)
│
├── notebooks/
│ ├── experiments/ # Main experiment notebooks
│ │ ├── evolve_gcn.ipynb # Temporal GCN experiments
│ │ ├── static_gcn.ipynb # Static GCN experiments
│ │ ├── baselines.ipynb # Traditional ML baselines
│ │ ├── graph_features_baseline.ipynb # MLP + graph features
│ │ └── model_comparison_visualization.ipynb # Results visualization
│ └── other/ # Exploratory analysis
│
├── results/ # Experimental results
│ ├── evolve_gcn_multi_seed/ # Temporal GCN (seeds: 42, 123, 456)
│ ├── static_gcn_multi_seed/ # Static GCN (seeds: 42, 123, 456)
│ ├── baselines/ # Logistic Regression, RF, XGBoost
│ │ ├── logistic_regression/
│ │ ├── random_forest/
│ │ └── xgboost/
│ ├── graph_features_baseline/ # MLP + graph features
│ └── comparison_formatted.csv # Unified results comparison
│
├── graph_cache/ # Cached graph snapshots
├── tests/ # Unit tests
└── [Documentation files]
Baseline Models:
- Models: Logistic Regression, Random Forest, XGBoost
- Features: 36 reduced transaction features
- Training: Per-K retraining for proper calibration
MLP + Graph Features:
- Architecture: Multi-layer perceptron
- Features: 36 reduced features + 7 graph structural features:
  - Total/in/out degree
  - PageRank
  - Betweenness centrality
  - Degree ratio
  - Normalized degree centrality
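The degree-based features in this list can be computed directly from an edge list. The sketch below covers only in/out/total degree and normalized degree centrality (degree divided by n - 1, the usual definition). PageRank, betweenness centrality, and the degree ratio are omitted because their exact configuration is not given here. `degree_features` is a hypothetical helper.

```python
from collections import Counter


def degree_features(edges, nodes):
    """Per-node in/out/total degree and normalized degree centrality."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    n = len(nodes)
    features = {}
    for node in nodes:
        total = in_deg[node] + out_deg[node]
        features[node] = {
            "in_degree": in_deg[node],
            "out_degree": out_deg[node],
            "total_degree": total,
            # Standard normalization: degree / (n - 1).
            "degree_centrality": total / (n - 1) if n > 1 else 0.0,
        }
    return features


# Hypothetical directed edges between three wallets.
edges = [("a", "b"), ("a", "c"), ("b", "c")]
feats = degree_features(edges, nodes=["a", "b", "c"])
```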
Static GCN:
- Architecture: 2-layer Graph Convolutional Network
- Training: Multi-seed (42, 123, 456)
- Features: 36 reduced + 1 temporal age feature
- Graph: Static snapshot at evaluation time
Temporal GCN:
- Architecture: 2-layer GCN + LSTM + classifier
- Training: Multi-seed (42, 123, 456), per-cohort with state reset
- Features: 36 reduced + 1 temporal age feature
- Graph: Temporal sequence with weighted edges
All results reported as Mean ± Std across 3 random seeds (for GNN models).
| Model | K | F1 Score | AUC | Precision | Recall |
|---|---|---|---|---|---|
| Temporal GCN | 1 | 0.312 ± 0.065 | 0.753 ± 0.043 | 0.229 ± 0.050 | 0.502 ± 0.127 |
| Temporal GCN | 3 | 0.338 ± 0.042 | 0.732 ± 0.062 | 0.263 ± 0.021 | 0.500 ± 0.160 |
| Temporal GCN | 5 | 0.301 ± 0.019 | 0.679 ± 0.015 | 0.231 ± 0.017 | 0.435 ± 0.025 |
| Temporal GCN | 7 | 0.332 ± 0.034 | 0.782 ± 0.003 | 0.233 ± 0.029 | 0.580 ± 0.036 |
| Static GCN | 1 | 0.168 ± 0.156 | 0.509 ± 0.154 | 0.133 ± 0.153 | 0.476 ± 0.477 |
| Static GCN | 3 | 0.123 ± 0.054 | 0.603 ± 0.013 | 0.460 ± 0.354 | 0.259 ± 0.361 |
| Static GCN | 5 | 0.179 ± 0.027 | 0.548 ± 0.042 | 0.185 ± 0.094 | 0.444 ± 0.480 |
| Static GCN | 7 | 0.166 ± 0.059 | 0.590 ± 0.050 | 0.151 ± 0.053 | 0.345 ± 0.380 |
| Model | K | F1 Score | AUC | Precision | Recall |
|---|---|---|---|---|---|
| Logistic Regression | 1 | 0.249 | 0.875 | 0.143 | 0.958 |
| Logistic Regression | 3 | 0.251 | 0.874 | 0.145 | 0.958 |
| Logistic Regression | 5 | 0.247 | 0.869 | 0.142 | 0.958 |
| Logistic Regression | 7 | 0.245 | 0.865 | 0.140 | 0.960 |
| Random Forest | 1 | 0.824 | 0.930 | 0.986 | 0.707 |
| Random Forest | 3 | 0.819 | 0.915 | 0.988 | 0.699 |
| Random Forest | 5 | 0.824 | 0.925 | 0.989 | 0.707 |
| Random Forest | 7 | 0.821 | 0.924 | 0.988 | 0.702 |
| XGBoost | 1 | 0.788 | 0.943 | 0.814 | 0.763 |
| XGBoost | 3 | 0.803 | 0.942 | 0.837 | 0.771 |
| XGBoost | 5 | 0.806 | 0.948 | 0.846 | 0.770 |
| XGBoost | 7 | 0.783 | 0.949 | 0.825 | 0.745 |
| Model | K | F1 Score | AUC | Precision | Recall |
|---|---|---|---|---|---|
| MLP + Graph Features | 1 | 0.233 | 0.712 | 0.133 | 0.909 |
| MLP + Graph Features | 3 | 0.234 | 0.685 | 0.134 | 0.908 |
| MLP + Graph Features | 5 | 0.234 | 0.686 | 0.134 | 0.909 |
| MLP + Graph Features | 7 | 0.233 | 0.691 | 0.134 | 0.909 |
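As a quick sanity check on the tables, F1 is the harmonic mean of precision and recall, so single-run rows can be recomputed from their own columns. The sketch below does this for two K=1 rows of the baseline table. The GNN rows are means over seeds, so their mean F1 will generally not equal the F1 of the mean precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the K=1 rows of the baseline table.
rf_f1 = round(f1_score(0.986, 0.707), 3)  # Random Forest, reported F1 0.824
lr_f1 = round(f1_score(0.143, 0.958), 3)  # Logistic Regression, reported F1 0.249
```

Both values match the reported F1 to three decimals. Other rows may differ in the last digit because the tabulated precision and recall are themselves rounded.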
Figure: Comprehensive comparison of graph-based model performance across different observation windows (K) showing F1 scores, AUC, precision, and recall metrics for all tested models.
- Best Overall Performance: Random Forest and XGBoost substantially outperform the GNN models (F1 ~0.78-0.82 vs. 0.12-0.34), suggesting that transaction features are more informative than graph structure for this task
- Temporal GCN vs. Static GCN: Temporal GCN outperforms Static GCN on mean F1 and AUC at every K, suggesting that temporal dynamics matter
- Observation Window Effects:
  - Non-monotonic relationship with K
  - Temporal GCN reaches its best AUC at K=7 (0.782) and its best F1 at K=3 (0.338)
  - XGBoost peaks at K=5 (F1 0.806)
  - Random Forest relatively stable across K values (F1 0.819-0.824)
- Class Imbalance Challenge: High recall but low precision in several models (Logistic Regression, MLP + Graph Features) reflects the severe class imbalance (~5-8% illicit)
- Model Stability: Multi-seed experiments reveal variable stability:
  - Temporal GCN: relatively stable (std ~0.02-0.07 for F1)
  - Static GCN: high variance (std up to 0.16 for F1)
- Python 3.11+
- CUDA-capable GPU (recommended) or CPU/MPS support
- Git LFS for large dataset files
# Install git-lfs
brew install git-lfs # macOS
# or
apt-get update && apt-get install git-lfs # Linux
# Initialize LFS
git lfs install
# Pull large files
git lfs pull
# Create conda environment
conda env create -f env.yml
# Activate environment
conda activate graph_ml
Key packages:
- PyTorch (CPU/CUDA/MPS)
- PyTorch Geometric
- PyTorch Geometric Temporal
- scikit-learn
- XGBoost
- pandas, numpy, matplotlib
- JupyterLab
- Temporal GCN: Open notebooks/experiments/evolve_gcn.ipynb
- Static GCN: Open notebooks/experiments/static_gcn.ipynb
- Baselines: Open notebooks/experiments/baselines.ipynb
- MLP + Graph Features: Open notebooks/experiments/graph_features_baseline.ipynb
- Visualize Results: Open notebooks/experiments/model_comparison_visualization.ipynb
from code_lib.temporal_node_classification_builder import TemporalNodeClassificationGraphBuilder
# Initialize builder
builder = TemporalNodeClassificationGraphBuilder(
    observation_windows=[1, 3, 5, 7],
    use_cache=True
)
# Build graphs for static models
train_data, val_data, test_data = builder.prepare_observation_window_graphs(K=3)
# Build sequences for temporal models
temporal_graphs = builder.prepare_temporal_model_graphs(K=3)
Graph snapshots are automatically cached to graph_cache/ for faster repeated experiments. Cache keys include all relevant configuration parameters to ensure consistency.
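One way to build such a cache key is to hash a canonical serialization of the configuration. The sketch below shows the general idea only; the field names are hypothetical and the builder's actual key format may differ.

```python
import hashlib
import json


def cache_key(config: dict) -> str:
    """Derive a stable key from every configuration value that affects the graph."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical configuration fields; the real builder may track different ones.
key_a = cache_key({"K": 3, "decay_lambda": 0.5, "split": "train"})
key_b = cache_key({"split": "train", "decay_lambda": 0.5, "K": 3})
key_c = cache_key({"K": 5, "decay_lambda": 0.5, "split": "train"})
```

Identical settings map to the same key regardless of ordering, while changing any single value (here K) produces a different key, so a stale snapshot is not reused.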
For collaborative development on a GPU cluster:
- Go to RunPod and select "Franek's Team" from the dropdown
- Navigate to "Pods" tab
- Select the "GraphML" storage volume
- Click "Change Template" and select "graph_ml_updated"
- Click "Deploy On-Demand"
- Once deployed, click "jupyterlab" to access JupyterLab in browser
Important:
- Do NOT delete the storage volume
- Remember to terminate the pod after use to avoid idle costs
- Git credentials are pre-configured in the network volume
If you use this code or methodology in your research, please cite the Elliptic dataset:
Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., & Leiserson, C. E. (2019).
Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics.
KDD '19 Workshop on Anomaly Detection in Finance.
This project is for academic research purposes. Please ensure proper attribution when using this code or methodology.
For questions or issues, please open an issue in the repository.
