The main goal was to participate in a Kaggle competition and develop a model capable of predicting whether an IP address is associated with a VPN or Proxy service.
The task involved:
- Training and evaluating binary classification models on an anonymized dataset of reported attacks provided by CrowdSec.
- Participating in the Kaggle competition: Binary Classification of VPN Proxy IP Address.
- Using F1-Score as the main evaluation metric (both for local validation and Kaggle submissions).
The dataset was highly imbalanced (~5% positive class, i.e., VPN/Proxy), which required careful feature engineering and validation strategies.
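To illustrate why F1-Score is a more informative metric than accuracy at roughly 5% positives, here is a minimal sketch in plain Python. The labels are synthetic and the helper names (`f1_score_binary`, `accuracy`) are illustrative assumptions, not code from the project; in practice a library implementation such as scikit-learn's would likely be used instead.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic labels (assumption): 5 VPN/Proxy rows among 100, mirroring ~5% positives
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a classifier that never predicts VPN/Proxy

print(accuracy(y_true, always_negative))         # 0.95
print(f1_score_binary(y_true, always_negative))  # 0.0
```

A model that never flags an IP address reaches 95% accuracy here but an F1 of zero, which is why accuracy alone would hide a useless classifier on this kind of data.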
The work was organized into three main parts:
- Exploratory Data Analysis (EDA)
  - Created 6 visualizations (bar plot, heatmap, marginal distributions, etc.) to analyze the relationship between features and the target variable.
  - Identified correlations, imbalance issues, and relevant patterns in the dataset.
- Baseline
  - Implemented a simple perceptron as a benchmark model.
  - Applied categorical encodings and hyperparameter search.
  - Evaluated performance on validation and test sets to establish a reference point.
- Competitive Models
  - Trained two different models with advanced feature engineering, including:
    - Missing value imputation.
    - Mean encoding on selected features.
    - One-hot encoding on selected features.
    - Creation of at least 5 new features.
  - Performed hyperparameter tuning and reproducible validation.
  - Selected the best model based on F1-Score (validation + Kaggle results).
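The mean encoding step mentioned above can be sketched as follows. This is one possible variant, assuming out-of-fold computation with additive smoothing and a fixed seed for reproducibility; the source does not say which variant was used. The function name, its parameters, and the example data are hypothetical.

```python
import random
from collections import defaultdict


def out_of_fold_mean_encode(categories, targets, n_folds=5, smoothing=10.0, seed=42):
    """Replace each category with a smoothed target mean computed on other folds.

    Each row is encoded without seeing its own label, which limits target leakage.
    """
    n = len(categories)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    fold_of = [0] * n
    for position, idx in enumerate(indices):
        fold_of[idx] = position % n_folds

    encoded = [0.0] * n
    for fold in range(n_folds):
        # Collect per-category statistics from every row outside the current fold
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if fold_of[i] != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0  # out-of-fold global positive rate

        # Encode rows in the current fold, shrinking rare categories toward the prior
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                encoded[i] = (sums.get(c, 0.0) + smoothing * prior) / (counts.get(c, 0) + smoothing)
    return encoded


# Hypothetical categorical feature and binary target (1 = VPN/Proxy)
example_categories = ["hosting", "residential", "hosting", "mobile", "residential", "hosting"]
example_targets = [1, 0, 1, 0, 0, 0]
print(out_of_fold_mean_encode(example_categories, example_targets, n_folds=3))
```

The smoothing term pulls categories with few observations toward the global positive rate, which matters on an imbalanced target where a rare category could otherwise receive an extreme encoding from a handful of rows.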
- Baseline (Perceptron): F1-score ≈ X.XX (validation).
- Model 1: F1-score ≈ X.XX (validation), X.XX (Kaggle).
- Model 2: F1-score ≈ X.XX (validation), X.XX (Kaggle).
- Best model selected: Model N.