This project implements a dual-phase Network Intrusion Detection System (NIDS) using the NSL-KDD dataset, a benchmark for evaluating cybersecurity algorithms. It includes:
- A binary classification pipeline to distinguish between normal and attack traffic.
- A multiclass classification system to further classify attacks into high-level threat categories: DoS, Probe, Remote Access (R2L), and Privilege Escalation (U2R).
The goal is to build a comprehensive machine learning-based detection system that not only identifies threats but also categorizes them, helping to simulate a real-world SOC (Security Operations Center) use case.
With the exponential growth of digital connectivity, network intrusion has become a significant cybersecurity challenge. Most traditional systems either:
- Rely heavily on handcrafted rules (high false positive rate), or
- Fail to adapt across different attack types or traffic conditions.
This project addresses both problems by:
- Combining feature engineering, statistical preprocessing, and ensemble ML models.
- Providing both binary threat detection and attack-type classification, aligned with practical SOC needs.
- Goal: Classify traffic as attack vs. normal
- Pipeline:
  - Categorical encoding (Protocol, Service, Flag)
  - Weight encoding for service frequency
  - PCA + Percentile-based feature selection
  - Ensemble modeling with:
    - Logistic Regression
    - Random Forest
    - Gradient Boosting
    - XGBoost
    - LightGBM
    - SVM
  - Grid Search CV for hyperparameter tuning
  - Voting ensemble for performance aggregation
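The binary pipeline above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the step order (percentile selection before PCA), the percentile, the number of components, and the two-model soft-voting ensemble are all assumptions chosen to keep the example small. XGBoost, LightGBM, and the other listed models could be added to the `estimators` list in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few NSL-KDD columns (assumption: toy data only).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
    "service": rng.choice(["http", "ftp", "smtp", "private"], size=n),
    "flag": rng.choice(["SF", "S0", "REJ"], size=n),
    "src_bytes": rng.exponential(500.0, size=n),
    "dst_bytes": rng.exponential(300.0, size=n),
    "count": rng.integers(1, 200, size=n).astype(float),
})
# Toy labelling rule so the models have something to learn (1 = attack).
y = ((X["flag"] != "SF") | (X["count"] > 150)).astype(int).to_numpy()

categorical = ["protocol_type", "service", "flag"]
numeric = ["src_bytes", "dst_bytes", "count"]

# One-hot encode the categorical columns and scale the numeric ones.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ],
    sparse_threshold=0.0,  # always return a dense array for PCA
)

# Soft voting averages the predicted class probabilities of the members.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectPercentile(f_classif, percentile=80)),
    ("pca", PCA(n_components=5, random_state=0)),
    ("ensemble", ensemble),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"hold-out accuracy on synthetic data: {accuracy:.3f}")
```

Keeping every step inside one `Pipeline` means the encoders, selector, and PCA are fitted on the training split only, which avoids leaking test-set statistics into preprocessing.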
- Goal: Detect attack type: Normal | DoS | Probe | Remote Access (R2L) | Privilege Escalation (U2R)
- Additional Steps:
  - Class label restructuring based on NSL-KDD attack families
  - Outlier-aware transformation: log-scaling of highly skewed features
  - Weighted categorical encodings + one-hot encoding
  - Correlation-driven feature selection using Spearman rank
  - Detailed class-wise recall and confusion matrices
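Three of the additional steps above (label restructuring, log-scaling, and frequency-based weighting) can be sketched in plain Python. This is an illustrative sketch under assumptions: the mapping covers only the attack names found in the NSL-KDD training set, `to_family`, `log_scale`, and `frequency_weights` are hypothetical helper names, and "weighted encoding" is interpreted here as relative-frequency encoding, which may differ from the notebook's exact scheme.

```python
import math
from collections import Counter

# Attack names from the NSL-KDD training set grouped into families.
# Test-set-only attacks are not listed and fall back to `default`.
ATTACK_FAMILY = {
    "normal": "Normal",
    # Denial of service
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Probing / scanning
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # Remote access
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # Privilege escalation
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}


def to_family(label, default="Unknown"):
    """Map a raw attack label to its high-level family."""
    return ATTACK_FAMILY.get(label.strip().lower(), default)


def log_scale(values):
    """Compress heavy-tailed, non-negative counts with log(1 + x)."""
    return [math.log1p(v) for v in values]


def frequency_weights(categories):
    """Return each category's share of the observed rows."""
    counts = Counter(categories)
    total = len(categories)
    return {cat: c / total for cat, c in counts.items()}


labels = ["neptune", "normal", "satan", "rootkit", "guess_passwd"]
families = [to_family(label) for label in labels]

src_bytes = [0, 9, 99, 99999]          # made-up skewed values
scaled = log_scale(src_bytes)

services = ["http", "http", "ftp", "http"]
weights = frequency_weights(services)   # learned from training rows only
encoded = [weights[s] for s in services]

print(families)
print([round(v, 3) for v in scaled])
print(encoded)
```

`log1p` is used rather than `log` so that zero-valued counts, which are common in features such as byte counts, map to zero instead of raising an error.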
| Model | Binary Accuracy | Multiclass Accuracy |
|---|---|---|
| Logistic Regression | High | |
| Random Forest | High | Good overall |
| Gradient Boosting | High | Strong balance |
| XGBoost | Best overall | Best for DoS |
| LightGBM | Strong | |
| SVM | | |
| ANN (multiclass only) | N/A | Excellent on "Normal" |
Multiclass recall analysis showed U2R and R2L are the hardest to detect, consistent with real-world challenges.
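To see why per-class recall matters more than overall accuracy here, consider the small sketch below. The labels and predictions are made up for illustration and are not results from this project; `per_class_recall` is a hypothetical helper.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted rows / all rows of that class."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Made-up example: the rare classes are partly predicted as "Normal".
y_true = ["Normal", "Normal", "DoS", "DoS", "Probe", "R2L", "R2L", "U2R"]
y_pred = ["Normal", "Normal", "DoS", "DoS", "Probe", "Normal", "R2L", "Normal"]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
for cls, value in recall.items():
    print(f"{cls:>6}: recall {value:.2f}")
```

In this toy case the overall accuracy is 0.75 while recall for U2R is 0.0, which is the kind of gap a single accuracy figure hides.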
- Label Restructuring: Manual attack-to-category mapping for multiclass classification
- Log Transformation of Outliers: For skewed count-based features
- Weighted Encoding: For services based on their frequency
- PCA + Feature Selection: Reduce dimensionality while preserving signal
- Ensemble Modeling: Pipeline compatibility with both classifiers and meta-models
- Model Fit Diagnostics: Custom overfit/underfit detection logic
- Cross-Validation: Stratified K-Fold with performance logging
- GridSearchCV Tuning: Performed for every major model, including SVM, XGBoost, and LightGBM
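The tuning, cross-validation, and fit-diagnostic items above could fit together roughly as in this sketch. It is an assumption-laden illustration: `diagnose_fit` and its thresholds (`min_score`, `max_gap`) are hypothetical and not the project's actual diagnostic logic, the data comes from `make_classification`, and the grid covers a single Logistic Regression parameter to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def diagnose_fit(train_score, val_score, min_score=0.7, max_gap=0.05):
    """Label a model from its train/validation scores (illustrative thresholds)."""
    if train_score < min_score and val_score < min_score:
        return "underfit"   # poor even on the data it was trained on
    if train_score - val_score > max_gap:
        return "overfit"    # much better on training folds than validation folds
    return "ok"


# Synthetic binary problem standing in for the preprocessed features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="recall_macro",
    return_train_score=True,  # needed to compare train vs. validation scores
)
search.fit(X, y)

best = search.best_index_
train_score = search.cv_results_["mean_train_score"][best]
val_score = search.cv_results_["mean_test_score"][best]
verdict = diagnose_fit(train_score, val_score)

print(f"best params: {search.best_params_}")
print(f"train={train_score:.3f} validation={val_score:.3f} -> {verdict}")
```

Scoring with `recall_macro` weights every class equally, which suits intrusion data where the rare attack classes matter as much as the common ones.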
- Clone the repository

      git clone https://github.com/Adeleye-Emmanuel/Network-Intrusion-Detection
      cd Network-Intrusion-Detection

- Prepare your environment

      pip install -r requirements.txt

- Run either notebook
- Successfully built two distinct models for binary and multiclass NIDS
- Mastered data transformation, categorical encoding, and model pipelines
- Learned to evaluate not just accuracy, but recall per threat class
- Tuned multiple classifiers with custom pipelines and cross-validation