A complete and user-friendly implementation of a Network Intrusion Detection System (NIDS) built using deep learning, machine learning, and statistical feature analysis.
This project walks you through the full pipeline—from PCA-based preprocessing, to model training, to model deployment—with pre-trained models ready for immediate use.
Traditional rule-based intrusion detection systems often fail to recognize new or evolving cyber threats.
This project implements a modern deep-learning-driven NIDS pipeline that combines:
- PCA-driven dimensionality reduction
- Deep learning autoencoders for anomaly detection
- Machine learning classifiers (e.g., scikit-learn, XGBoost)
- Feature engineering and statistical selection methods
This repository is suitable for students, researchers, and engineers working on intelligent intrusion detection and cybersecurity analytics.
This project uses the Cleaned & Preprocessed CICIDS2017 Dataset, created by Eric Anacleto Ribeiro and published on Kaggle under a CC0: Public Domain license.
File Used:
cicids2017_cleaned.csv— A fully cleaned, merged, preprocessed version of CICIDS2017, containing raw but cleaned features ready for scaling and sampling after train/test split.
The original CICIDS2017 dataset captures both benign and malicious network traffic across multiple attack scenarios. While widely used, the raw files require heavy preprocessing for machine learning tasks.
The cleaned version used in this project includes the following improvements:
- Merged multiple CSV files into a unified dataset
- Removed duplicate rows and duplicate columns
- Replaced infinite values with NaN for consistency
- Removed rows with missing values (<1% of dataset)
- Removed columns with only one unique value
- Applied correlation analysis to remove highly correlated features (≥ 0.99)
- Combined Kruskal-Wallis H-test with Random Forest feature importance to remove statistically irrelevant features
- Converted original
Labelcolumn into a simplifiedAttack Typecolumn - Grouped similar attack types (e.g., DoS Hulk + DoS GoldenEye → "DoS")
- Removed extremely rare attack classes (e.g., Heartbleed, Infiltration) to avoid overfitting
This cleaned dataset forms the starting point of the PCA and model training workflow in this project.
You may download the dataset from the following locations.
🔹 Cleaned & Preprocessed CICIDS2017 Dataset (Used for PCA Analysis)
👉 Official Dataset Page (Kaggle): Dataset from kaggle
👉 Direct Download (google drive): Dataset from google drive
🔹 Fully processed CICIDS2017 Dataset (Used for model training)
👉 Direct Download (google drive): Dataset from google drive
This NIDS pipeline follows a structured, three-stage workflow:
- Performed on the Cleaned & Preprocessed CICIDS2017 dataset
- Reduces dimensionality
- Extracts high-value features
- Generates variance and correlation reports
- Uses PCA results + cleaned dataset
- Trains multiple models (Autoencoder, XGBoost, scikit-learn models)
- Saves models, encoders, scalers, thresholds, and metadata to
saved_models/
- Loads all pre-trained components
- Processes new samples
- Performs real-time intrusion detection
- Deep Learning Autoencoder for anomaly detection
- XGBoost + ML Pipelines for high-accuracy classification
- Complete PCA Workflow for dimensionality reduction
- Ready-to-Use Deployment Script
- Pretrained Models Included
- Feature importance analysis
- Interactive notebooks for exploration
Network-Intrusion-Detection-System/
│
├── .gradio/ # Gradio interface cache
│ └── certificate.pem # SSL certificate
│
│
├── img/ # images
│ ├── dimensionality_reduction.png
│ ├── feature_importance.png
│ ├── MAE.png
│ └── web_deployment_result.png
│
├── notebooks/ # Interactive analysis
│ ├── Network_Intrusion_Detection_System.ipynb
│ └── PCA_Analysis.ipynb
│
├── saved_models/ # ALL PRE-TRAINED MODELS
│ ├── trained_nids_model.pkl # Main scikit-learn model
│ ├── trained_nids_model_xgb.model # XGBoost model
│ ├── trained_nids_model_autoencoder.h5 # TensorFlow autoencoder
│ ├── trained_nids_model_scaler.pkl # Feature scaler
│ ├── trained_nids_model_label_encoder.pkl # Label encoder
│ ├── trained_nids_model_features.pkl # Feature list
│ ├── trained_nids_model_metadata.pkl # Training metadata
│ └── trained_nids_model_threshold.pkl # Anomaly threshold
│
├── feature_importance_results.csv # Feature analysis
├── LICENSE # MIT License
├── NIDS_Deploy.py # DEPLOYMENT SCRIPT
├── NIDS_Training.py # TRAINING SCRIPT
├── PCA_Analysis.py # PCA analysis script
├── trained_nids_simple.pkl # Lightweight model
├── README.md # This file
└── requirements.txt # Dependencies
- Click the Colab badge above
- Run the cell
- Results are displayed at the bottom
- Click the Colab badge above
- Select Run all
- Interact with the Gradio interface at the bottom
git clone https://github.com/ScriptedLines404/Network-Intrusion-Detection-System-using-Deep-Learning.git
cd Network-Intrusion-Detection-System-using-Deep-Learning
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
pip install -r requirements.txt
Perform PCA and feature analysis:
python PCA_Analysis.py
This step generates:
- PCA variance reports
- Feature correlation files
- leaned feature lists saved to saved_models/
Train all NIDS models:
python NIDS_Training.py
The script:
- Trains ML + DL models
- Saves all artifacts automatically
- Generates metadata for deployment
To run classification or anomaly detection:
python NIDS_Deploy.py
Loads all pre-trained models and performs inference on new network samples.
Located in:
saved_models/
Includes:
- Autoencoder
- XGBoost classifier
- Scikit-learn model
- Scaler, encoder, metadata, threshold, features
This enables instant model deployment without the need for retraining.
Dataset shape: (2,520,751 samples × 52 features)
Number of attack classes: 7
Target distribution:
• Normal Traffic: 2,095,057 samples (83.1%)
• DoS: 193,745 samples (7.7%)
• DDoS: 128,014 samples (5.1%)
• Port Scanning: 90,694 samples (3.6%)
• Brute Force: 9,150 samples (0.4%)
• Web Attacks: 2,143 samples (0.08%)
• Bots: 1,948 samples (0.08%)
Optimal number of components to explain 95.0% variance: 20
Total variance explained by first 20 components: 0.9508
| Rank | Feature | Importance Score |
|---|---|---|
| 1 | Min Packet Length | 0.154774 |
| 2 | PSH Flag Count | 0.150304 |
| 3 | Bwd Packet Length Min | 0.144673 |
| 4 | Subflow Fwd Bytes | 0.144076 |
| 5 | Total Length of Fwd Packets | 0.144001 |
| 6 | Bwd IAT Total | 0.132763 |
| 7 | Bwd Packet Length Mean | 0.129960 |
| 8 | Fwd Packet Length Mean | 0.126202 |
| 9 | Destination Port | 0.124723 |
| 10 | Active Max | 0.124153 |
| 11 | Fwd Packet Length Std | 0.124147 |
| 12 | Average Packet Size | 0.123569 |
| 13 | Fwd Packet Length Max | 0.122527 |
| 14 | Bwd Packet Length Std | 0.121820 |
| 15 | Packet Length Variance | 0.121274 |
| 16 | Active Mean | 0.120901 |
| 17 | Bwd Packet Length Max | 0.120738 |
| 18 | Packet Length Mean | 0.120551 |
| 19 | Packet Length Std | 0.119923 |
| 20 | ACK Flag Count | 0.119518 |
Classification accuracy using PCA components: 0.9985 (99.85%)
| Attack Type | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Bots | 0.83 | 0.65 | 0.73 | 389 |
| Brute Force | 1.00 | 1.00 | 1.00 | 1,830 |
| DDoS | 1.00 | 1.00 | 1.00 | 25,603 |
| DoS | 1.00 | 1.00 | 1.00 | 38,749 |
| Normal Traffic | 1.00 | 1.00 | 1.00 | 419,012 |
| Port Scanning | 0.99 | 0.99 | 0.99 | 18,139 |
| Web Attacks | 0.99 | 0.97 | 0.98 | 429 |
- Macro Average: 0.97 precision, 0.94 recall, 0.96 F1-score
- Weighted Average: 1.00 across all metrics
- Total Test Samples: 504,151
Accuracy: 0.9996 (99.96%)
Precision: 0.9996
Recall: 0.9996
F1-Score: 0.9996
Breakdown:
- Perfect detection of DDoS, DoS, and normal traffic (100% accuracy)
- High detection rates for port scanning and web attacks (>99%)
- Robust performance across all attack categories
Model Architecture:
- Input dimension: 20
- Encoding dimension: 10
- Total parameters: 7,518
- Training epochs: 30
Final Training Results:
- Validation loss: 0.0468
- Validation MAE: 0.0553
Anomaly Detection:
- Threshold: 0.1252 (95th percentile)
- Anomalies detected: 10,407 (4.96% of test set)
Reconstruction Error Statistics:
- Mean: 0.0360
- Std: 0.8902
- Max: 227.9218
- Time per epoch: 16–22 seconds
- Total training time: ~9 minutes
- Model size: 29.37 KB
- Efficient gradient boosting
- Fast convergence
- Lightweight model
- Real-time Processing: < 100ms inference time per sample
- Multi-model Ensemble: Combines XGBoost and Autoencoder
- Comprehensive Detection: 7 attack categories + anomalies
- User-friendly Interface: Web-based for easy access
Contributions are welcomed! 🙌
- 🍴 Fork the repository.
- 🌿 Create a new branch for your feature or bugfix.
- 🖊️ Write clear commit messages and include tests where possible.
- 📬 Submit a pull request with a detailed description.
Guidelines:
- 🧹 Follow Python best practices.
- 📚 Keep code clean and well-documented.
- 📝 Update relevant documentation when making changes.
This project is licensed under the MIT License. You are free to use, modify, and share this project with proper attribution.
Dataset License : The dataset used—Cleaned and Preprocessed CICIDS2017 Data for Machine Learning by Eric Anacleto Ribeiro—is under CC0: Public Domain.
-
Eric Anacleto Ribeiro, for preparing the cleaned CICIDS2017 dataset used in this project
-
The creators of the original CICIDS2017 dataset
-
Open-source developers and the cybersecurity research community
Hi, there!. I am Vladimir Illich Arunan, an engineering student with a deep passion for understanding the inner workings of the digital world. My goal is to master the systems that power modern technology—not just to build and innovate, but also to test their limits through cybersecurity.



