Machine Learning and Deep Learning Models on EMBER2018 & CIC-EvasivePDF2022
Malware detection remains a critical challenge as attackers constantly evolve tactics.
This project benchmarks traditional ML, deep learning, and ensemble methods for malware detection across two generations of attacks:
- EMBER2018: legacy malware (2006-2018)
- CIC-EvasivePDF2022: recent evasive PDF malware
The aim is to highlight strengths, weaknesses, and trade-offs between models, focusing on accuracy, adaptability, and robustness to imbalanced datasets.
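Because the comparison covers both accuracy and robustness to imbalance, it helps to see why plain accuracy can mislead on skewed data. The following is a minimal sketch in pure Python, using hypothetical toy labels (not data from either dataset), that contrasts accuracy with balanced accuracy (the mean of per-class recall). The function names are illustrative; `sklearn.metrics.balanced_accuracy_score` computes the same quantity.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 95 benign (0) and 5 malicious (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate detector that labels every sample as benign.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
# prints: accuracy=0.95, balanced accuracy=0.50
```

The detector misses every malicious sample yet still scores 95% accuracy, which is why an imbalance-aware metric is worth reporting alongside it.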
Datasets:
- EMBER2018 (structured malware features, legacy attacks)
- CIC-EvasivePDF2022 (modern evasive samples, PDFs)
Algorithms Tested:
- Traditional ML: Random Forest, XGBoost, AdaBoost, Logistic Regression, KNN
- Deep Learning: CNN, MLP, RNN-LSTM, Transformer
- Ensembles: Stacking, Voting classifiers (hybrid ML + DL)
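One common way to build a hybrid ML + DL voting ensemble is soft voting: each model outputs class probabilities, and the ensemble averages them before picking a label. The sketch below shows that idea in pure Python so it works with probabilities from any framework. It is an illustration under assumptions, not the project's actual implementation: the function names, the model names, and the probability values are all hypothetical. For scikit-learn-only ensembles, `sklearn.ensemble.VotingClassifier(voting="soft")` offers the same behaviour.

```python
def soft_vote(prob_sets, weights=None):
    """Average class-probability rows from several models (soft voting).

    prob_sets: one entry per model, each a list of per-sample probability rows.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_sets)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(prob_rows):
    """Pick the index of the highest averaged probability for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


# Hypothetical [P(benign), P(malicious)] outputs for three samples.
rf_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]   # e.g. a Random Forest
mlp_probs = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]  # e.g. an MLP
xgb_probs = [[0.8, 0.2], [0.3, 0.7], [0.3, 0.7]]  # e.g. XGBoost

ensemble_probs = soft_vote([rf_probs, mlp_probs, xgb_probs])
ensemble_labels = predict_labels(ensemble_probs)
print(ensemble_labels)  # prints: [0, 1, 1]
```

On the second sample the MLP disagrees with the two tree-based models, and the averaged probability follows the majority. Passing `weights` lets a stronger model count for more.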
Challenges Tackled:
- Class imbalance handling (resampling, weighting)
- Comparative evaluation of structured vs evasive malware detection
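The two imbalance strategies named above, weighting and resampling, can be sketched as follows. This is a minimal pure-Python illustration on hypothetical toy data, not the project's pipeline. The weighting formula `n_samples / (n_classes * class_count)` is the one scikit-learn uses for `class_weight="balanced"`; the helper names `balanced_class_weights` and `random_oversample` are made up for this example.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate minority samples at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Hypothetical toy data: 90 benign (0) and 10 malicious (1) samples.
labels = [0] * 90 + [1] * 10
samples = [[float(i)] for i in range(100)]

weights = balanced_class_weights(labels)
res_x, res_y = random_oversample(samples, labels)

print({c: round(w, 3) for c, w in weights.items()})  # prints: {0: 0.556, 1: 5.0}
print(Counter(res_y))  # prints: Counter({0: 90, 1: 90})
```

Resampling should be applied to the training split only, so that duplicated samples never leak into validation or test data.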
| Model / Method | EMBER2018 Accuracy | CIC-EvasivePDF2022 Accuracy |
|---|---|---|
| Random Forest | 99.6% | 99.3% |
| XGBoost | 99.7% | 99.1% |
| CNN | 78.4% | 98.1% |
| RNN-LSTM | 50.4% | 96.2% |
| Transformer | 76.7% | 97.4% |
| Voting Ensemble | 99.5% | 99.1% |
Key Insight:
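- ML methods excel on structured, legacy malware (EMBER): Random Forest and XGBoost reach 99.6% and 99.7%, while the DL models reach only 50.4% to 78.4%.
- DL models perform far better on evasive, complex malware (CIC-EvasivePDF), reaching 96.2% to 98.1%, though in this table they still trail the tree-based models (99.1% to 99.3%).
- The Voting Ensemble stays close to the best single model on both datasets (99.5% and 99.1%).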
Requirements:
- Python 3.9+
- Scikit-learn, PyTorch, XGBoost
- Pandas, NumPy, Matplotlib/Seaborn for analysis
Setup:
- Clone the repo and enter the project directory:
git clone https://github.com/MAvRK7/Bridging-Legacy-Modern-Threat-Detection.git && cd Bridging-Legacy-Modern-Threat-Detection
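Once the repository is cloned, a quick check can confirm that the Python version and the libraries listed in this README are available. The script below is a hypothetical helper, not a file shipped with the repository. It assumes the usual import names for each package (`sklearn` for Scikit-learn, `torch` for PyTorch) and only looks modules up without importing them.

```python
import sys
from importlib.util import find_spec

# Import names for the libraries listed in this README.
REQUIRED_MODULES = [
    "sklearn", "torch", "xgboost", "pandas", "numpy", "matplotlib", "seaborn",
]


def check_environment(min_python=(3, 9), modules=REQUIRED_MODULES):
    """Return (python_ok, missing): version check plus modules that cannot be found."""
    python_ok = sys.version_info >= min_python
    missing = [name for name in modules if find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Python 3.9+:", "ok" if python_ok else "too old")
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any module reported as missing can then be installed with the package manager of your choice.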