This project combines machine learning and Answer Set Programming (ASP) to provide interpretable explanations for malware classification, based on feature manipulations derived from XGBoost models trained on the EMBER dataset.
.
├── dataset/ # EMBER dataset directory (download required)
├── export/ # Output directory for generated samples and solutions
├── model/ # Directory for saved models
│
├── lib/
│ ├── asp/ # XGBoost to ASP conversion logic
│ ├── dataset/ # EMBER dataset preprocessor
│ └── utils/ # Utility functions
│
├── metrics/
│ ├── booster.py # Evaluates the trained booster (XGBoost model)
│ └── generation.py # Evaluates the quality of sample generation
│
├── config.py # Configuration file for directories and parameters
│
├── narrow_bounds.py # ASP-based narrow bound solver
├── narrow_bounds_plot.py # Visualization for narrow bounds
│
├── expand_bounds.py # ASP-based expanded bound solver
├── expand_bounds_plot.py # Visualization for expanded bounds
│
├── sample_generation.py # Generate sample with desired malware probability
├── rule_extraction.py # Train XGBoost, extract rules, and convert to ASP
│
├── LICENSE
├── requirements.txt
└── README.md
- Python 3.12
- Install dependencies:
pip install -r requirements.txt
pip install git+https://github.com/blkdmr/ember.git

Download the EMBER 2018 dataset from:
https://ember.elastic.co/ember_dataset_2018_2.tar.bz2
Place the archive in the dataset/ folder and extract it.
If you are using custom folders, update the paths in config.py.

python rule_extraction.py
- Trains an XGBoost model
- Dumps the model
- Extracts decision rules
- Converts them to an ASP program
The first time you run this script, it will initialize the EMBER dataset.
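
The "extract decision rules, convert to ASP" steps above can be illustrated with a minimal sketch. It assumes each tree is available in the JSON layout that XGBoost's `Booster.get_dump(dump_format="json")` produces, and it walks every root-to-leaf path to emit one rule per leaf. The function names and the `leaf`/`cond` predicates are hypothetical and chosen only for illustration; the project's actual conversion logic lives in `lib/asp/` and may differ. Numeric values are quoted as strings because ASP terms do not include floating-point numbers.

```python
# Hypothetical sketch; not the conversion code shipped in lib/asp/.

def tree_to_paths(node, conditions=()):
    """Yield (conditions, leaf_value) for every root-to-leaf path of one tree."""
    if "leaf" in node:
        yield list(conditions), node["leaf"]
        return
    children = {child["nodeid"]: child for child in node["children"]}
    feature, threshold = node["split"], node["split_condition"]
    # XGBoost follows the "yes" branch when feature < threshold.
    yield from tree_to_paths(
        children[node["yes"]], conditions + ((feature, "<", threshold),)
    )
    yield from tree_to_paths(
        children[node["no"]], conditions + ((feature, ">=", threshold),)
    )


def paths_to_asp(tree_id, tree):
    """Render each path as one ASP-style rule (predicate names are made up)."""
    rules = []
    for leaf_id, (conds, value) in enumerate(tree_to_paths(tree)):
        head = f'leaf({tree_id},{leaf_id},"{value}")'
        if not conds:
            # A tree consisting of a single leaf becomes a plain fact.
            rules.append(f"{head}.")
            continue
        body = ", ".join(f'cond("{f}","{op}","{t}")' for f, op, t in conds)
        rules.append(f"{head} :- {body}.")
    return rules


# Hand-written tree in the JSON dump layout, used only as example input.
example_tree = {
    "nodeid": 0, "split": "f0", "split_condition": 0.5, "yes": 1, "no": 2,
    "children": [
        {"nodeid": 1, "leaf": -0.4},
        {"nodeid": 2, "split": "f3", "split_condition": 2.0, "yes": 3, "no": 4,
         "children": [{"nodeid": 3, "leaf": 0.1}, {"nodeid": 4, "leaf": 0.7}]},
    ],
}

for rule in paths_to_asp(0, example_tree):
    print(rule)
```

Each printed rule ties one leaf value to the threshold conditions on the path that reaches it, which is the information a solver needs to reason about which leaves can be active together.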

python narrow_bounds.py
- Finds minimal feature combinations to generate a sample with a target malware probability
- Saves the solution in the export/ directory
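
This README does not describe how a solution is represented. One plausible reading of "bounds", offered here only as an assumption, is that the threshold conditions of the selected tree paths intersect into one interval per feature. The sketch below shows that intersection; `feature_bounds` and the condition tuples are hypothetical names, not part of the project's code.

```python
import math


def feature_bounds(conditions):
    """Intersect threshold conditions into one half-open interval [low, high) per feature."""
    bounds = {}
    for feature, op, threshold in conditions:
        low, high = bounds.get(feature, (-math.inf, math.inf))
        if op == "<":
            high = min(high, threshold)
        elif op == ">=":
            low = max(low, threshold)
        else:
            raise ValueError(f"unsupported operator: {op}")
        if low >= high:
            # No value can satisfy all conditions collected for this feature.
            raise ValueError(f"contradictory conditions on {feature}")
        bounds[feature] = (low, high)
    return bounds


# Example conditions as they might be collected from several tree paths.
conditions = [("f0", ">=", 0.5), ("f3", "<", 2.0), ("f0", "<", 4.0)]
print(feature_bounds(conditions))
```

Any sample whose feature values fall inside the resulting intervals follows the same paths, so it reaches the same leaves and receives the same model output.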
To visualize:
python narrow_bounds_plot.py

First, create a sample with a specific malware probability p:
python sample_generation.py

Then, expand bounds for a target probability q:
python expand_bounds.py
- Alters the sample to achieve the new malware probability q
- Saves the result in the export/ directory
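
To make the step from p to q concrete, the sketch below assumes the model was trained with a logistic objective, so that the predicted probability is the sigmoid of the summed leaf values (the margin). The README does not state the objective, so treat this as an assumption. Under it, moving a sample from probability p to probability q means changing the margin by `logit(q) - logit(p)`. The helper names are hypothetical.

```python
import math


def logit(p):
    """Margin (log-odds) that corresponds to probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(margin):
    """Probability that corresponds to a margin (sum of leaf values)."""
    return 1.0 / (1.0 + math.exp(-margin))


def required_margin_shift(p, q):
    """How much the summed leaf values must change to move from p to q."""
    return logit(q) - logit(p)


# Example: a sample currently scored at p = 0.2 should reach q = 0.9.
shift = required_margin_shift(0.2, 0.9)
print(f"margin shift needed: {shift:.3f}")
print(f"check: {sigmoid(logit(0.2) + shift):.3f}")
```

A positive shift means the altered sample has to land in leaves with larger values overall; a negative shift means the opposite.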
To visualize:
python expand_bounds_plot.py

python metrics/booster.py
- Evaluates the trained booster (XGBoost model)
- Outputs performance metrics and checks internal booster statistics

python metrics/generation.py
- Evaluates the quality of sample generation
- Outputs statistics related to malware probability manipulation and feature adjustments
This project is licensed under the MIT License. See LICENSE for details.