In [1] we further develop the work of [2] and [3], which provide a set of evaluation metrics for saliency methods. We extend this set to a comprehensive list of metrics that mimic common classification-evaluation metrics, based on a definition of correct and incorrect feature importance in images. In particular, the following points are addressed:
- We include saliency metrics that produce interesting results (e.g. specificity), but were overlooked in [3]
- In addition to the saliency methods discussed in [2], we also include SHAP [4] and B-cos [5]
- We show how such metrics can be evaluated using Krippendorff's $\alpha$ [6] and Spearman's $\rho$ [7][8] instead of taking them at face value (which is already a problem in XAI, as discussed in the paper)
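As a small illustration of these two reliability measures, the sketch below computes Spearman's $\rho$ between two hypothetical metric-induced rankings of saliency methods and Krippendorff's $\alpha$ with each metric treated as a rater; it assumes the `scipy` and `krippendorff` packages and made-up rankings, and is not code from this repository.

```python
# Minimal sketch (not repository code): inter-method and inter-rater reliability
# for two hypothetical rankings of six saliency methods.
import numpy as np
import krippendorff
from scipy.stats import spearmanr

ranking_metric_a = [1, 2, 3, 4, 5, 6]  # made-up ranking by metric A
ranking_metric_b = [2, 1, 3, 5, 4, 6]  # made-up ranking by metric B

# Spearman's rank correlation between the two rankings (inter-method reliability).
rho, p_value = spearmanr(ranking_metric_a, ranking_metric_b)
print(f"Spearman's rho: {rho:.2f} (p = {p_value:.3f})")

# Krippendorff's alpha, treating each metric as a rater over the same
# saliency methods (inter-rater reliability), at the ordinal level.
alpha = krippendorff.alpha(
    reliability_data=np.array([ranking_metric_a, ranking_metric_b]),
    level_of_measurement="ordinal",
)
print(f"Krippendorff's alpha: {alpha:.2f}")
```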
This repository contains the code and datasets that are needed to recreate the experiments conducted in our paper: Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations. As our paper builds on [2], this repository also builds on their implementation of the Focus-metric.
- KernelSHAP from Captum of Kokhlikyan et al. [15] as a representative of the SHAP-family.
- B-cos networks version 2 based on the work of Boehle et al. [5].
- Smoothgrad [9]; implementation based on the repository of Nam et al.
- Layer-wise Relevance Propagation (LRP) [10]; implementation based on the repository of Nakashima et al.
- GradCAM [11]; implementation based on the repository of Gildenblat et al.
- LIME [12]; implementation based on the repository of Ribeiro et al.
- GradCAM++ [13]; implementation based on the repository of Gildenblat et al.
- Integrated Gradients (IG) [14]; implementation based on Captum of Kokhlikyan et al. [15].
The first two saliency methods (KernelSHAP and B-cos) were added by us, the other six saliency methods have been adopted unchanged from the Focus-metric repository.
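For orientation, the sketch below shows how the two Captum-based methods can be called on a pretrained ResNet-50. The preprocessing, class index, and patch-based feature mask are assumptions for illustration; the repository wires these methods up through its own pipeline code instead.

```python
# Illustrative sketch of Captum attributions for a pretrained ResNet-50
# (torchvision >= 0.13); the repository's own pipelines differ in detail.
import torch
import torchvision.models as models
from captum.attr import KernelShap, IntegratedGradients

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed mosaic
target_class = 281                  # assumed ImageNet class index ("tabby")

# Group pixels into 16x16 patches so KernelSHAP perturbs regions, not pixels.
feature_mask = (
    torch.arange(14 * 14).reshape(14, 14)
    .repeat_interleave(16, dim=0)
    .repeat_interleave(16, dim=1)
    .reshape(1, 1, 224, 224)        # broadcast over the channel dimension
)
shap_attr = KernelShap(model).attribute(
    image, target=target_class, feature_mask=feature_mask, n_samples=50
)

# Integrated Gradients: path integral of gradients from a zero baseline.
ig_attr = IntegratedGradients(model).attribute(
    image, target=target_class, n_steps=50
)
print(shap_attr.shape, ig_attr.shape)  # both match the input shape
```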
This code runs under Python 3.10.4. The Python dependencies are defined in requirements.txt.
We provide two datasets (located in the folder `data`) that can be used to generate mosaics for the XAI evaluation (this is done by executing the script `xai_eval_script.py`; for detailed instructions, see How to run experiments). The authors of [2] provide mosaics in their repository, which can also be used with our code. To do so, download the mosaics and copy them to `data/mosaics/`. When executing `xai_eval_script.py`, the `--dataset` argument has to correspond to the mosaics (i.e. `--dataset ilsvrc2012` for `--mosaics ilsvrc2012_mosaics`) and the `--mosaics_per_class` argument has to be None.
The models used here are all trained on ImageNet and are not fine-tuned. Therefore, new datasets should consist of classes that are available in ImageNet. To create a new dataset that can be used with the code, perform the following steps:
- Create a subfolder in `data/datasets/` with the dataset name.
- Add a folder `data`, which contains the images that will be used for mosaic creation. Images should be named `ClassName-ImageNumber`, and the `ClassName` has to match the corresponding label in the ImageNet dataset, e.g. `tabby-1.jpg` or `tiger_cat-20.jpg` (a small naming-check sketch follows this list).
- Run the script `create_csv.py` in the subfolder `dataset_manager` to create the csv-file needed for mosaic creation, heatmap calculation and evaluation. Provide the dataset name (it has to match the folder the `data` subfolder is in) before running the script.
- Copy the file `imagenet_labels.csv` into the folder.
- Add the new dataset in `consts/consts.py` as a new class to `DatasetArgs` and `MosaicArgs`. The naming has to correspond to the dataset folder name. Then also add it in `consts/paths.py` to `DatasetPaths` and `MosaicPaths`.
- Add the new dataset in `dataset_manager/datasets.py` as a new dataset class and a new mosaic dataset class (guided by the already existing classes).
- Add the dataset in `explainability/pipelines.py` under `DATASETS`, guided by the already existing entries.
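To make the `ClassName-ImageNumber` convention from the list above concrete, here is a hypothetical check that filenames parse correctly and use known ImageNet labels; the dataset name and the csv layout (labels in the first column) are assumptions, not part of the repository.

```python
# Hypothetical sanity check for the ClassName-ImageNumber naming convention.
# Dataset name and imagenet_labels.csv layout (label in first column) are assumed.
import csv
import re
from pathlib import Path

dataset_dir = Path("data/datasets/my_new_dataset")
pattern = re.compile(r"^(?P<label>.+)-\d+\.(jpg|jpeg|png)$")

with open(dataset_dir / "imagenet_labels.csv", newline="") as f:
    known_labels = {row[0] for row in csv.reader(f) if row}

for image_path in sorted((dataset_dir / "data").iterdir()):
    match = pattern.match(image_path.name)
    if match is None:
        print(f"Bad filename: {image_path.name}")
    elif match.group("label") not in known_labels:
        print(f"Unknown ImageNet label: {match.group('label')}")
```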
Now the new dataset can be used to run experiments with different saliency methods.
- To generate a mosaic dataset and the corresponding heatmaps plus the classification metrics per mosaic, run the script `xai_eval_script.py` with the arguments
  - `--dataset`: dataset used for mosaic generation
  - `--mosaics`: name of the mosaic dataset
  - `--mosaics_per_class`: number of mosaics that will be created per class; if the argument is None, existing mosaics will be used, otherwise **existing mosaics will be overwritten**
  - `--architecture`: classification model to investigate
  - `--xai_method`: saliency method used for generating heatmaps

  e.g. `python xai_eval_script.py --dataset carsncats --mosaics carsncats_mosaic --mosaics_per_class 10 --architecture resnet50 --xai_method bcos`

  The mosaics will be stored in `data/mosaics/carsncats_mosaic`, the heatmaps will be saved to the folder `data/explainability/hash`, and the results for the classification metrics will be stored in the corresponding csv-file under `data/explainability/hash.csv`. To find the hash that relates to the experiment, check `data/explainability/hash_explainability.csv`. If the mosaics already exist without a corresponding dataset, simply use the script with a `--dataset` name that is consistent with the classes mentioned in the Dataset instructions.
- To obtain meaningful results about XAI performance with the saliency metrics, the models should be able to distinguish well between the different classes used in the mosaics. To test this, the script `model_eval.py` can be used. The corresponding accuracies (top-1 and top-5 accuracy for up to three target classes, depending on the investigated dataset) are saved in `evaluation/model_accs.csv`. A generic top-k accuracy sketch is shown after this list.
- There are a few different ways to visualize the results of the experiments. For a general inspection of the different heatmaps of a single input image, `sumgen_script.py` can be used; `evaluation/create_summaries.bat` provides a way to create all relevant summaries for the datasets used in our paper. Note the `--info` flag when using `sumgen_script.py`: without the flag, summaries are created as in the paper, whereas `--info` also shows all saliency metrics in the summaries to check whether new metrics work as expected. Created summaries are saved in `data/mosaics/summary`.
- To evaluate results over entire datasets, `compute_viz_alphas.py` can be used; `evaluation/compute_alphas.bat` provides some examples of its usage. This script generates the violin plots of the saliency-metric performance, the correlation plots for Spearman's rank correlation as the inter-method reliability (both saved in `evaluation/figures/`), and Krippendorff's $\alpha$ values as the inter-rater reliability (saved as a csv in `evaluation/alphas.csv`).
- We also evaluated Krippendorff's $\alpha$ between different datasets, between different metrics on the same dataset, and between the two different models (only briefly mentioned in the paper). The script `xai_ranking.py` ranks the saliency methods per architecture and dataset according to the mean and median value of each saliency metric (saved in `evaluation/rankings`). With these rankings, Krippendorff's $\alpha$ can be calculated for every architecture-dataset-metric combination (the other experiments can be created with a bit of tweaking in the script). The result is saved in `evaluation/alphas.csv`.
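As a rough sketch of that ranking step, the snippet below ranks saliency methods by the mean value of one metric using pandas; the column names and values are assumptions, and `xai_ranking.py` is the authoritative implementation.

```python
# Sketch only: rank saliency methods by the mean of a saliency metric.
# Column names and values are assumptions; see xai_ranking.py for the real layout.
import pandas as pd

results = pd.DataFrame({
    "xai_method": ["bcos", "bcos", "gradcam", "gradcam", "lime", "lime"],
    "precision":  [0.81, 0.79, 0.74, 0.70, 0.62, 0.66],
})

mean_scores = results.groupby("xai_method")["precision"].mean()
ranking = mean_scores.rank(ascending=False).astype(int).sort_values()
print(ranking)  # rank 1 = best-performing saliency method on this metric
```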
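The top-1/top-5 accuracies mentioned in the model-evaluation step above can be computed along the following generic lines; this is a plain PyTorch sketch, not the repository's `model_eval.py`.

```python
# Generic top-k accuracy for a batch of logits and labels (illustrative only).
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = logits.topk(k, dim=1).indices             # (batch, k)
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # (batch,)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)          # stand-in for model outputs
labels = torch.randint(0, 1000, (8,))  # stand-in for ground-truth classes
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```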
Please cite our paper when using this code.
@inproceedings{Fresz2024saliencyMetrics,
author = {Fresz, Benjamin and Lörcher, Lena and Huber, Marco},
title = {Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations},
year = {2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3630106.3658537},
doi = {10.1145/3630106.3658537},
booktitle = {The 2024 ACM Conference on Fairness, Accountability, and Transparency},
location = {Rio de Janeiro, Brazil},
series = {FAccT '24}
}
[1] Fresz, B., Lörcher, L., & Huber, M. (2024). Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Association for Computing Machinery, New York, NY, USA, 1–19.
[2] Arias-Duart, A., Parés, F., & García-Gasulla, D. (2021). Focus! Rating XAI Methods and Finding Biases with Mosaics. arXiv preprint arXiv:2109.15035.
[3] Arias-Duart, A., Mariotti, E., Garcia-Gasulla, D., & Alonso-Moral, J. M. (2023). A confusion matrix for evaluating feature attribution methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3708-3713).
[4] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
[5] Böhle, M., Fritz, M., Schiele, B. (2022). B-cos Networks: Alignment is all we need for interpretability. arXiv preprint arXiv:2205.10268.
[6] Krippendorff, K. (2004). Reliability in content analysis: some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433.
[7] Myers, J. L., Well, A. D., & Lorch Jr, R. F. (2013). Research design and statistical analysis. Routledge.
[8] Tomsett, R., Harborne, D., Chakraborty, S., Gurram, P., & Preece, A. (2020). Sanity checks for saliency metrics. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 04, pp. 6021-6029).
[9] Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
[10] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), e0130140.
[11] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (pp. 618-626).
[12] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144).
[13] Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March). Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV) (pp. 839-847). IEEE.
[14] Sundararajan, M., Taly, A., & Yan, Q. (2017, July). Axiomatic attribution for deep networks. In International Conference on Machine Learning (pp. 3319-3328). PMLR.
[15] Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., ... & Reblitz-Richardson, O. (2020). Captum: A unified and generic model interpretability library for pytorch. arXiv preprint arXiv:2009.07896.