An awesome and organized multidisciplinary reference list of evaluation measures and methods for explainable machine learning (XAI) algorithms and systems. If you need more details and descriptions, you can read the full paper or visit my page for more resources!
We reviewed XAI-related research to organize different XAI design goals and evaluation measures. This awesome-list presents our categorization of selected existing design and evaluation methods, organizing the literature along three perspectives: design goals, evaluation methods, and targeted users of the XAI system. We provide summarized, ready-to-use tables of evaluation methods and recommendations for different goals in XAI research.
Description and details in this paper: https://arxiv.org/pdf/1811.11839.pdf
```bibtex
@article{mohseni2018multidisciplinary,
  title={A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems},
  author={Mohseni, Sina and Zarei, Niloofar and Ragan, Eric D},
  journal={arXiv preprint arXiv:1811.11839},
  year={2018}
}
```
- Computational Measures
  - M1: Fidelity of Interpretability Method (see the code sketch after this list)
  - M2: Model Trustworthiness
- Human-grounded Measures
  - M3: Human-machine Task Performance
  - M4: User Mental Model
  - M5: User Trust and Reliance
  - M6: Explanation Usefulness and Satisfaction
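As a concrete illustration of the computational side, below is a minimal sketch of one way to quantify M1 (fidelity), assuming a LIME-style setup: fit an interpretable linear surrogate to a black-box model's output in the neighborhood of a single instance, then score how closely the surrogate tracks the model. The model choices, the perturbation scale, and the R² agreement score are illustrative assumptions, not the protocol of any single paper listed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Train a black-box model on synthetic data (stand-in for any opaque model).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Build a local neighborhood by perturbing one instance.
rng = np.random.default_rng(0)
x0 = X[0]
neighborhood = x0 + rng.normal(scale=0.5, size=(200, X.shape[1]))

# Fit an interpretable linear surrogate to the black box's predicted
# probabilities on that neighborhood (a LIME-style local approximation).
bb_probs = black_box.predict_proba(neighborhood)[:, 1]
surrogate = Ridge(alpha=1.0).fit(neighborhood, bb_probs)

# Fidelity of the explanation: how well the surrogate reproduces the
# black box locally, measured here as R^2 against the model's output.
fidelity = surrogate.score(neighborhood, bb_probs)
print(f"Local fidelity (R^2): {fidelity:.3f}")
```

Human-grounded measures (M3–M6), by contrast, are collected through user studies (interviews, questionnaires, task performance) and cannot be scripted this way.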
| Paper | Evaluation Method |
|---|---|
| A Human-Grounded Evaluation Benchmark for Local Explanations of Machine Learning | Human-grounded Baseline |
| A unified approach to interpreting model predictions | Human Judgment |
| Quantifying Interpretability and Trust in Machine Learning Systems | Human Judgment |
| Human attention in visual question answering: Do humans and deep networks look at the same regions? | Human-grounded Baseline |
| Visualizing and understanding convolutional networks | Debugging model and training |
| Towards Explanation of DNN-based Prediction with Guided Feature Inversion | Human-grounded Baseline |
| Explainable Deep Classification Models for Domain Generalization | Human Judgment |
| Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping | Human-grounded Baseline |
| Human-in-the-Loop Interpretability Prior | Human Judgment |
| Paper | Evaluation Method |
|---|---|
| Are explanations always important? A study of deployed, low-cost intelligent interactive systems | Interview and Self-report |
| Assessing demand for intelligibility in context-aware applications | Interview and Self-report |
| How should I explain? A comparison of different explanation types for recommender systems | Interview, Self-report, User Learning Duration |
| Why and why not explanations improve the intelligibility of context-aware intelligent systems | Interview and Self-report |
| Intellingo: An Intelligible Translation Environment | Likert-scale Questionnaire |
| Human Evaluation of Models Built for Interpretability | Likert-scale Questionnaire |
| Intellingo: An Intelligible Translation Environment | Engagement with Explanations |