The upcoming book "Interpretable AI: Interpreting, Explaining and Visualizing Deep Learning" will be published in Springer LNCS and is a follow-up to the NIPS'17 Workshop on Interpreting, Explaining and Visualizing Deep Learning. While machine learning models have reached impressively high predictive accuracy, they are often perceived as black boxes. In sensitive applications such as medical diagnosis or self-driving cars, the reliance of the model on the right features must be guaranteed. The book will be composed of technical parts covering methods of interpretability, interpretable ML architectures, how to evaluate these techniques, as well as a part on applications.

Table of Contents

Part 1: Towards AI Transparency
Wojciech Samek, Klaus-Robert Müller: Towards Explainable Artificial Intelligence
Over the last few years machine learning (ML) has become a key enabling technology for the sciences and industry. Especially through improvements in methodology, the availability of large databases and increased computational power, today's ML algorithms are able to achieve excellent performance (at times even exceeding the human level) on an increasing number of complex tasks. Deep learning models are at the forefront of this development. However, due to their nested nonlinear structure, these powerful models have been generally considered "black boxes", not providing any information about what exactly makes them arrive at their predictions. Since in many applications, e.g., in the medical domain, such lack of transparency may not be acceptable, the development of methods for visualizing, explaining and interpreting deep learning models has recently attracted increasing attention. This paper presents recent developments and applications in this field and makes a plea for more use of explainable learning algorithms in practice.
Adrian Weller: Challenges for Transparency
Transparency is often deemed critical to enable effective real-world deployment of intelligent systems. Yet the motivations for and benefits of different types of transparency can vary significantly depending on context, and objective measurement criteria are difficult to identify. We provide a brief survey, suggesting challenges and related concerns, particularly when agents have misaligned interests. We highlight and review settings where transparency may cause harm, discussing connections across privacy, multi-agent game theory, economics, fairness and trust.
Lars Kai Hansen, Laura Rieger: Interpretability in intelligent systems - a new concept?
We argue that the very active interpretability community in machine learning can learn from the 50-year history of explainable AI. We discuss the relevance of a set of explanation desiderata based on work on explainable expert systems, and of tools for comparing and quantifying the uncertainty of high-dimensional feature importance maps, developed in the neuroimaging community.
Part 2: Methods for Interpreting AI Systems
Anh Nguyen, Jason Yosinski, Jeff Clune: Understanding Neural Networks by Synthesizing Preferred Inputs: A survey
A neuroscience method for understanding the brain is to find and study the preferred stimuli that highly activate an individual cell or groups of cells. Unlike with the natural brain, it is possible to backpropagate through artificial neural networks to synthesize the preferred stimuli that cause a neuron to fire strongly. That method is called Activation Maximization (AM) or feature visualization via optimization. In this chapter, we (1) review existing AM techniques in the literature; (2) discuss a probabilistic interpretation for AM; and (3) review the applications of AM in debugging and explaining networks.
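As a rough illustration of the Activation Maximization idea described in this abstract, the sketch below runs gradient ascent on the input of a single toy neuron. Everything here is an assumption made for illustration: the tanh neuron, its weights, the step size, and the L2-ball projection (a stand-in for the priors and regularizers a real AM method would use). It is not taken from the chapter.

```python
import math

def neuron(x, w, b):
    """Activation of one hypothetical tanh neuron."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def neuron_grad(x, w, b):
    """Gradient of the activation with respect to the input x."""
    a = neuron(x, w, b)
    return [(1.0 - a * a) * wi for wi in w]

def activation_maximization(w, b, steps=200, lr=0.1, radius=1.0):
    """Gradient ascent on the input, projected onto an L2 ball.

    The norm constraint is an assumed, very simple regularizer that
    keeps the synthesized input bounded.
    """
    x = [0.0] * len(w)
    for _ in range(steps):
        g = neuron_grad(x, w, b)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm > radius:
            x = [xi * radius / norm for xi in x]
    return x

# Hypothetical neuron with unit-norm weights.
w = [0.6, -0.8]
b = 0.0
x_star = activation_maximization(w, b)
```

For this toy neuron the synthesized input ends up aligned with the weight vector, which is the bounded input that activates the neuron most strongly.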
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee: Interpretable Text-to-Image Synthesis with Hierarchical Semantic Layout Generation
Generating images from natural language descriptions has drawn a lot of attention in the research community, not only because of its practical usefulness but also as a way to understand how a model relates text to visual concepts by synthesizing them. Deep generative models have been successfully employed to address this task, which is formulated as a translation problem from text to image. However, learning a direct mapping from text to image is not only challenging due to the complexity of the mapping, but also makes it difficult to understand the underlying generation process. To address these issues, we propose a novel hierarchical approach for text-to-image synthesis by inferring semantic layout. Our algorithm decomposes the generation process into multiple steps, in which it first constructs a semantic layout from the text with the layout generator and then converts the layout to an image with the image generator. The proposed layout generator progressively constructs a semantic layout in a coarse-to-fine manner by generating object bounding boxes and refining each box by estimating object shapes inside the box. The image generator synthesizes an image conditioned on the inferred semantic layout, which provides a useful semantic structure of an image matching the text description. By conditioning the generation on the inferred semantic layout, our model not only generates semantically more meaningful images, but also provides interpretable representations that allow users to interactively control the generation process by modifying the layout. We demonstrate the capability of the proposed model on the challenging MS-COCO dataset and show that the model can substantially improve the image quality, interpretability of output and semantic alignment to input text over existing approaches.
Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, Masashi Sugiyama: Unsupervised Discrete Representation Learning
Learning discrete representations of data is a central machine learning task because of the compactness of the representations and ease of interpretation. The task includes clustering and hash learning as special cases. Deep neural networks are promising to be used because they can model the non-linearity of data and scale to large datasets. However, their model complexity is huge, and therefore, we need to carefully regularize the networks in order to learn useful representations that exhibit intended invariance for applications of interest. To this end, we propose a method called Information Maximizing Self-Augmented Training (IMSAT). In IMSAT, we use data augmentation to impose the invariance on discrete representations. More specifically, we encourage the predicted representations of augmented data points to be close to those of the original data points in an end-to-end fashion. At the same time, we maximize the information-theoretic dependency between data and their predicted discrete representations. Extensive experiments on benchmark datasets show that IMSAT produces state-of-the-art results for both clustering and unsupervised hash learning.
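The two ingredients named in this abstract can be sketched numerically. The snippet below is an illustrative assumption, not the authors' implementation: it estimates the information-theoretic dependency as the mutual information I(X;Y) = H(Y) - H(Y|X) from predicted cluster probabilities, and uses a KL divergence as one possible (assumed) measure of how close the prediction for an augmented point is to that for the original point. All function names and example values are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def mutual_information(probs):
    """Estimate I(X;Y) = H(Y) - H(Y|X) from per-sample predictions.

    probs[i][j] is the predicted probability of cluster j for sample i.
    """
    n, k = len(probs), len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    conditional = sum(entropy(row) for row in probs) / n
    return entropy(marginal) - conditional

def kl_divergence(p, q):
    """KL(p || q); here an assumed closeness measure between the
    prediction for an original point (p) and its augmentation (q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Confident, balanced assignments carry maximal information ...
confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
# ... whereas identical uniform predictions carry none.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

mi_high = mutual_information(confident_balanced)
mi_zero = mutual_information(uninformative)
```

A training objective in this spirit would reward high mutual information while penalizing the divergence between predictions for original and augmented points; how the two terms are weighted is left open here.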
Seong Joon Oh, Bernt Schiele, Mario Fritz: Towards Reverse-Engineering Black-Box Neural Networks
Much progress in interpretable AI is built around scenarios where the user, the one who interprets the model, has full ownership of the model to be diagnosed. The user either owns the training data and computing resources to train an interpretable model herself or has full access to an already trained model to be interpreted post-hoc. In this chapter, we consider a less investigated scenario of diagnosing black-box neural networks, where the user can only send queries and read off outputs. Black-box access is a common deployment mode for many public and commercial models, since internal details, such as architecture, optimisation procedure, and training data, can be proprietary and aggravate their vulnerability to attacks like adversarial examples. We propose a method for exposing internals of black-box models and show that the method is surprisingly effective at inferring a diverse set of internal information. We further show how the exposed internals can be exploited to strengthen adversarial examples against the model. Our work starts an important discussion on the security implications of diagnosing deployed models with limited accessibility. The code is available at
Part 3: Explaining Decisions of AI Systems
Ruth Fong and Andrea Vedaldi: Explanations for Attributing Deep Neural Network Predictions
Given the recent success of deep neural networks and their application to high-impact, high-risk domains such as autonomous driving and healthcare decision-making, there is a great need for faithful and interpretable explanations of "why" an algorithm is making a certain prediction. In this chapter, we introduce (1) Meta-Predictors as Explanations, a principled framework for learning explanations for any black box algorithm, and (2) Meaningful Perturbations, an instantiation of our paradigm applied to the problem of attribution, which is concerned with attributing what features of an input (e.g., regions of an input image) are responsible for a specific algorithmic output (e.g., a CNN classifier's object class prediction). We also briefly survey existing visual attribution methods and highlight how they fail to be both 'faithful' and 'interpretable'.
Marco Ancona, Enea Ceolini, Cengiz Öztireli, Markus Gross: Gradient-based Attribution Methods
The problem of explaining complex machine learning models, including Deep Neural Networks, has been gaining increasing attention over the last few years. While several methods have been proposed to explain network predictions, the very definition of an explanation is often unclear. Moreover, only a few attempts have been made to compare explanation methods from a theoretical perspective. In this chapter, we discuss theoretical properties of several attribution methods and show how they share the same idea of using the gradient information as a descriptive factor for the functioning of a model. By reformulating two of these methods, we construct a unified framework which enables a direct comparison, as well as an easier implementation.
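To make the shared idea of "gradient information as a descriptive factor" concrete, here is a minimal sketch of two simple gradient-based attributions: the plain gradient and gradient times input. The toy model, the input values, and the use of central finite differences in place of backpropagation are all assumptions chosen to keep the example dependency-free. The sketch does not reproduce the chapter's unified framework.

```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def gradient_times_input(f, x):
    """Attribution obtained by scaling each gradient entry by its input."""
    return [g * xi for g, xi in zip(numerical_gradient(f, x), x)]

def model(x):
    """Hypothetical toy model: one ReLU unit followed by a scaling."""
    return 2.0 * max(0.0, 1.0 * x[0] - 2.0 * x[1] + 0.5)

x = [3.0, 1.0]
saliency = numerical_gradient(model, x)       # sensitivity of the output
attribution = gradient_times_input(model, x)  # per-feature contribution
```

In this example the gradient reports how sensitive the output is to each feature, while gradient times input distributes the bias-free part of the output across the features.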
Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, Wojciech Samek, Klaus-Robert Müller: Layer-wise Relevance Propagation: An Overview
In order to trust a machine learning model, it is important to be able to understand it. Simple methods for explaining machine learning predictions exist, but they often do not scale well with model complexity. Layer-wise Relevance Propagation (LRP) is a technique to explain the predictions of highly nonlinear deep neural networks. LRP backpropagates the neural network prediction using a set of carefully engineered propagation rules, and produces human-interpretable explanations in terms of input features. In this chapter, we show how LRP can be easily and efficiently implemented, how the LRP propagation rules can be justified as a 'deep Taylor decomposition' of the quantity to explain, how to choose the LRP propagation rules at each layer to deliver high explanation quality, and how the LRP technique can be extended to handle a variety of machine learning scenarios beyond deep neural networks.
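The backward redistribution described in this abstract can be illustrated on a tiny network. The sketch below applies an epsilon-stabilized propagation rule, one commonly used LRP rule, layer by layer. The two-layer bias-free ReLU network, its weights, and the helper names are hypothetical, and the code is a simplified illustration and not the chapter's reference implementation.

```python
def linear(a, W):
    """Pre-activations z_k = sum_j a_j * W[j][k]."""
    return [sum(a[j] * W[j][k] for j in range(len(a)))
            for k in range(len(W[0]))]

def relu(z):
    return [max(0.0, zk) for zk in z]

def lrp_layer(a, W, R, eps=1e-9):
    """Redistribute the relevance R of a layer's outputs onto its inputs a."""
    z = linear(a, W)
    # Stabilize the denominator away from zero, keeping its sign.
    s = [R[k] / (z[k] + (eps if z[k] >= 0.0 else -eps))
         for k in range(len(z))]
    c = [sum(W[j][k] * s[k] for k in range(len(s))) for j in range(len(a))]
    return [a[j] * c[j] for j in range(len(a))]

# Hypothetical network: 2 inputs -> 2 ReLU hidden units -> 1 linear output.
W1 = [[1.0, -1.0], [2.0, 1.0]]
W2 = [[1.0], [2.0]]

x = [1.0, 2.0]
hidden = relu(linear(x, W1))
output = linear(hidden, W2)

# Start from the prediction itself and propagate backwards.
R_hidden = lrp_layer(hidden, W2, output)
R_input = lrp_layer(x, W1, R_hidden)
```

Because this toy network has no biases, the relevance scores at every layer sum (up to the stabilizer) to the network output, which illustrates the conservation behaviour such propagation rules aim for.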
Leila Arras, Jose Arjona-Medina, Michael Gillhofer, Michael Widrich, Grégoire Montavon, Klaus-Robert Müller, Sepp Hochreiter, Wojciech Samek: Interpreting and Explaining LSTMs with LRP
While neural networks have acted as a strong unifying force in the design of modern AI systems, the neural network architectures themselves remain highly heterogeneous due to the variety of tasks to be solved. In this chapter, we explore how to adapt the LRP technique used for explaining the predictions of feed-forward networks to the LSTM architecture used for sequential data modeling and forecasting. The special accumulators and gated interactions present in the LSTM require both a new propagation scheme and an extension of the underlying theoretical framework to deliver faithful explanations.
Part 4: Evaluating Interpretability and Explanations
Bolei Zhou, David Bau, Aude Oliva, Antonio Torralba: Comparing the Interpretability of Deep Networks via Network Dissection
In this chapter, we introduce Network Dissection, a general framework to quantify the interpretability of the units inside deep convolutional neural networks (CNNs). We compare the different vocabularies of interpretable units that emerge as concept detectors from networks trained to solve different supervised learning tasks such as object recognition on ImageNet and scene classification on Places. Network Dissection is further applied to analyze how the units acting as semantic detectors grow and evolve over the training iterations, both when training from scratch and when fine-tuning between data sources. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their deep structures.
Grégoire Montavon: Gradient-Based vs. Propagation-Based Explanations: An Axiomatic Comparison
Deep neural networks, once considered to be inscrutable black-boxes, are now supplemented with techniques that can explain how these models decide. This raises the question whether the produced explanations are reliable. In this chapter, we consider two popular explanation techniques, one based on gradient computation and one based on a propagation mechanism. We evaluate them using three "axiomatic" properties: conservation, continuity, and implementation invariance. These properties are tested on the overall explanation, but also at intermediate layers, where our analysis brings further insights on how the explanation is being formed. We then introduce a neuron-level interpolation between the two explanation techniques, and demonstrate that axiomatic properties are becoming better fulfilled, thereby leading to more reliable explanations.
Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, Been Kim: The (Un)reliability of Saliency Methods
Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step (adding a constant shift to the input data) to show that a transformation with no effect on the model can cause numerous methods to produce incorrect attributions. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution. The approach can be seen as a type of unit test; we construct a narrow ground truth to measure one stated desirable property. As such, we hope the community will embrace the development of additional tests.
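The constant-shift test described in this abstract can be reproduced on a toy linear model. In the sketch below, the weights, inputs, and shift are hypothetical values chosen for illustration. The second model sees shifted inputs and absorbs the shift in its bias, so both models produce the same output. The plain gradient is then identical for both, whereas gradient times input is not.

```python
def make_linear(w, b):
    """Return a linear model f(x) = w . x + b."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

w = [1.0, -2.0]
b = 0.5
x = [3.0, 1.0]
shift = [10.0, 10.0]

# Model 1 sees the raw input; model 2 sees the shifted input and
# compensates for the shift in its bias.
model_1 = make_linear(w, b)
b_shifted = b - sum(wi * ci for wi, ci in zip(w, shift))
model_2 = make_linear(w, b_shifted)
x_shifted = [xi + ci for xi, ci in zip(x, shift)]

# For a linear model the gradient w.r.t. the input is the weight vector.
gradient_1 = list(w)
gradient_2 = list(w)

# Gradient times input depends on the (shifted) input values.
grad_input_1 = [wi * xi for wi, xi in zip(w, x)]
grad_input_2 = [wi * xi for wi, xi in zip(w, x_shifted)]
```

Here the two models are functionally equivalent, so an attribution method that assigns them different explanations fails this particular check.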
Part 5: Applications
Computer Vision
Markus Hofmarcher, Thomas Unterthiner, Jose Arjona-Medina, Günter Klambauer, Sepp Hochreiter, Bernhard Nessler: Visual Scene Understanding for Autonomous Driving using Semantic Segmentation
Deep neural networks are an increasingly important technique for autonomous driving, especially as a visual perception component. Acting in a real environment necessitates the explainability and inspectability of the algorithms controlling the vehicle. Such insightful explanations are relevant not only for legal issues and insurance matters but also for engineers and developers in order to achieve provable functional quality guarantees. This applies to all scenarios where the results of deep networks control potentially life threatening machines. We suggest the use of a tiered approach, whose main component is a semantic segmentation model, over an end-to-end approach for an autonomous driving system. In order for a system to provide meaningful explanations for its decisions it is necessar
Christopher J. Anders, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller: Understanding Patch-Based Learning of Video Data by Explaining Predictions
Deep neural networks have been shown to learn highly predictive models of video data. Due to the large number of images in individual videos, a common strategy for training is to repeatedly extract short clips with random offsets from the video. We apply the deep Taylor / Layer-wise Relevance Propagation (LRP) technique to understand classification decisions of a deep network trained with this strategy, and identify a tendency of the classifier to look mainly at the frames close to the temporal boundaries of its input clip. This "border effect" reveals the model's relation to the step size used to extract consecutive video frames for its input, which we can then tune in order to improve the classifier's accuracy without retraining the model. To our knowledge, this is the first work to apply the deep Taylor / LRP technique on any neural network operating on video data.
Physics, Chemistry, and Ecology
Kristof T. Schütt, Michael Gastegger, Alexandre Tkatchenko, Klaus-Robert Müller: Quantum-chemical insights from interpretable atomistic neural networks
With the rise of deep neural networks for quantum chemistry applications, there is a pressing need for architectures that, beyond delivering accurate predictions of chemical properties, are readily interpretable by researchers. Here, we describe interpretation techniques for atomistic neural networks on the example of Behler-Parrinello networks as well as the end-to-end model SchNet. Both models obtain predictions of chemical properties by aggregating atom-wise contributions. These latent variables can serve as local explanations of a prediction and are obtained during training without additional cost. Due to their correspondence to well-known chemical concepts such as atomic energies and partial charges, these atom-wise explanations enable insights not only about the model but more impor
Kristina Preuer, Günter Klambauer, Friedrich Rippmann, Sepp Hochreiter, Thomas Unterthiner: Interpretable Machine Learning in Drug Development
Without any means of interpretation, neural networks that predict molecular properties and bioactivities are merely black boxes. We will unravel these black boxes and will demonstrate approaches to understand the learned representations which are hidden inside these models. We show how single neurons can be interpreted as classifiers which determine the presence or absence of pharmacophore-like structures, thereby generating new insights and relevant knowledge for both pharmacology and biochemistry. We further discuss how these novel toxicophores can be determined from the network by identifying the most relevant components of a compound for the prediction of the network. Additionally, we propose a method which can be used to extract new pharmacophores from a model and will show that these extracted structures are consistent with literature findings. We envision that having access to such interpretable knowledge is a crucial aid in the development and design of new pharmaceutically
Frederik Kratzert, Mathew Herrnegger, Daniel Klotz, Sepp Hochreiter, Günter Klambauer: NeuralHydrology - Interpreting LSTMs in Hydrology
Despite the huge success of Long Short-Term Memory networks, their applications in environmental sciences are scarce. We argue that one reason is the difficulty of interpreting the internals of trained networks. In this study, we look at the application of LSTMs for rainfall-runoff forecasting, one of the central tasks in the field of hydrology, in which the river discharge has to be predicted from meteorological observations. LSTMs are particularly well-suited for this problem since memory cells can represent dynamic reservoirs and storages, which are essential components in state-space modelling approaches of the hydrological system. On the basis of two different catchments, one with snow influence and one without, we demonstrate how the trained model can be analyzed and interpreted....
Brain & Behavior
Pamela K. Douglas, Ariana Anderson: Feature Fallacy: Complications with Interpreting Weights in Functional MRI Decoding
Decoding and encoding models are popular multivariate approaches used to study representations in functional neuroimaging data. Encoding approaches seek to predict brain activation patterns using aspects of the stimuli as features. Decoding models, in contrast, utilize measured brain responses as features to make predictions about experimental manipulations. Both approaches typically include linear classification components. Ideally, decoding and encoding models could be used for the dual purpose of prediction and neuroscientific knowledge gain. However, even within a linear framework, interpretation of either approach can be difficult. Encoding models suffer from feature fallacy; multiple combinations of features derived from a stimulus may describe measured brain responses equally well. Interpreting linear decoding models also requires great care, particularly when informative predictor variables (e.g., fMRI voxels) are correlated with noise measurements, even when regularization is applied. In certain cases, noise channels may be assigned a stronger weight than channels that contain relevant information. Although corrections for this problem exist, there are certain noise sources, common to functional neuroimaging recordings, that may complicate corrective approaches. Here, we review potential pitfalls for making inferences based on encoding and decoding hypothesis testing, focusing on the challenges associated with interpreting decoding models.
Marcel A.J. van Gerven, Katja Seeliger, Umut Güclü, Yagmur Güclütürk: Artificial Neural Networks for Brain and Behaviour
Artificial neural networks (ANNs) have made a significant impact in a large number of research fields. The purpose of this chapter is to provide an overview of the various ways in which researchers in the social sciences can also benefit from embracing neural networks. ANN models can be used to better understand brain function, or to gain a better understanding of how biological agents learn, reason and behave. Furthermore, ANNs afford various applications in neuroscience and behavioural science. In our exposition we will focus mainly on the use of rate-based neural networks.
Part 6: Software
Maximilian Alber: Software Design for Explanation Methods
Deep neural networks have successfully pervaded many application domains and are increasingly used in critical decision processes. Understanding their workings is desirable or even required to further realize their potential as well as to access sensitive domains like medical applications or autonomous driving. One key to this broader usage of explaining frameworks is the accessibility and understanding of the respective software. In this work we introduce the software and application patterns for explanation techniques that aim to explain individual predictions of neural networks. We discuss how to code well-known algorithms efficiently within deep learning software frameworks and describe how to embed algorithms in downstream implementations. Building on this we show how explanation methods can be used in applications to understand predictions for misclassified samples, to compare algorithms or networks, and to examine the focus of networks. Furthermore, we review available