Explanation methods in machine learning are designed to demystify the decision-making processes of complex models, particularly deep neural networks. These methods provide interpretable insights into why a model makes certain predictions, highlighting the features or patterns it deems most important. By making the model's internal mechanisms more transparent, explanation methods serve two crucial purposes: they enable researchers to identify and address potential flaws in the model, and they help build trust among end-users by making the model's decisions more understandable.
Traditionally, the focus has been on refining explanation methods to improve the quality of the explanations. However, the model being explained is a critical and often overlooked factor in this equation. The interplay between the model and the explanation method significantly impacts the resulting explanations, yet this relationship remains largely unexplored.
In this blog post, we delve into this relationship, examining how a model's characteristics influence the effectiveness of explanation methods and, consequently, the quality and accuracy of the insights we can derive from them. Before we examine how the model influences the quality of an explanation, we must define what makes an explanation good.
What makes an explanation good?
Usually, researchers use two key criteria:
Plausibility: The explanations must be understandable and convincing to humans.
Faithfulness: The explanations must accurately reflect the model's internal mechanisms. Essentially, the explanation shouldn't be misleading.
An explanation must be faithful; otherwise, it is misleading. An explanation must be plausible; otherwise, it is of no use to the humans it is shown to. Therefore, we want explanations to be both faithful and plausible. Make sense? Great! Let's explore how the model can influence each of these criteria.
How the model impacts plausibility
An explanation can be entirely faithful—accurately reflecting the model's internal mechanisms—yet still be difficult or impossible for humans to understand. This disconnect can occur for two main reasons:
Flawed Internal Mechanisms: If the model's underlying processes are flawed, a faithful explanation of those processes will inherit those flaws, making it challenging to understand or interpret correctly.
Excessive Complexity: Even if the model's internal mechanisms are sound, they may be so complex that a faithful explanation becomes too intricate for easy human comprehension.
This phenomenon isn't unique to machine learning. In academic research, we often encounter papers that are difficult to understand, not only because of poor writing but also because the underlying concepts are either flawed or exceedingly complex. The same principle applies to explanations of machine learning models. Here are some examples of how a model's internal mechanisms can impact plausibility.
Shortcut learning
Machine learning models often use shortcuts, much like students who learn how to do well on the exam instead of understanding the subject. For example, an image classifier can learn to recognize cows by looking only at the green, grassy background, and then fail to recognize a cow on a beach. A faithful explanation would highlight the background, not the cow. Such an explanation would confuse users who lack the background (🤣) knowledge that machine learning models learn these kinds of shortcuts. To learn more about shortcut learning in deep neural networks, check out the paper Shortcut Learning in Deep Neural Networks by Geirhos et al. It is one of my favorite papers!
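To make this concrete, below is a minimal sketch of shortcut learning on synthetic data. Everything here is hypothetical (the feature names, the data, and the scikit-learn setup are illustrations, not taken from the papers above): a spurious "background" feature correlates with the label during training, the classifier leans on it, and accuracy collapses once that correlation breaks.

```python
# Toy illustration of shortcut learning (hypothetical synthetic data).
# A spurious "background" feature correlates almost perfectly with the label
# during training, so the classifier leans on it and collapses when the
# correlation breaks at test time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, background_matches_label):
    y = rng.integers(0, 2, size=n)                 # 1 = cow, 0 = not cow
    cow_shape = y + rng.normal(0, 1.0, size=n)     # weak "real" signal
    if background_matches_label:
        green_bg = y + rng.normal(0, 0.1, size=n)  # near-perfect shortcut
    else:
        green_bg = rng.normal(0, 1.0, size=n)      # shortcut broken (cow on a beach)
    return np.column_stack([cow_shape, green_bg]), y

X_train, y_train = make_data(2000, background_matches_label=True)
X_iid, y_iid = make_data(2000, background_matches_label=True)
X_ood, y_ood = make_data(2000, background_matches_label=False)

clf = LogisticRegression().fit(X_train, y_train)
print("coefficients [cow_shape, green_bg]:", clf.coef_[0])
print("accuracy, background correlated:", clf.score(X_iid, y_iid))
print("accuracy, background broken:   ", clf.score(X_ood, y_ood))
# The background coefficient dominates and accuracy drops sharply once the
# shortcut no longer holds; a faithful explanation would highlight green_bg.
```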
Adversarial examples
Image classifiers that are sensitive to adversarial examples also produce less plausible explanations. Basically, you can add imperceptible noise to an image that tricks the model into thinking a picture of a cow is a picture of an airplane. Weird, right? The paper "Adversarial Examples Are Not Bugs, They Are Features" shows that models learn to use these tiny pixel changes as features. Consequently, faithful explanations for such sensitive models look weird, and they should! If a tiny change to a pixel changes the output, the explanation should highlight that pixel as important. The issue isn't the explanation method; it is the model!
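For intuition, here is a rough sketch of the classic fast gradient sign method (FGSM), one simple way to craft such adversarial examples. The model, inputs, and epsilon are placeholders for any differentiable PyTorch image classifier; this is not code from the paper above.

```python
# Minimal FGSM sketch: nudge every pixel by a tiny step epsilon in the
# direction that increases the loss. `model`, `image`, and `label` are
# placeholders for any differentiable PyTorch classifier and its input.
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # One signed gradient step per pixel; visually imperceptible for small epsilon.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Usage (placeholders): cow_batch is a (1, 3, H, W) tensor scaled to [0, 1],
# cow_label a (1,) tensor of class indices.
# adv = fgsm_attack(model, cow_batch, cow_label)
# print(model(adv).argmax(dim=1))  # may now confidently say "airplane"
```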
So, can we improve plausibility by improving the model?
Improving plausibility
Several papers show that we can make explanations more plausible by training the models to be robust against adversarial examples. The results from the paper Robustness May Be at Odds with Accuracy by Tsipras et al. are particularly striking. In their figure, the top row shows the original images, the second row shows the explanations from a standard model, and the last two rows show explanations produced by two different adversarially robust models. All rows use the same explanation method, yet the differences are stark.
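If you are curious what "training for robustness" looks like in code, here is a rough sketch of single-step adversarial training in PyTorch; the model, data loader, and epsilon are placeholders, and robust training in practice typically uses stronger multi-step attacks. The idea is simply to perturb each batch to maximize the loss before updating the weights on those perturbed inputs.

```python
# Sketch of single-step adversarial training (placeholder model and loader).
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    model.train()
    for images, labels in loader:
        # Inner step: craft adversarial inputs with one FGSM-style step.
        images = images.clone().detach().requires_grad_(True)
        F.cross_entropy(model(images), labels).backward()
        images_adv = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()
        # Outer step: update the weights on the perturbed inputs instead.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images_adv), labels)
        loss.backward()
        optimizer.step()
```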
In our paper, An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records, we showed that the same is true for text classifiers. However, the improvements were less drastic than those for image classification, which I suspect is because image classifiers are more sensitive to adversarial examples than text classifiers.
Making models robust to adversarial examples is relatively straightforward, but how do we prevent the cow classifier from relying on the background? This is far more difficult and remains unsolved. Preventing models from relying on spuriously correlated features is one of the main objectives of causal machine learning. The paper Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond by Feder et al. introduces the topic well.
How the model impacts faithfulness
The same explanation method can produce far less faithful explanations for some models than for others. Our paper, Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability, showed that BERT trained on Yelp produced far more faithful explanations than RoBERTa trained on IMDB when using the same explanation method. Why does that happen?
Explanation methods make assumptions about the models' internal mechanisms. For instance, most explanation methods assume feature independence (quite a severe assumption). Faithfulness measures how accurately the methods' assumed mechanisms reflect the model's actual mechanisms. If an explanation method assumes feature independence, but the model relies on feature interactions, the explanations will be less faithful.
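Here is a tiny worked example of that mismatch: a hypothetical two-feature model (a toy XOR, pure interaction) explained with a leave-one-out attribution that removes each feature independently.

```python
# Leave-one-out attribution assumes each feature can be removed independently.
# On a model with feature interactions (a toy XOR), that assumption breaks:
# the single-feature effects don't add up to the joint effect.

def model(x1, x2):
    # Pure interaction: the output depends on the combination, not on either feature alone.
    return float(x1 != x2)

def leave_one_out(x1, x2, baseline=0):
    full = model(x1, x2)
    attr_x1 = full - model(baseline, x2)  # effect of removing feature 1
    attr_x2 = full - model(x1, baseline)  # effect of removing feature 2
    return attr_x1, attr_x2

attr_x1, attr_x2 = leave_one_out(1, 1)
print("prediction:", model(1, 1))                  # 0.0
print("attributions:", attr_x1, attr_x2)           # -1.0 -1.0
print("sum of attributions:", attr_x1 + attr_x2)   # -2.0
print("joint effect:", model(1, 1) - model(0, 0))  # 0.0
# The attribution claims each feature pushes the prediction down by 1, yet
# removing both features changes nothing. The interaction is invisible to a
# method that perturbs features one at a time.
```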
Improving faithfulness
There are two approaches to improving faithfulness:
Train the model to fit the explanation method's assumptions better. For example, train the model to not rely on feature interactions.
Develop explanation methods that make fewer assumptions. For example, develop explanation methods that don't assume feature independence (see the sketch after this list).
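To make the second approach concrete, here is a minimal sketch, continuing the toy XOR example from above (all names hypothetical), of an attribution that also measures the pairwise interaction term instead of assuming the features act independently.

```python
# One way to drop the independence assumption: also report the pairwise
# interaction term, i.e. how much the joint effect differs from the sum of
# the individual effects (continuing the toy XOR example).

def model(x1, x2):
    return float(x1 != x2)

def attributions_with_interaction(x1, x2, baseline=0):
    base = model(baseline, baseline)
    main_x1 = model(x1, baseline) - base  # effect of feature 1 alone
    main_x2 = model(baseline, x2) - base  # effect of feature 2 alone
    interaction = (model(x1, x2) - base) - main_x1 - main_x2
    return main_x1, main_x2, interaction

print(attributions_with_interaction(1, 1))  # (1.0, 1.0, -2.0)
# The two main effects and the interaction term now add up exactly to the
# model's behavior; the purely independent attribution above had nowhere to
# put that -2.0.
```

Of course, enumerating interaction terms explodes combinatorially for real models, which is exactly why this is a hard research problem rather than a quick fix.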
Multiple studies have tried the first approach. However, we demonstrated in Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability that the evaluation metric these studies rely on cannot be compared across different models, which is exactly the comparison all of them make. Therefore, their conclusions may be misguided. Furthermore, most of them use gradient-based explanation methods, which, as I argued in my previous blog post, are not really explanations.
Even if we could train models to produce more faithful explanations, I'm skeptical that it is the best approach. I think we should try to improve the explanation methods before we limit the models' internal processes. For starters, we should stop assuming feature independence; feature interactions are the whole point of using attention.
Wrapping up
Models can impact the explanations' plausibility or faithfulness in different ways. They can improve plausibility by relying on features and patterns that align with human intuition and domain knowledge. Models can improve faithfulness by aligning more closely with the assumptions made by explanation methods, though this approach has its limitations.
However, I believe we should divide the responsibilities between the model and explanation methods:
The model's responsibility is to learn meaningful and robust features that generalize well, which often leads to more interpretable decision-making processes.
The explanation methods' responsibility is to accurately reflect the model's internal mechanisms, regardless of their complexity.
This division of responsibilities allows us to optimize both aspects without unnecessarily constraining either the model's performance or the explanation's accuracy.
I hope you enjoyed this blog post. I would love to hear if you disagree or have interesting insights on this topic. See you next time!