Explaining Complex Models in Production: SHAP Walkthrough

June 14, 2022

Embarking on the journey of implementing models in production can be both thrilling and challenging. It’s where the rubber meets the road in the machine learning (ML) landscape. Transforming a model from a prototype to a production-ready state involves a blend of art, science, and engineering. It’s an endeavor that showcases the real power and value of ML, making tangible impacts on businesses and driving data-driven decision-making. However, the road to a robust production environment for models isn’t without hurdles.

Machine learning is becoming an ever more important part of our businesses and personal lives. ML models can help make business operations faster, more reliable, more scalable, and cost-effective. However, as we hand over more responsibility to ML-driven processes, the need to understand how our models make decisions is increasingly important.

Simpler models can be more understandable, but their performance may be limited in many cases. For example, in a linear regression, the coefficients tell us the magnitude and direction of each feature’s contribution to the prediction. However, linear models are primarily useful in situations where features have a linear relationship (or one that can be made linear) to the target. In other cases, we may want to use more sophisticated models to improve accuracy.

These more sophisticated models are often non-linear combinations of simpler linear models, ensembles of decision trees, or even ensembles of other complicated models. They may have a great many coefficients (or weights) and non-linearities that contribute in different amounts in different situations. Typically there are so many weights that it is essentially impossible to untangle all the various effects, manually or mentally, to know what a model is doing. Consequently, as models become more sophisticated and accurate, they become more opaque, and it becomes more difficult to diagnose how and why they make certain predictions.

Understanding and explaining these more complex models that we want to use in production is not only critical for legal and ethical reasons but also makes solid business sense before we hand off important decisions to automated systems. For these more complex models, we have to try some indirect approaches. Two of the more common and useful approaches are Shapley Values, which are a way of estimating a particular feature’s effect on a specific prediction, and Partial Dependence and Individual Conditional Expectation plots, which are used to visualize the interaction between the features and the prediction values.

Shapley Values

The Python SHAP package is very useful and commonly used to analyze models. SHAP “explains” a model’s prediction on a given input by estimating how much each input to the model contributes to that prediction, based on the effect caused by various combinations of features in a model.

As an example, we can use the House Sales in King County (Seattle, Washington area) dataset to create a model to predict prices. This particular dataset has 17 features which include ones you might expect like the number of square feet of living space, latitude, longitude, number of bedrooms, and so on.

We can use the SHAP package to examine the impact of these features on an individual prediction. In this example, the true sale price was $565,000.00 and the model predicted $565,189.88. SHAP analysis tells us that sqft_living contributed ~$70k whereas the longitude and latitude (location) deducted ~$7k and ~$31k respectively. These three contributions net to roughly +$32k, while the prediction sits about $28.2k above the base (mean) prediction value of $536,992.23, so the remaining features together account for roughly -$4k. Adding every feature’s contribution to the base value gives the final prediction, which is how SHAP arrives at its explanation.
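The "base value plus per-feature contributions equals the prediction" property can be illustrated with a minimal sketch that computes exact Shapley values by brute force. Everything here is an assumption made for illustration: the three-feature toy model, the four-row background sample, and the helper names `coalition_value` and `shapley_values` are hypothetical, and this is not the article's model or the SHAP package's implementation.

```python
from itertools import combinations
from math import factorial


def model(row):
    """Hypothetical toy price model with an interaction term (not the article's model)."""
    sqft, lat, beds = row
    return 100.0 * sqft + 50_000.0 * (lat - 47.0) * beds + 10_000.0 * beds


# Hypothetical background sample standing in for the data distribution.
background = [
    (1000.0, 47.2, 2.0),
    (1500.0, 47.5, 3.0),
    (2000.0, 47.7, 3.0),
    (2500.0, 47.4, 4.0),
]


def coalition_value(x, subset):
    """Mean prediction when features in `subset` take x's values
    and all other features come from each background row in turn."""
    total = 0.0
    for bg in background:
        mixed = tuple(x[i] if i in subset else bg[i] for i in range(len(x)))
        total += model(mixed)
    return total / len(background)


def shapley_values(x):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over every subset of the other features."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = coalition_value(x, s | {i}) - coalition_value(x, s)
                contribution += weight * gain
        phi.append(contribution)
    return phi


house = (1800.0, 47.6, 3.0)
base_value = coalition_value(house, set())  # mean prediction over the background
phi = shapley_values(house)
prediction = model(house)

# The contributions always sum to (prediction - base value).
print(base_value, phi, prediction)
```

Brute-force enumeration like this is only practical for a handful of features, which is why real tooling relies on approximations or model-specific shortcuts.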

Partial Dependence and Individual Conditional Expectation plots

In addition to explaining the effect of different features on specific predictions, we also often want to know how the value of a certain feature affects predictions overall. Partial Dependence Plots (PDP) and the related Individual Conditional Expectation (ICE) plots are useful ways to visualize the relationship between the features and the prediction values. In each of these plots, the value of a feature of interest is varied across a range, while the other features of each data point are kept at their observed values. For a PDP, the resulting predictions are averaged over the data points (often shown with a standard deviation band); for an ICE plot, the predictions for each data point are plotted as an individual curve. This allows us to estimate how changing the value of different variables changes the model’s predictions.

Below is the PDP for Latitude for our model calculated from a sample of the data. It shows the expected effect and variance on the prediction for various values of Latitude while keeping all other values constant. Note that the latitude of Seattle is about 47.6 degrees. We can see that, on average, Latitude has a larger positive effect on the price as the house gets closer to Seattle, and that being to the north of 47.6 degrees also has a larger positive effect on price than being to the south.

This is a clear example of a non-linear effect of a feature on a prediction.
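The PDP and ICE procedure described above can be sketched by hand in a few lines. This is a minimal illustration under stated assumptions: the two-feature toy model (deliberately built to peak at latitude 47.6), the three-row sample, and the helper names `ice_curves` and `partial_dependence` are all hypothetical and are not taken from the article's model or from any library.

```python
from statistics import mean, pstdev


def model(row):
    """Hypothetical toy model: price falls off as latitude moves away from 47.6."""
    sqft, lat = row
    return 200.0 * sqft - 400_000.0 * (lat - 47.6) ** 2


# Hypothetical sample of (sqft_living, latitude) rows.
sample = [(1000.0, 47.3), (1500.0, 47.5), (2200.0, 47.7)]
lat_grid = [47.2, 47.4, 47.6, 47.8]


def ice_curves(rows, feature_index, grid):
    """One curve per row: vary a single feature over the grid,
    keep that row's other features at their observed values."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_index] = value
            curve.append(model(tuple(modified)))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """PDP: mean (and spread) of the ICE curves at each grid point."""
    columns = list(zip(*curves))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


ice = ice_curves(sample, feature_index=1, grid=lat_grid)
pdp_mean, pdp_std = partial_dependence(ice)

for lat, avg, spread in zip(lat_grid, pdp_mean, pdp_std):
    print(f"lat={lat}: mean prediction={avg:,.0f} (std {spread:,.0f})")
```

Plotting each row of `ice` against `lat_grid` gives the ICE plot, and plotting `pdp_mean` with a band of `pdp_std` gives the PDP.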

Challenges of Using SHAP in Production

We’ve seen that these approaches are useful in understanding our models. The SHAP package in particular can analyze Shapley values and create PDP/ICE plots; it is incredibly useful for an individual data scientist to examine their model using Python.

However, it’s also important to be able to explain a model’s behavior on real-world data in production. The ability to do so can help a data scientist to diagnose odd model behavior quickly and efficiently, in the actual environment where the model is being run. It can be a good complement to advanced model observability features, like drift detection. In industries that require transparency in key decision processes, easily accessible explainability of production models is a must. 

Here at Wallaroo, we are working to provide SHAP-based model explainability within our platform, in a way that provides a good user experience and useful information for data scientists and analysts. To make SHAP model explainability a regular tool in a production system, we want to be able to:

  • Run SHAP analysis on a particular model or model pipeline, either on-demand or on a regular schedule.
  • Run the analysis over predictions that were made in production, and not just over our training and test data.
  • Explain the effect of specific features, either overall, or on specific predictions.
  • Have an intuitive user interface both for submitting SHAP job requests and for analyzing/visualizing the results.
  • Do all of the above while minimizing the impact on production systems.

But running SHAP in a production environment is a bit more complicated than running it within a data scientist’s individual environment.

We won’t get into the details of how SHAP works in this article (there is a good explanation of Shapley values here, and the SHAP method here), except to say that to estimate the contribution of a given feature to an individual prediction, SHAP must consider a lot of hypothetical situations. Going back to our house example, if we want to explain the effect of sqft_living on the predicted price of House A, then the SHAP algorithm must consider what the model would predict on other hypothetical houses with the same value of sqft_living as House A, but different values of bedrooms, bathrooms, and so on. It must also consider houses with the same square footage and number of bedrooms as House A, but different numbers of bathrooms, floors, etc. The number of hypothetical situations can get large, especially as the number of features increases. Hence, SHAP can be quite computationally intensive.
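To give a rough sense of the scale involved, the sketch below counts the feature subsets an exact computation would need for 17 features, then estimates one feature's Shapley value by sampling random feature orderings instead. Permutation sampling is one common approximation and is used here purely as an illustration; it is not a claim about which algorithm the SHAP package applies to a given model. The weights, background data, and the `sampled_shapley` helper are hypothetical, and the toy model is linear only so that the exact answer is known in closed form and the estimate can be checked.

```python
import random

N_FEATURES = 17  # same feature count as the dataset described earlier

# Exact Shapley values need every subset of the other 16 features, per feature.
coalitions_per_feature = 2 ** (N_FEATURES - 1)  # 65,536

rng = random.Random(0)
# Hypothetical linear model weights, background rows, and instance to explain.
weights = [rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]
background = [[rng.uniform(0.0, 1.0) for _ in range(N_FEATURES)] for _ in range(50)]
x = [rng.uniform(0.0, 1.0) for _ in range(N_FEATURES)]


def model(row):
    """Hypothetical linear toy model."""
    return sum(w * v for w, v in zip(weights, row))


def sampled_shapley(x, feature, n_samples, rng):
    """Monte Carlo estimate: average the feature's marginal contribution
    over random feature orderings and random background rows."""
    n = len(x)
    total = 0.0
    for _ in range(n_samples):
        bg = rng.choice(background)
        order = list(range(n))
        rng.shuffle(order)
        # Features ordered before `feature` take x's values; the rest stay at bg.
        preceding = set(order[: order.index(feature)])
        without_f = [x[j] if j in preceding else bg[j] for j in range(n)]
        with_f = list(without_f)
        with_f[feature] = x[feature]
        total += model(with_f) - model(without_f)
    return total / n_samples


estimate = sampled_shapley(x, feature=0, n_samples=5000, rng=rng)

# For a linear model the exact Shapley value is w_i * (x_i - mean background x_i).
bg_mean_0 = sum(row[0] for row in background) / len(background)
exact = weights[0] * (x[0] - bg_mean_0)

print(coalitions_per_feature, estimate, exact)
```

The sampled estimate uses 10,000 model calls for one feature, whereas exact enumeration would evaluate 65,536 subsets per feature, each against every background row. Either way, the work is multiplied by the number of features and the number of predictions being explained.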

In addition, SHAP must have access to a running model’s inputs and inference logs, in order to generate synthetic data points with appropriate distributions and to estimate marginal effects correctly. The algorithm must also generate synthetic inferences using a model pipeline identical to the production model pipeline: that is, the inputs must go through not just the model, but any necessary pre- and post-processing steps.

So, to run SHAP in a production environment, we must figure out a way to do it without negatively impacting Quality of Service guarantees on the model pipelines. We don’t want to cause resource contention. We don’t want to pollute the inference logs with synthetic data. 
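Two of these constraints, running synthetic inputs through the full pipeline and keeping them out of the inference logs, can be sketched as follows. This is purely illustrative and does not describe Wallaroo's design or API: the `Pipeline` class, its `record` flag, the `explain_fn` method, and the toy pre-processing, model, and post-processing steps are all hypothetical. It also does nothing about resource contention, which would require running the explanation workload on separate compute.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical production pipeline: preprocess -> model -> postprocess."""

    preprocess: Callable[[List[float]], List[float]]
    model: Callable[[List[float]], float]
    postprocess: Callable[[float], float]
    inference_log: List[dict] = field(default_factory=list)

    def infer(self, raw: List[float], record: bool = True) -> float:
        """Run the full pipeline; only record real production traffic."""
        result = self.postprocess(self.model(self.preprocess(raw)))
        if record:
            self.inference_log.append({"input": list(raw), "output": result})
        return result

    def explain_fn(self) -> Callable[[List[float]], float]:
        """Prediction function for an explainer: same steps, no logging."""
        return lambda raw: self.infer(raw, record=False)


# Hypothetical steps: scale sqft to thousands, linear model, round to cents.
pipeline = Pipeline(
    preprocess=lambda raw: [raw[0] / 1000.0, raw[1]],
    model=lambda feats: 150_000.0 * feats[0] + 20_000.0 * feats[1],
    postprocess=lambda price: round(price, 2),
)

production_price = pipeline.infer([1800.0, 3.0])   # real request, logged
predict_for_shap = pipeline.explain_fn()
synthetic_price = predict_for_shap([1800.0, 4.0])  # synthetic input, not logged

print(production_price, synthetic_price, len(pipeline.inference_log))
```

An explainer would be handed `predict_for_shap` rather than the bare model, so that every synthetic inference passes through the same pre- and post-processing as production traffic.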

Finding the best solution will be an architectural and engineering challenge, but we feel that the rewards are worth it. It’s the next logical step in the journey along that ML last mile.