Delivering AI Value with Wallaroo Observability and Model Insights

January 26, 2022

Wallaroo’s vision is to provide the easiest way to deploy, run, and observe ML in production at scale – and to help enterprises use ML to impact their bottom line (read more here). To that end, we believe that observability and advanced model insights are key components of the product. These features let ML teams quickly identify model underperformance or production system bottlenecks so they can adapt and iterate as needed to drive business value. In addition, observability provides the detailed audit trail needed for compliance and risk management. 

Naturally, Wallaroo provides the basic observability that you would expect:

  • User and system logs, including what was deployed by whom, timestamps, system throughput and latency, etc.
  • Real-time audit trail of the data consumed in production, the model and version that performed each inference, the resulting prediction, timestamp, compute duration, etc.
  • Data Validation checks on the data that is being input to a model, on a model-by-model basis

In addition to that, we also provide a new feature: Model Insights. Our Model Insights feature lets you monitor how the environment that your model operates within may be changing in ways that affect the bottom line, so you can intervene in a timely manner. This post delves into Model Insights a bit more deeply.

Models Have Assumptions

In machine learning, we use data and known answers to train a model to make predictions for new, previously unseen data. We do this with the assumption that the future unseen data will be similar to the data used during training: the future will look somewhat like the past.

This isn’t completely true, of course, and a good model should be robust to some amount of change in its environment; however, if the environment changes too much, your models may no longer be making the correct decisions. This situation is known as concept drift; too much drift can render your models obsolete, requiring periodic retraining.

How do you detect that concept drift is causing a problem? One way is to compare model predictions with actual outcomes, but this isn’t always possible, and the feedback may lag too far behind the predictions. For example, once a credit card fraud model rejects a transaction, there’s no way to tell whether that transaction was actually fraudulent. If your model is rejecting too many good transactions, you might not find out until the complaints start rolling in. By then, it’s far too late!

Detecting Concept Drift

Can you detect the problem with your credit card fraud model before your customers start complaining? This is where Model Insights comes in. Model Insights tries to catch concept drift by monitoring data drift: whether the data going into a model, or the model’s predictions themselves, have drifted too far away from the model’s expectations. We do that by comparing the distribution of incoming data and/or model predictions to baseline distributions that describe what the model expects.

For example, suppose your credit card fraud model returns the probability (a number from 0 to 1) that a transaction is fraudulent. Using Wallaroo’s Model Insights, you can establish a baseline distribution of the scores that the model returns. This baseline describes what the model “usually does.” Wallaroo can then continuously monitor the distribution of scores over user-selected time windows and compare the current distribution to the baseline.
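To make the windowing idea concrete, here is a minimal Python sketch. Everything in it is illustrative, not the Wallaroo SDK: a hypothetical inference log of fraud scores is grouped into daily windows, and each window’s score distribution is summarized as bin fractions that could be compared against the baseline day.

```python
import numpy as np
import pandas as pd

# Hypothetical inference log: one fraud score (0-1) per prediction,
# with a timestamp. Column names here are illustrative.
rng = np.random.default_rng(1)
log = pd.DataFrame({
    "time": pd.date_range("2022-01-01", periods=7 * 24, freq="h"),
    "score": rng.beta(2, 8, size=7 * 24),  # mostly-low fraud scores
})

edges = np.linspace(0.0, 1.0, 6)  # five equal-width score bins

def fractions(scores):
    """Fraction of scores falling in each bin."""
    counts, _ = np.histogram(scores, bins=edges)
    return counts / counts.sum()

# Baseline distribution from the first day, then one distribution per day.
baseline = fractions(log.loc[log["time"] < "2022-01-02", "score"])
per_day = log.groupby(log["time"].dt.date)["score"].apply(fractions)
```

Each entry of `per_day` is a five-element vector of bin fractions; comparing it against `baseline` is exactly the Period A vs. Period B comparison described below.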

In the chart above, Period A represents the baseline distribution of fraud model scores. Period B represents the distribution of scores at some later time window. The two distributions aren’t identical, but they’re close: there hasn’t been a lot of concept drift, and the model appears to be working as expected.

But things can change. Let’s say that, according to the model, purchases of expensive electronics that are shipped to an address different from the billing address are more likely to be fraudulent. This might have been a reasonable assumption when the model was trained, but it may not be now. These days, more people may be working or studying at home (and therefore need more expensive electronics), and they may be sending or gifting more electronics to family in other households (a new laptop for the kid at college, new smartphones, or tablets for grandparents to teleconference with their grandkids).

This change in consumer behavior could cause the model to give higher fraud scores to more legitimate purchases. Rejecting too many legitimate purchases would be bad for your business.

In the above chart, we are now comparing the model’s baseline distribution to a more recent time window, Period C. The change in consumer behavior has caused the fraud model to give legitimate transactions higher scores, changing the score distribution. This overall upward trend could cause the model to erroneously reject too many legitimate transactions. 

If you catch this in time, you can investigate the cause and intervene before bad decisions cause you problems. 

The ability to catch a change in model behavior by noticing the changing score distribution can save your business from costly errors. Wallaroo Model Insights can also monitor the distribution of key model inputs. In this example, monitoring the distribution of purchase size for transactions involving electronic goods could detect a change in consumer behavior that is potentially causing concept drift.

How We Do It

There are many useful statistical tests for measuring the difference between distributions; however, they typically require assumptions about the underlying distributions or involve complex, expensive calculations. We’ve implemented a data drift framework that is easy to understand, fast to compute, runs automatically, and is extensible to many specific use cases.

Our methodology currently revolves around computing percentile-based bins of the baseline distribution and measuring how future distributions fall into those bins. This approach is visually intuitive and supports an easy-to-calculate difference score between distributions. Users can tune the scoring mechanism to emphasize different regions of the distribution: for example, you may only care about changes in the top 20% of the distribution (above the 80th percentile) compared to the baseline. A special interactive mode helps you explore model behavior and tune parameters to be just right.
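Wallaroo’s exact scoring is internal to the product, but the general percentile-bin idea can be sketched in a few lines of Python. This sketch (all names illustrative, not the Wallaroo SDK) derives bin edges from the baseline’s percentiles and computes a Population Stability Index (PSI) style difference score between the baseline and a later window:

```python
import numpy as np

def percentile_bin_edges(baseline, n_bins=5):
    """Bin edges at equal percentiles of the baseline, with open ends
    so every future value lands in some bin."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]  # e.g. 20th, 40th, 60th, 80th
    return np.concatenate([[-np.inf], np.percentile(baseline, qs), [np.inf]])

def bin_fractions(values, edges):
    """Fraction of values falling in each bin."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()

def drift_score(baseline, window, n_bins=5, eps=1e-6):
    """PSI-style score: near 0 when the window matches the baseline,
    growing as the two distributions diverge."""
    edges = percentile_bin_edges(baseline, n_bins)
    p = bin_fractions(baseline, edges) + eps  # eps guards against log(0)
    q = bin_fractions(window, edges) + eps
    return float(np.sum((q - p) * np.log(q / p)))
```

By construction, each baseline bin holds an equal share of the baseline data, so the score responds only to how future data redistributes across those bins; weighting individual bins differently is one way to emphasize, say, only the top of the distribution.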

The Wallaroo Model Insights Framework

The Wallaroo Model Insights framework supports the model monitoring that we’ve described above, in an intuitive and computationally efficient way.

You can specify the inputs or outputs that you want to monitor and the data to use for your baselines. You can also specify how often you want to monitor distributions and set parameters to define what constitutes a meaningful change in distribution for your application. 

Once you’ve set up a monitoring task, called an assay, comparisons against your baseline run automatically on a schedule, and you can be notified if the system detects abnormal behavior. The framework also lets you quickly investigate the cause of any unexpected drift in your predictions. More on that in a future article.
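The core loop of such a scheduled check can be approximated in plain Python. This is an illustrative stand-in for an assay, not the actual Wallaroo SDK: given a baseline and a sequence of observation windows, it flags any window whose binned distribution strays too far from the baseline in any percentile bin.

```python
import numpy as np

def flag_drifted_windows(baseline, windows, n_bins=5, threshold=0.25):
    """Return the indices of windows whose binned distribution differs
    from the baseline by more than `threshold` in any percentile bin.
    Illustrative only -- not the Wallaroo SDK."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.concatenate([[-np.inf], np.percentile(baseline, qs), [np.inf]])

    def fractions(values):
        counts, _ = np.histogram(values, bins=edges)
        return counts / counts.sum()

    base = fractions(baseline)
    return [i for i, w in enumerate(windows)
            if np.max(np.abs(fractions(w) - base)) > threshold]
```

In Wallaroo itself, the assay handles the scheduling, baseline management, and notification for you; the sketch only shows the shape of the comparison being run on each window.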

Talk to us

In this article, we’ve discussed how concept drift could adversely affect your bottom line and underlined the importance of monitoring your model’s inputs and outputs to detect it. In combination with Wallaroo’s Data Validation checks, Model Insights can help make sure that your AI-driven processes are running smoothly and error-free.

If you want to explore this further with us, whether by getting access to the full SDK, giving us specific feedback about functionality, semantics, or integration, or discussing how Model Insights can help with your use case, email us at

You can also find more blogs about Wallaroo: