AI Production Experiments: The Art of A/B Testing and Shadow Deployments

April 18, 2024

Part 4 in our Practitioners Guide Series. Download the full ebook here.

In the earlier chapters of this series, we’ve examined the transition of machine learning (ML) models from development to production. This chapter delves into two pivotal methods for validating ML models in production: A/B testing and shadow deployments.

Shifting ML models from a controlled development environment to the unpredictable world of production necessitates verifying their real-world performance. A/B testing and shadow deployments are critical to this process, enabling data scientists to refine and adapt their strategies based on direct feedback from operational environments.

A/B Testing: May The Best Model Win

A/B testing provides a structured framework for comparing two models—usually a newly developed model against the one currently in production—to determine which one performs better based on specific metrics.

In machine learning, A/B testing is essential during the deployment phase: the current production model, known as the “champion,” competes against one or more newly proposed models, the “challengers.”

This method minimizes reliance on guesswork and subjective decision-making, ensuring that model selection is driven by clear, empirical data. It serves not only to test if new models can outperform the existing ones but also to quantify the extent of improvement. This quantification is vital for providing stakeholders with understandable evidence of return on investment (ROI), which supports the rationale for adopting new models and strategies.

Incorporating A/B testing into the model deployment workflow enables AI teams to verify that their models are effective not only in theory but also in real-world scenarios.
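
To make the mechanics concrete, here is a minimal, framework-agnostic sketch of champion/challenger traffic splitting. The function names and the 10% challenger share are assumptions for illustration, not a Wallaroo.AI API; a production platform handles this routing inside the deployment pipeline.

```python
import random
from typing import Any, Callable, Dict, Tuple


def route_request(
    features: Dict[str, Any],
    champion: Callable[[Dict[str, Any]], Any],
    challenger: Callable[[Dict[str, Any]], Any],
    challenger_share: float = 0.1,
) -> Tuple[str, Any]:
    """Route one request: a small share of traffic goes to the challenger,
    the rest to the champion. The variant label is returned alongside the
    prediction so the outcome can later be attributed to the right model."""
    if random.random() < challenger_share:
        return "challenger", challenger(features)
    return "champion", champion(features)
```

Logging the variant label next to the eventual business outcome is what makes the comparison in the next step possible.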

Define Your Goals

First, you need to establish what exactly you are measuring. This metric, known as the Overall Evaluation Criterion (OEC), should reflect broader business goals rather than just technical performance metrics such as the loss functions used during model training.

Common OEC examples include revenue, click-through rates, conversion rates, or completion rates of a given process. This criterion is your beacon, guiding the A/B test towards outcomes that matter most to your business.
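
As a small illustration, aggregating a conversion-rate OEC from logged experiment events might look like the sketch below; the (variant, converted) event shape is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conversion_rate_by_variant(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate logged (variant, converted) pairs into a conversion rate per
    variant, which serves as the OEC in this example. Any logged business
    outcome (revenue, clicks, completions) works the same way."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for variant, converted in events:
        totals[variant] += 1
        wins[variant] += int(converted)
    return {variant: wins[variant] / totals[variant] for variant in totals}


# conversion_rate_by_variant([("champion", True), ("champion", False), ("challenger", True)])
# -> {"champion": 0.5, "challenger": 1.0}
```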

Set the Standards for Success

Next, you must define what constitutes an improvement. It’s tempting to simply say “the challenger beats the champion,” but how do you quantify that victory? This is where you set a precise benchmark for success, ensuring the outcome is statistically significant and not just a matter of chance.
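
One common way to set that benchmark is a two-proportion z-test on the OEC, paired with a pre-agreed minimum lift. The sketch below uses only the Python standard library; the example counts and whatever alpha threshold you pair it with are placeholders, not prescriptions.

```python
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates
    (champion vs. challenger), using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# e.g. champion: 480 conversions out of 10,000; challenger: 540 out of 10,000
# p = two_proportion_z_test(480, 10_000, 540, 10_000)
# Declare a win only if p falls below the alpha you chose up front AND the
# observed lift clears the minimum improvement you agreed on beforehand.
```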

Run an A/A Test

Before diving into A/B testing, it’s wise to conduct an A/A test, in which both groups receive the identical current model and so both effectively serve as controls. This is essentially a trial run that helps identify unintentional biases or errors in your setup. It also gives you a clearer picture of how much natural variation to expect in your data, setting a more reliable baseline for the actual experiment.
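
If you want intuition for what an A/A run protects you against, the stand-alone simulation below asks: with both arms served by the exact same model, how large a “lift” can pure noise produce? The traffic volume and base conversion rate are made-up defaults.

```python
import random


def aa_noise_band(n_per_arm: int = 10_000, base_rate: float = 0.05, trials: int = 1_000) -> float:
    """Simulate repeated A/A splits in which both arms share the same true
    conversion rate, and return the 95th percentile of the absolute observed
    difference: a rough picture of how big a 'lift' noise alone can produce."""
    diffs = []
    for _ in range(trials):
        a = sum(random.random() < base_rate for _ in range(n_per_arm)) / n_per_arm
        b = sum(random.random() < base_rate for _ in range(n_per_arm)) / n_per_arm
        diffs.append(abs(a - b))
    diffs.sort()
    return diffs[int(0.95 * trials)]


# With these defaults the 95th-percentile gap is typically on the order of half
# a percentage point: a useful reminder that small "wins" in a real A/B test can
# be indistinguishable from noise.
```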

Technical Considerations

While this guide doesn’t delve into the granular specifics of designing an A/B test, it’s crucial to ensure your experiment is robust:

  • Ensure randomization in assigning users to control or treatment groups to avoid bias.
  • Maintain consistency in how each user is treated for the duration of the test, so that switching models mid-experiment cannot skew the results (the hash-based assignment sketch after this list covers both points).
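
One simple way to satisfy both points at once is a deterministic, salted hash of a stable user identifier: the hash gives an effectively random split, and the same user always lands in the same bucket. The experiment salt and split share below are placeholders.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "model-ab-2024", challenger_share: float = 0.1) -> str:
    """Deterministically assign a user to a variant. Hashing the user ID with an
    experiment-specific salt gives an unbiased split, and the same user always
    lands in the same bucket for the lifetime of the test."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform value in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


# assign_variant("user-42") returns the same answer on every call, so a user
# never flips between models mid-experiment.
```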

Avoid Premature Conclusions

Human curiosity often tempts us to check preliminary results and draw early conclusions. However, peeking at ongoing results can lead to premature decisions, particularly if an unexpected short-term trend appears. Resist the urge to make hasty judgments. It’s vital to allow the experiment to reach its designated duration, ensuring that the results you observe are truly indicative of general behavior, not just a statistical anomaly.

For those looking to get into the nitty-gritty of designing and implementing A/B tests, further reading can be found in the Wallaroo.AI article ‘The What, Why, and How of A/B Testing.’ For practical applications and step-by-step instructions, proceed to the AB Testing Tutorial.

Shadow Deployments: The Invisible Testing Ground

Shadow deployment is an effective strategy for evaluating new machine learning models under real-world conditions without impacting the current operational system. 

Here’s how it works:

In a shadow deployment, every model in your experimental pipeline receives the same data input, and all of their inferences are logged. Each model processes data as if it were in a live environment, but only one model actually influences real-world decisions or outputs: the default model, often referred to as the “champion.” The rest of the models, including your new model or “challenger,” operate in the background, which is why it’s termed “shadow” deployment.
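
Stripped of any particular platform, the pattern can be sketched in a few lines of Python. The function and logger names below are illustrative, not the Wallaroo.AI API; in Wallaroo this behavior is configured on the pipeline rather than hand-rolled.

```python
import logging
from typing import Any, Callable, Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")


def shadow_infer(
    features: Dict[str, Any],
    champion: Callable[[Dict[str, Any]], Any],
    challengers: List[Callable[[Dict[str, Any]], Any]],
) -> Any:
    """Run every model on the same input and log every inference, but return
    only the champion's output, so only it affects real decisions."""
    champion_output = champion(features)
    log.info("champion=%s", champion_output)

    for i, challenger in enumerate(challengers):
        try:
            shadow_output = challenger(features)
            log.info("challenger_%d=%s", i, shadow_output)
        except Exception:
            # A failing challenger must never break live traffic.
            log.exception("challenger_%d failed", i)

    return champion_output
```

Comparing the logged challenger inferences against the champion’s (and against ground truth, once it arrives) is what turns the shadow run into a go/no-go decision.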

The primary advantage of this method is that it allows for extensive testing and validation of a new model’s performance and accuracy without any risk to the current system’s stability or output quality. 

This is particularly valuable for what’s often called “sanity checking” a model before it goes fully live. For instance, if you’ve developed a more streamlined version of an existing model—perhaps by employing techniques like knowledge distillation or other optimizations—a shadow deployment lets you verify that this new model performs at least as well, if not better, than the existing one under the same operational conditions.

By observing how the new model behaves in a shadow deployment, you can gather critical insights into any potential improvements or adjustments needed before replacing the champion model. This approach ensures that the transition to a new model is smooth and that the new model is truly ready to take over without compromising the system’s integrity or performance.

Zero Impact on Users: Since the shadow model’s outputs do not influence actual operations or user experiences, it can be tested without risk.

This method lets teams assess potential improvements or optimizations in a model without any impact on the end-user experience. It’s a safety net that ensures any new deployment is already vetted for performance before it goes live.

Learn more about shadow deployments with Wallaroo.AI

After the Experiment

Testing and rapidly experimenting with new models in live environments is a fundamental part of the machine learning process. It allows data scientists to continuously improve and push the limits of what AI can do. 

A/B testing and shadow deployments are essential tools that help validate the model’s functionality in actual user environments. When a new model (the “challenger”) outperforms the existing model (the “champion”) in production, the challenge becomes integrating this new model into the live environment without disrupting ongoing operations. 

Hot Swap Technique

Wallaroo.AI streamlines the transition of ML models from a controlled development environment to the unpredictable conditions of real-world application, ensuring that models perform as expected outside of the theoretical and controlled settings. 

Wallaroo.AI supports hot swapping, enabling the integration of better-performing models into live environments without downtime. This capability is crucial in high-stakes environments (like fraud detection) where even minimal downtime can lead to significant operational risk and financial loss.

Steps for Implementing a Hot Swap

  1. Assess Model Performance: First, evaluate the current champion and the new challenger through A/B testing or a shadow deployment. If the challenger proves more effective, the decision to update is clear.
  2. Prepare for Integration: In the Wallaroo.AI platform, this involves using the pipeline method replace_with_model_step, which lets you specify which part of your operational pipeline should receive the new model (a short code sketch follows this list).
  3. Identify the Step: Each step in your pipeline is indexed, typically starting at 0. You will specify which step index the current model occupies that you wish to replace.
  4. Replace the Model: Execute the replace_with_model_step(index, model) method where index is the position of the model being replaced, and model is the new model you are integrating.
  5. Maintain Input and Output Continuity: The pipeline continues to receive input data from the same sources and sends outputs to the same destinations. The only change is the internal model that processes this data, now updated to the new, more effective model.
  6. Operational Continuity: This switch is done in real-time, without halting the pipeline or requiring a restart. This ensures that the system remains operational and continues to perform its function, now with improved accuracy or efficiency thanks to the new model.
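
Putting these steps together, the swap itself can be very small. The sketch below is a simplified illustration built around the replace_with_model_step method described above; the helper name, the default step index, the assumption that you already hold the pipeline and uploaded challenger objects, and the final redeploy call are illustrative and may vary with your SDK version.

```python
def hot_swap(pipeline, challenger_model, step_index=0):
    """Swap the model at `step_index` for the challenger without taking the
    pipeline offline. `pipeline` is the deployed Wallaroo pipeline currently
    serving the champion; `challenger_model` is a model already uploaded to
    the workspace. (Helper name and surrounding setup are illustrative.)"""
    # Step indexes start at 0; replace whatever model sits at that step.
    pipeline.replace_with_model_step(step_index, challenger_model)

    # Apply the updated step configuration. Inputs keep arriving from the same
    # sources and outputs flow to the same destinations throughout the swap.
    return pipeline.deploy()
```

Because the endpoint, inputs, and outputs are untouched, downstream systems never notice the change; only the model doing the inference is different.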

This hot swapping capability is vital for environments where model performance directly impacts business outcomes or user experiences. By enabling on-the-fly updates, AI teams can ensure their systems are always running the best available algorithms, optimizing performance continuously without sacrificing uptime.

By incorporating these testing and deployment strategies, Wallaroo.AI helps streamline the entire lifecycle of ML model management, from development to deployment and ongoing optimization, ensuring that AI operations are aligned with broader business goals.

Test Drive Production AI For Free with the Wallaroo Community Edition.


Download the Full Practitioners Guide Ebook