Machine Learning Model Experimentation Best Practices

October 11, 2021

Now that you have a model running in Production, how do you know it’s adding value for your business and customers? How do you know the parameters utilized in this model are better than other parameters? How do you know what you are doing is working better than what you had in production before? These are key questions you should ask yourself before productionizing any machine learning (ML) model.

Experimentation is crucial to the ML model-building strategy. Experiments may involve different training and testing data, models with differing hyperparameters, different code (even if it’s a small change), and often the same code run in different environment configurations. Each experiment produces its own set of metrics; consequently, many data scientists who do not follow experimentation best practices lose track of what they ran and what it produced. Let’s get started on a few practices we have picked up along the way.


Version Control

Why is version control important? First, it lowers the risk of erasing or overwriting someone’s work or making other mistakes. Second, it’s a great way to support collaboration between colleagues. One of the first requirements learned in computer science is to have a mechanism for version control of code, but in the Data Science and ML process, more than just code requires versioning. Notebooks, data, and the environment being used also need version control.

  • Notebook versioning. Versioning your notebook is a must for keeping track of not only your code but also the results of each experiment run. If you intend to share your notebook and collaborate on it, you will want to ensure that you and your peers do not overwrite each other’s work or introduce mistakes.
  • Data versioning. Control of data is of utmost importance in ML. Data version control allows for managing large datasets, reproducing projects, and letting scientists take advantage of new features while reusing existing ones. Another advantage is that users do not have to remember which model uses which dataset, which mitigates risk to model results. One way to version data is to save incoming data in specific locations with metadata tagging (or labeling) and logging, so that old and new versions can be told apart.
  • Environment versioning. This type of versioning can mean a couple of things: the infrastructure configuration and the specific frameworks being used. You will want a good approach for versioning both, as this is also a crucial step in ensuring your experiments are compared like for like. For example, if your experiments use TensorFlow, you will need to ensure the same framework, at the same version, is used in every run you compare. Another example is when you want to promote your experiments from Development to a Staging environment and run automated tests: you would need to ensure the Staging environment matches all the configurations that were used in Development. A good practice is to capture the setup steps in a script or some other automated process to avoid missteps.
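
As a rough illustration of the data versioning idea above, here is a minimal sketch that tags each incoming data file with a content hash and some metadata, then appends the record to a log. The function names, the JSON Lines registry file, and the tag fields are all assumptions made for this example; dedicated data versioning tools offer far more than this.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_dataset(path: Path, registry: Path, tags: dict) -> dict:
    """Append one metadata record for a dataset file to a JSON Lines registry.

    `registry` and `tags` are hypothetical conventions for this sketch.
    """
    record = {
        "path": str(path),
        "sha256": fingerprint(path),  # changes whenever the content changes
        "size_bytes": path.stat().st_size,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "tags": tags,
    }
    with registry.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Because the hash is derived from the file contents, two records with different hashes are different dataset versions even if the file name did not change, which is what lets you tell old data from new.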


Commit Tracking

Code commits require versioning to mitigate the risk of merging production code with non-production code, as well as the risk of overwriting your peers’ code and making other detrimental mistakes. What happens if you run an experiment in between commits and forget to commit the code first? This is dubbed a “dirty commit,” and it occurs when developers don’t follow development best practices: the experiment’s results can no longer be tied to an exact version of the code. One best practice in this scenario is to have users create a snapshot of their environment and code before running an experiment. This way, they have the option of rolling back their changes to the code and configurations before experimenting.
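
One way to approximate the snapshot step is sketched below, assuming the project lives in a Git repository and `git` is on the path. It records the current commit and refuses to start an experiment when there are uncommitted changes. The function names and the `allow_dirty` flag are hypothetical, chosen only for this example.

```python
import subprocess
import sys


def summarize_worktree(head: str, porcelain: str) -> dict:
    """Build a snapshot record from the output of two git commands.

    `head` is the output of `git rev-parse HEAD`; `porcelain` is the output of
    `git status --porcelain`, where each line is two status characters, a
    space, and then the path.
    """
    changed = [line[3:] for line in porcelain.splitlines() if line.strip()]
    return {
        "commit": head.strip(),
        "dirty": bool(changed),
        "changed_paths": changed,
        "python": sys.version.split()[0],
    }


def git_output(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def snapshot_before_experiment(allow_dirty: bool = False) -> dict:
    """Record the current commit; by default, refuse to run on uncommitted code."""
    record = summarize_worktree(
        git_output("rev-parse", "HEAD"),
        git_output("status", "--porcelain"),
    )
    if record["dirty"] and not allow_dirty:
        raise RuntimeError(f"Uncommitted changes: {record['changed_paths']}")
    return record
```

Storing the returned record alongside the experiment’s results gives you a commit to roll back to later.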


Hyperparameter Tracking

ML models have hyperparameters that control the behavior of the training process and have a great impact on how the model performs. To find the combination of hyperparameters that gives the best results, you will find yourself running many experiments. Keeping track of the hyperparameters used for each experiment can become cumbersome; consequently, many scientists find themselves re-running experiments because they forgot which combinations they already tried. A best practice for experimenting with hyperparameters is to incorporate a tracking process. One way to track is to log everything, via audit logging or some other form of logging that saves the parameters for every experiment.
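
A tracking process does not have to be elaborate. The sketch below appends every hyperparameter combination, together with its results, to a log file, and can tell you whether a combination has already been tried. The file format (JSON Lines), the function names, and the example parameter names are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path


def params_key(params: dict) -> str:
    """Stable identifier for a hyperparameter combination, independent of key order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def already_run(log_path: Path, params: dict) -> bool:
    """Return True if this exact combination is already in the log."""
    if not log_path.exists():
        return False
    key = params_key(params)
    with log_path.open(encoding="utf-8") as handle:
        return any(
            json.loads(line)["key"] == key for line in handle if line.strip()
        )


def log_run(log_path: Path, params: dict, metrics: dict) -> str:
    """Append one experiment record (parameters plus results) as a JSON line."""
    key = params_key(params)
    record = {"key": key, "params": params, "metrics": metrics}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return key
```

Calling `already_run` before launching a job is a cheap guard against repeating an experiment you have simply forgotten about.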


Metrics

What metrics should you track and save? Best practice: all of them. The metrics that matter can change from day to day or over a longer period, depending on the use case and situation. For example, measuring the performance of your current experiment may involve looking at a confusion matrix and the distribution of predictions, but if you only logged the distribution, you would have no record of the confusion matrix and would waste time re-running the same experiment to gather it. Another example of metric loss is not tracking the timestamps of the data being collected; without them, you may experience model decay and be unable to apply proper model retraining techniques. If you track only specific metrics, you can miss out on discoveries; proactively logging as many metrics as possible helps avoid wasted time in the future.
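
To make the example above concrete, here is a minimal sketch that builds a single record containing the confusion matrix, the distribution of predictions, and a timestamp, so that none of the three has to be regenerated later. The record layout and function names are assumptions for this example, not a prescribed schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def confusion_matrix(y_true: list, y_pred: list) -> dict:
    """Count (actual, predicted) label pairs, keyed as 'actual->predicted'."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter(zip(y_true, y_pred))
    return {
        f"{actual}->{predicted}": n
        for (actual, predicted), n in sorted(counts.items())
    }


def metrics_record(y_true: list, y_pred: list) -> dict:
    """Collect several metrics in one JSON-serializable, timestamped record."""
    total = len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    distribution = Counter(y_pred)
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "n_samples": total,
        "accuracy": correct / total if total else None,
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "prediction_distribution": {
            str(label): n / total for label, n in sorted(distribution.items())
        },
    }


# Example: the record can be written out as one line of a metrics log.
line = json.dumps(metrics_record([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```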

A/B Testing

This form of testing is widely used by scientists to run different models against each other and compare their performance on real-time data in a controlled environment. A best practice is to follow steps modeled on the scientific method:

  • Form your hypothesis. For ML, you will want a null hypothesis (which states that there is no difference between the control and variant groups) and an alternative hypothesis (which states that there is a difference, the outcome you are testing for).
  • Set up your control group and test group. Your control group would receive results from Model A, and your test group would receive results from Model B. You would then draw the data for each group by random sampling, at a sample size specified in advance.
  • Perform A/B testing. How you run your A/B tests depends on your use case and requirements. We at Wallaroo provide three modes of experimenting for testing:
    • Random Split: This allows you to perform a randomized control trial type experiment where incoming data is sent to each model randomly. You can specify the percentage of requests each model receives by assigning a ‘weight’. Weights are automatically normalized into a percentage for you, so you don’t need to worry about them adding up to a particular value. You can also specify a meta key field to ensure consistent handling of grouped requests. For example, you can specify a split_key of ‘session_id’ to make sure that requests from the same session are handled by the same (randomly chosen) model.
    • Key Split: This allows you to choose specifically which model handles requests for a user (or group). For example, in a credit card fraud use case, if you want all ‘gold’ card users to go to one fraud prediction model and all ‘black’ card users to go to another, you would specify ‘card_type’ as the split_key.
    • Shadow Deploy: This allows you to test new models without removing the default/control model. This is particularly useful for “burn-in” testing a new model with real-world data without displacing the currently proven model.
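
The routing logic behind these three modes can be sketched in a few lines of plain Python. To be clear, this is not the Wallaroo API: every function name and signature below is hypothetical, and the sketch only illustrates the ideas of weight normalization, consistent handling by split_key, key-based routing, and shadowing. In particular, it uses a hash of the key value as a stand-in for a random but consistent choice per session.

```python
import hashlib
import random
from typing import Callable, Optional


def normalize(weights: dict) -> dict:
    """Scale arbitrary positive weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {model: w / total for model, w in weights.items()}


def pick_by_fraction(shares: dict, fraction: float) -> str:
    """Map a number in [0, 1) to a model according to cumulative shares."""
    cumulative = 0.0
    for model, share in shares.items():
        cumulative += share
        if fraction < cumulative:
            return model
    return list(shares)[-1]  # guard against floating-point rounding


def random_split(request: dict, weights: dict, split_key: Optional[str] = None) -> str:
    """Choose a model at random by weight; with a split_key, choose consistently."""
    shares = normalize(weights)
    if split_key is not None and split_key in request:
        # Hashing the key value sends every request with the same value
        # (for example, the same session_id) to the same model.
        digest = hashlib.sha256(str(request[split_key]).encode("utf-8")).digest()
        fraction = int.from_bytes(digest[:8], "big") / 2**64
    else:
        fraction = random.random()
    return pick_by_fraction(shares, fraction)


def key_split(request: dict, split_key: str, routes: dict, default: str) -> str:
    """Route by the value of one field, falling back to a default model."""
    return routes.get(request.get(split_key), default)


def shadow_deploy(request: dict, control: Callable, shadows: dict, shadow_log: list):
    """Return the control model's answer; record shadow outputs for comparison."""
    for name, model in shadows.items():
        shadow_log.append((name, model(request)))
    return control(request)
```

For example, `random_split(request, {"model_a": 2, "model_b": 1}, split_key="session_id")` gives model_a roughly two thirds of sessions, while `key_split(request, "card_type", {"gold": "fraud_gold", "black": "fraud_black"}, "fraud_default")` mirrors the card-type example above (the model names here are made up).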

Coming up with an effective experimentation strategy can be cumbersome, but following some best practices will assist in proper planning. By including versioning, commit tracking, metrics, hyperparameter tracking, and A/B testing, you will be able to keep track of all the information and results of your experiments, make the needed comparisons, and be confident that you know which setup produced the best results.

About Wallaroo.

Wallaroo enables data scientists and ML engineers to deploy enterprise-level AI into production more simply, faster, and with incredible efficiency. Our platform provides powerful self-service tools, a purpose-built ultrafast engine for ML workflows, observability, and an experimentation framework. Wallaroo runs in cloud, on-prem, and edge environments while reducing infrastructure costs by 80 percent.

Wallaroo’s unique approach to production AI gives any organization the desired fast time-to-market, audited visibility, scalability – and ultimately measurable business value – from their AI-driven initiatives, and allows data scientists to focus on value creation, not low-level “plumbing.”