Unlocking Scalable ML: A Deep Dive into Autoscaling in MLOps
May 6, 2022

Autoscaling in MLOps is a big deal for both the engineers working behind the scenes and the data scientists they support. It keeps workloads running smoothly, saves money, and lets data scientists focus on crafting intelligent models instead of worrying about the nitty-gritty of resource management.
As the name suggests, autoscaling ("auto" + "scaling") is an automation capability for managing the computational resources of a given technology environment. Autoscaling allows resource availability and utilization to be adjusted dynamically in response to incoming data traffic and user workloads.
Autoscaling requires minimal human intervention to manage the provisioning and utilization of the computational resources needed to run a given software application. If existing resources are at peak capacity, autoscaling automatically provisions more, so there is always enough capacity to meet compute demand. Inversely, autoscaling decommissions resources that are no longer needed to support a given application.
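To make that decision concrete, here is a minimal Python sketch of the core sizing logic. The per-instance capacity and the instance bounds are hypothetical values chosen for illustration; real autoscalers from cloud providers apply far more sophisticated policies.

```python
# A minimal illustration (not any vendor's actual algorithm) of the core
# autoscaling decision: size capacity to demand, within fixed bounds.
import math

def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20) -> int:
    """Return how many instances are needed to serve the current load."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(needed, max_instances))

# Traffic spikes from 120 to 900 requests/sec; each instance handles
# roughly 100 requests/sec, so the fleet grows from 2 to 9 instances.
print(desired_instances(120, 100))  # -> 2
print(desired_instances(900, 100))  # -> 9
```

The same idea applies whether the units being scaled are virtual machines, containers, or replicas of a model server.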
Now that we have an understanding of what autoscaling in MLOps is, let's dive into the value autoscaling provides in the context of MLOps, both for data scientists and for ML/infrastructure engineers. For enterprises with maturing infrastructure operations, autoscaling offers three key benefits:
- Usability: Users of software applications benefit from fast response times and an optimal experience, because the underlying infrastructure accommodates the load and user requests at any given point.
- Operational efficiency: Infrastructure engineers don't want to monitor dashboards all day to know when to add or remove computational resources. With fairly simple configurations, compute resources are allocated dynamically based on incoming traffic (users, data, etc.) and demand.
- Cost savings: You only pay for resources when they are needed, which results in significant savings in the long run for enterprises operating at scale.
In the context of AI/ML initiatives, managing computational resources is critical for processing the large amounts of data involved in training and deploying complex machine learning models. Data scientists need an optimal user experience, with the compute at their disposal used efficiently, so they can iterate quickly on experiments. ML and infrastructure engineers, on the other hand, need to support daily ML operations efficiently by allocating computational resources to data scientists dynamically.
Most MLOps deployment solutions leverage the autoscaling capabilities of their host environment (e.g., AWS Auto Scaling). These are often repurposed DevOps tools: they certainly help optimize the overall infrastructure, but they don't quite fit the unique needs of data scientists deploying their machine learning models to production. Such tools can scale for overall infrastructure load, but if a model is deployed inside a container, they cannot auto-scale the compute allocated to that container to the specific complexity of the model or models running inside it. In practice, the container may have too little compute, so inference runs too slowly and data scientists looking to deploy and manage models at scale in production get a poor experience. Inversely, the container may have too much compute allocated, so simple models use more resources than they need to run.
How does Wallaroo support autoscaling in MLOps?
Wallaroo supports two forms of autoscaling:
- Wallaroo underlying compute autoscaling: Scaling the underlying resources up and down based on overall load
- Wallaroo engine autoscaling: Scaling the resources allocated to each ML pipeline inside Wallaroo based on the specific load for each model or inference
Underlying compute autoscaling involves adding or removing virtual machines to or from the nodes available to the Wallaroo cluster. This is handled by Kubernetes and the various cloud providers. Each cloud or computational infrastructure provider offers slightly different controls, but in general resources are added dynamically when the CPU utilization of the existing resources exceeds a certain threshold, and removed when utilization stays below a given threshold for long enough.
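As an illustration of that threshold behavior, here is a simplified Python sketch with hypothetical thresholds and a sustained-low-utilization window; the actual controls exposed by Kubernetes and each cloud provider differ in the details.

```python
# A simplified sketch of the threshold logic described above, with
# made-up thresholds; real node autoscaling controls are more nuanced.
from collections import deque

SCALE_UP_THRESHOLD = 0.80    # add a node if average CPU exceeds 80%
SCALE_DOWN_THRESHOLD = 0.30  # remove a node only if CPU stays below 30%...
SCALE_DOWN_WINDOW = 10       # ...for 10 consecutive measurements

recent_utilization = deque(maxlen=SCALE_DOWN_WINDOW)

def node_scaling_action(avg_cpu_utilization: float) -> str:
    """Decide whether to add a node, remove a node, or leave the cluster alone."""
    recent_utilization.append(avg_cpu_utilization)
    if avg_cpu_utilization > SCALE_UP_THRESHOLD:
        return "add node"               # scale up immediately under heavy load
    if (len(recent_utilization) == SCALE_DOWN_WINDOW
            and max(recent_utilization) < SCALE_DOWN_THRESHOLD):
        recent_utilization.clear()      # reset the window after acting
        return "remove node"            # scale down only after sustained low utilization
    return "no change"
```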
Wallaroo engine autoscaling helps optimize hardware utilization by automatically adjusting the resources allocated to each ML pipeline based on its load relative to other pipelines (to learn more about Wallaroo pipelines, click here). For example, if a recommendation engine has models trained for different geographic regions, the amount of traffic to each region-specific pipeline will vary over the course of a day. Even if the total amount of compute required remains roughly constant, the traffic to each pipeline may vary considerably, and the Wallaroo engine scales the number of CPUs allocated to each pipeline accordingly.
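The sketch below illustrates the general idea of load-proportional allocation: splitting a fixed CPU budget across pipelines according to the traffic each one is currently receiving. It is not Wallaroo's actual implementation, and the pipeline names, request rates, and CPU budget are made up for the example.

```python
# An illustrative sketch of load-proportional CPU allocation across pipelines.
def allocate_cpus(requests_per_pipeline: dict[str, float],
                  total_cpus: int,
                  min_cpus_per_pipeline: int = 1) -> dict[str, int]:
    """Split a fixed CPU budget in proportion to each pipeline's traffic."""
    total_requests = sum(requests_per_pipeline.values()) or 1.0
    allocation = {}
    for pipeline, requests in requests_per_pipeline.items():
        share = round(total_cpus * requests / total_requests)
        allocation[pipeline] = max(min_cpus_per_pipeline, share)
    return allocation

# Morning: most traffic hits the EMEA recommendation pipeline; evening: APAC.
print(allocate_cpus({"recs-emea": 800, "recs-apac": 150, "recs-amer": 50}, 16))
# -> {'recs-emea': 13, 'recs-apac': 2, 'recs-amer': 1}
print(allocate_cpus({"recs-emea": 100, "recs-apac": 700, "recs-amer": 200}, 16))
# -> {'recs-emea': 2, 'recs-apac': 11, 'recs-amer': 3}
```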
As a result, Wallaroo allocates computational resources based on the incoming data and user traffic in the platform. Additionally, when many ML pipelines are deployed at scale, the Wallaroo engine can optimize the utilization of allocated computational resources across all active pipelines. This gives data scientists deploying and managing models at scale a seamless experience: they continue to get real-time insights from their deployed models with no interruption.
Interested in seeing how Wallaroo’s autoscaling can generate faster results while saving compute costs? Try our free Community Edition.