
How to Accelerate Computer Vision Model Inference

Image courtesy of Max Gruber

Accelerate computer vision model inference for real-time applications with Wallaroo, ensuring high performance without sacrificing accuracy. If you are having trouble deploying compute-intensive models for real-time or edge use cases, reach out to us for a free consultation.

Computer Vision Model Inference Challenges

Although research in computer vision has been around for forty years, it took the advent of deep neural nets to make computer vision practical for more sophisticated tasks. Today, computer vision models are used for a variety of business applications, in many different verticals. 

In retail, computer vision is used for tasks like automated checkout, shopper movement analysis, and inventory tracking. Heavy industry applications include parts inspection and robotic assembly. Computer vision is also used in healthcare to aid medical image analysis, and in agriculture to monitor crop development.

While deep learning models are highly effective for computer vision problems, they present a number of challenges for both development and deployment. Deep nets work so well in rich, unstructured task domains like vision because they can find and express complicated relationships and patterns in the data. To accomplish this, the model structure is often quite complex, with millions (and in some problem domains, billions) of parameters that capture concepts learned from high-volume data sets.

For example, the well-known ResNet50 image classification model has approximately 26 million parameters and is typically trained on the ImageNet classification subset: about 1.28 million images in 1,000 categories, drawn from the full ImageNet collection of roughly 14 million images. VGG16 has about 138 million parameters and was trained on the same data set.
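
To give a feel for what those parameter counts mean in practice, here is a back-of-the-envelope sketch of the storage needed for the weights alone. It assumes each parameter is stored as a 32-bit (4-byte) float and ignores activations, optimizer state, and file-format overhead; the function name `weight_megabytes` is ours, purely for illustration.

```python
# Rough storage estimate: parameter count times bytes per parameter.
# The counts are the approximate figures quoted in the text.
PARAM_COUNTS = {"ResNet50": 26_000_000, "VGG16": 138_000_000}


def weight_megabytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Size of the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


for name, count in PARAM_COUNTS.items():
    as_float32 = weight_megabytes(count)                  # 4 bytes per weight
    as_int8 = weight_megabytes(count, bytes_per_param=1)  # 1 byte per weight
    print(f"{name}: about {as_float32:.0f} MB as float32, {as_int8:.0f} MB as int8")
```

Under these assumptions ResNet50 needs roughly 104 MB for its weights and VGG16 roughly 552 MB, which is a meaningful amount of memory on a small edge device.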

Techniques like transfer learning (fine-tuning pre-trained models) alleviate the training data volume and computational issues around training a deep net. But these large, complex computer vision models often have fairly heavy computational requirements during inference as well. This can make them hard to deploy in real-time, edge, or other production environments with either time or resource constraints.

In this article, we’ll look at some techniques for making computer vision models lean and efficient in production.

Quantization and Pruning

Quantization and pruning are two common ways to “trim some fat” off an existing model. 

Int8 quantization reduces the size of a model by shrinking the representation of the weights of the neural net. These weights are typically represented as 32-bit floating-point numbers. Converting the floating-point representation to 8-bit (or even smaller) integers not only saves space, but also changes the mathematical calculations from floating-point to integer arithmetic. The result is a smaller and faster model. However, the quantized model can also be less accurate, because each weight is stored with less precision.
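
The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular framework implements quantization: it assumes a symmetric scheme with a single shared scale factor for all weights, and the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: symmetric quantization, so the largest magnitude maps to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Notice that the very small weight (0.003) rounds to zero. That loss of precision is the source of the accuracy drop mentioned above, and it is why production toolchains usually calibrate scales per layer or per channel rather than using one global scale.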

In pruning, another program is used to search for the subset of the original network that contributes the most to the decisions that the model makes, and then tries to remove the parts that are not necessary for the task at hand. For example, a model trained on ImageNet data was trained on a thousand categories including many different breeds of dogs. If the part of the network that knows about dogs is not important to your use case, it may be possible to prune it out and save on space and computation. However, this comes at the cost of increased effort, as one must find the relevant parts of the network, and then restructure the network to account for the pruned parts.
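
As a simplified stand-in for the search described above, the sketch below uses magnitude pruning, a different and much cruder criterion: it assumes that the weights with the smallest absolute values contribute least and sets them to zero. The function name `magnitude_prune` and the `sparsity` parameter are hypothetical, chosen only for this example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Rank weight positions from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_positions = set(order[:num_to_prune])
    return [0.0 if i in pruned_positions else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Zeroing weights does not by itself make the model smaller or faster. The savings only appear once the network is restructured, or stored and executed in a sparse format, which is the extra effort the paragraph above refers to.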

Knowledge Distillation

Knowledge distillation, or knowledge transfer, is an interesting technique in which a small model is trained to reproduce the output of a larger one. It’s an approach used to build small, high-performing models for use in limited-resource environments, like mobile phones and embedded devices.

Knowledge distillation starts with a larger, “heavyweight” network, the teacher, that has been trained to learn the original task. Another, smaller network, the student, is then trained to learn the same concepts from the teacher. This smaller model often has a network structure similar to the teacher’s, but with fewer parameters, as in this example. The student can be trained on the same training data as the teacher, or on a different data set.

[Figure: an example teacher network and a smaller student network. Source: Gou, 2021]

In response-based distillation, the most straightforward variation of knowledge transfer, the student model tries to reproduce the last layer of the teacher model. Other variations also try to reproduce intermediate layers or even the relationships between adjacent layers.
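
A response-based training signal can be sketched as follows. This is one common formulation and not necessarily the one used in any specific system: it assumes the student is penalized by the KL divergence between the teacher's and the student's output distributions, both softened with a temperature. The temperature value and the function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened outputs to the student's."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]        # last-layer outputs of the teacher
close_student = [3.8, 1.1, 0.4]  # a student that nearly matches the teacher
far_student = [0.5, 4.0, 1.0]    # a student that disagrees with the teacher

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:", distillation_loss(teacher, far_student))
```

During training, the student's weights would be updated to drive this loss toward zero, often in combination with the ordinary loss on the true labels.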

Knowledge distillation can produce smaller models that match, and sometimes exceed, the accuracy of their larger teacher models. However, this approach does require the overhead of an additional training round, and possibly some experimentation to find the best architecture for the student model. For a recent survey of knowledge distillation techniques, see [Gou, 2021].

Use a High-Performance Compute Engine

Rather than (or in addition to) downsizing the original model, you can speed up a model by running it on a high-performance compute engine, like Wallaroo. Wallaroo specializes in the last mile of the ML process: deployment. The Wallaroo platform is built around a high-performance, scalable Rust engine that is specialized for fast, high-volume computational tasks. The platform is designed to integrate smoothly into your data ecosystem and can be run on-prem, in the cloud, or at the edge.

With one customer who deploys computer vision models in a low-resource environment, Wallaroo was able to more than double the computer vision model inference throughput and reduce latency by almost half! In other cases, we have seen up to a 12.5X improvement in inference speed and an 80% reduction in the cost of computational resources.

With Wallaroo, businesses can easily deploy complex models and achieve the accuracy they need, without sacrificing speed and performance. There’s less need for additional, time-consuming model optimization steps.

Using a high-performance compute engine to accelerate computer vision model inference can help you get both accuracy and speed in your real-time applications. You can read more about the Wallaroo platform at our blog. If you are interested in finding out how Wallaroo can improve your model deployment process, reach out to us to learn more.
