The Challenges of Deploying High-Performance NLP Models

March 24, 2022

Deploying high-performance NLP (Natural Language Processing) models in production environments entails navigating a complex landscape of computational challenges and technical intricacies. This leap from the drawing board to the real world is where many projects stutter. If you want to explore how Wallaroo can help deploy your specific model use cases, email us at

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a branch of AI that focuses on enabling computers to “understand” human language.  Its roots go back to the 1950s, to Alan Turing’s famous “Turing test” which determined if a machine demonstrated human intelligence by whether a machine could behave in a way indistinguishable from a human. Such a machine would have to both understand human language as well as generate plausible-sounding language.

Natural Language Processing Use Cases

Common tasks in NLP include topic classification, named entity recognition, sentiment analysis, text summarization, question-answering, and machine translation. The global market for NLP applications is expected to reach about $43B by 2025.  As with anything, this subfield of AI faces its challenges. A few areas where issues arise are:

  • Sarcasm
  • Phrase ambiguity
  • Slang
  • Bias
  • Regional languages

As with Computer Vision, Natural Language Processing is an AI research and development domain that has benefited greatly from advances in deep learning. While early NLP systems used linguistic, rule-based, and eventually statistical approaches, the most successful modern NLP systems are based on deep neural nets. In this blog, we’ll discuss one of the most popular architectures for NLP models and the challenges of deploying it.

NLP Transformer Models

These days, state-of-the-art NLP models are generally built around the Transformer architecture [Vaswani, 2017]. This architecture was designed to address some of the difficulties in building NLP models. Before transformers, state-of-the-art NLP models processed text and speech sequentially (token by token), so that processing longer utterances took more computational effort. Transformer models can process input in parallel, reducing the dependency on sequence length.

Transformers are sequence-to-sequence models: they map N inputs x to N outputs y. In NLP, sequence-to-sequence models are used for tasks like document completion and machine translation, although transformers have been used for other applications, as well. 

Transformer models have two main segments: the encoder and the decoder. You can think of these segments as:

  1. The encoder takes the input sequence and encodes it into an intermediate representation
  2. The decoder then takes that internal representation and decodes it into the desired output sequence.
The transformer-model architecture. Source: Vaswani,, 2017
The transformer-model architecture. Source: Vaswani,, 2017

The encoder and decoder have a similar structure, based on what’s called attention. Self-attention (a particular kind of attention) enables a model to better understand how tokens (words) in a sequence are related to each other, thus reducing the ambiguity in natural-language utterances. For example, in the sentence “I couldn’t put the package in the car because it was too big,” you want the model to understand that “it” refers to “the package” and not “the car.”  This is achieved by learning a matrix of weights, such that the mutual weights between “it” and “package” are larger than the weights between “it” and “car.”

I couldn't put the package in the car because it was too big
A notional example of attention weights from “it” to other words. The weights from “it” to “package” and “big” should be larger than the weight of “car”.

Transformer attention blocks are multi-headed, meaning every attention block has the opportunity to learn multiple matrices of weights. This means one block can represent multiple attention relationships at once. These attention blocks (along with a feed-forward layer) are then stacked, allowing the encoder or the decoder to learn successively more abstract representations of the text, similar to the way a deep vision model can learn successively abstract representations of an image (edges, then regions, then semantically meaningful structures like hair, or eyes…).

Transformer models can represent contextual relationships between words across longer distances than previous models and can do so in parallel. The architecture still has some limitations—for instance, it can only deal with fixed-length sequences—but it represents an important breakthrough in the design of NLP models.  For a deeper and more detailed description of transformers, see this excellent introduction by Christian Versloot.

Transformers and other NLP Models in Production

While transformers are extremely powerful, they also tend to be complex and large, with millions or even billions of parameters. This can pose challenges when deploying them to production, sometimes even requiring specialized GPU production engines.

In addition, NLP models that work on written text have a special challenge: models work on numbers and can’t be fed text directly. The input to text models must be tokenized in order to convert the words (or tokens) into numbers. These numbers are then represented in the model via a representation like one-hot-encoding or word embeddings

Tokenization requires a vocabulary and a mapping from every word in that vocabulary to a number. Word embeddings encode semantic relationships amongst the words in a vocabulary, which must be learned from a training text corpus. You can create your own vocabulary and train your own embeddings, or you can take advantage of pre-existing tokenization and embeddings. Either way, the important part is that the tokenization and embedding that you use to represent text while training your model must be the same ones used to encode text when running the model in production.

Deploying high-performance NLP models to production with Wallaroo

If you are using NLP models with high computational requirements, or are running in environments with time constraints on inference, Wallaroo’s high-performance ML platform can help.

The Wallaroo ML platform includes an efficient, low-footprint, event-by-event machine learning model execution engine that is specialized for fast, high-volume computational tasks. Wallaroo specializes in the last mile of the ML process—deployment—and is designed to integrate smoothly into your data ecosystem, so your data scientists can continue to develop models in the environments they most prefer. The Wallaroo platform can be run on-prem, in the cloud, or at the edge.

Wallaroo ai platform

And because we at Wallaroo think hard about that last mile, our model pipelines support the inclusion of data processing steps (like tokenization) into our deployment process. Data scientists can upload a model pipeline, including processing steps, into Wallaroo and then deploy that pipeline into a production environment with just a few lines of Python. Comprehensive logging lets data scientists and ML Engineers monitor those models and pipelines in production, either via the Wallaroo dashboard or a visualization applications of their own choosing.

With Wallaroo, you get drastically reduced time-to-deployment and speedier inference. Typically, with customer transformer models and other NLP models, we have seen 5X – 12.5X faster analysis using 80% less infrastructure compared to the customer’s previous deployments. Deployments that previously required GPUs can now run efficiently on more standard CPUs, resulting in substantial savings.

Learn More

If you want to explore how Wallaroo can help revolutionize deploying your specific model use cases, email us at