5 learnings from classifying 500k customer messages with LLMs vs traditional ML
LLMs can help you achieve state-of-the-art results on text classification tasks (sentiment analysis, labeling, etc.) just by giving them the list of classes and a description of each.
We recently reached our 500k classification milestone using LLMs and fine-tuned models 🎉, and below we share what we've learned.
1. LLMs prefer outputting something over nothing.
LLMs are trained to generate text, so producing no answer at all is hard for them. This was a major cause of false positives, which we managed to correct once we added a catch-all class like "other" or "none-of-these".
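A minimal sketch of the catch-all idea, assuming a plain-text prompt format. The function name `build_prompt`, the prompt wording, and the example classes are hypothetical and only illustrate where the opt-out class goes.

```python
def build_prompt(message: str, classes: dict[str, str],
                 catch_all: str = "none-of-these") -> str:
    """Build a classification prompt that always offers an explicit opt-out class."""
    options = dict(classes)
    # Add the catch-all last so the model has a valid answer when nothing fits.
    options.setdefault(catch_all, "The message does not match any class above.")
    lines = [f"- {name}: {description}" for name, description in options.items()]
    return (
        "Classify the customer message into exactly one of these classes:\n"
        + "\n".join(lines)
        + f"\n\nMessage: {message}\nAnswer with the class name only."
    )


# Hypothetical classes, for illustration only.
prompt = build_prompt(
    "What's the weather like today?",
    {"billing": "Questions about invoices or charges.",
     "refund": "Requests to get money back."},
)
print(prompt)
```

Without the last option, the model has to pick "billing" or "refund" for a message that is neither.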
2. Hallucinations are more helpful than you think
Keeping track of hallucinations can help you understand which class names work well and which don't. In the example above, the LLM hallucinates the label "technical-issue", even though the actual label name in the prompt is "other/technical-issue".
Try to keep class names as simple and direct as possible. There may be other techniques that reduce the LLM's bias toward the class name itself by replacing class names with symbols (called symbol tuning), but this research is still ongoing.
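One way to keep track of hallucinated labels is to compare every raw LLM output against the allowed label set and tally the misses. This is a sketch under assumptions: the label set, the `audit_labels` helper, and the sample outputs are hypothetical.

```python
from collections import Counter

# Hypothetical label set; only "other/technical-issue" comes from the example above.
ALLOWED = {"billing", "refund", "other/technical-issue"}


def audit_labels(raw_outputs, allowed):
    """Split raw LLM outputs into valid labels and a tally of hallucinated ones."""
    valid, hallucinated = [], Counter()
    for raw in raw_outputs:
        label = raw.strip().lower()
        if label in allowed:
            valid.append(label)
        else:
            valid.append(None)  # keep positions aligned with the inputs
            hallucinated[label] += 1
    return valid, hallucinated


outputs = ["billing", "technical-issue", "Technical-Issue ", "refund"]
valid, hallucinated = audit_labels(outputs, ALLOWED)
print(hallucinated.most_common())
```

A label that keeps showing up in the tally, like "technical-issue" here, is a hint that the name in the prompt is more complicated than the one the model wants to use.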
3. To save costs and improve latency, fine-tuned classification models are the answer. Combining them with LLMs can be even more powerful.
One customer needed to process some data at lower latency than LLMs could provide, and the only way we could do this was by forgoing ChatGPT usage altogether. To do this, we trained SBERT using contrastive learning on ChatGPT-labeled data. On multi-label classification across the customer’s 22 classes, the model reached around 85% parity with ChatGPT, and greater than 90% on a subset of those classes. The cost breakdown is below.
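Apart from the cost figures, the idea of combining a fine-tuned model with an LLM can be sketched as a confidence-based fallback. This is an assumption about how such a combination might look, not a description of our production setup. It is single-label for brevity, and `classify_hybrid`, the threshold of 0.8, and both stand-in models are hypothetical.

```python
from typing import Callable, Dict, Tuple


def classify_hybrid(
    text: str,
    fast_model: Callable[[str], Dict[str, float]],
    llm_fallback: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, source); call the LLM only when the fast model is unsure."""
    scores = fast_model(text)
    label, score = max(scores.items(), key=lambda item: item[1])
    if score >= threshold:
        return label, "fine-tuned"
    return llm_fallback(text), "llm"


# Stand-ins for a real fine-tuned classifier and a real LLM call.
def fake_fast_model(text: str) -> Dict[str, float]:
    if "invoice" in text:
        return {"billing": 0.95, "refund": 0.05}
    return {"billing": 0.5, "refund": 0.5}


def fake_llm(text: str) -> str:
    return "refund"


confident = classify_hybrid("Where is my invoice?", fake_fast_model, fake_llm)
unsure = classify_hybrid("I want my money back", fake_fast_model, fake_llm)
print(confident, unsure)
```

The threshold trades cost for accuracy: a higher value sends more traffic to the slower, more expensive LLM.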
4. LLM reasoning improves accuracy
This is a well-known prompt engineering technique, but it bears repeating: chain-of-thought prompting tends to improve accuracy in text classification (https://arxiv.org/pdf/2305.08377.pdf). Prompting the LLM to extract a set of clues before doing the classification can yield state-of-the-art accuracy (96%+ on benchmarks). Below is an example from the paper.
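Independently of the paper's example, a clue-then-classify prompt might be assembled roughly as follows. The wording of the steps, the `Label:` output convention, and both helper names are assumptions for illustration, not the paper's exact prompt.

```python
from typing import List, Optional


def build_clue_prompt(text: str, classes: List[str]) -> str:
    """Ask for clues and reasoning first, and the label only at the very end."""
    return (
        f"Text: {text}\n"
        f"Possible classes: {', '.join(classes)}\n\n"
        "Step 1. List the clues (keywords, tone, phrases) relevant to the classes.\n"
        "Step 2. Reason briefly about what the clues imply.\n"
        "Step 3. On the last line, write 'Label: <class>' using one of the possible classes."
    )


def parse_label(response: str, classes: List[str]) -> Optional[str]:
    """Return the label from the last 'Label:' line, or None if it is not allowed."""
    for line in reversed(response.strip().splitlines()):
        line = line.strip()
        if line.lower().startswith("label:"):
            candidate = line.split(":", 1)[1].strip()
            return candidate if candidate in classes else None
    return None


classes = ["billing", "refund"]
prompt = build_clue_prompt("I was charged twice this month.", classes)
# A made-up response, standing in for what an LLM might return.
response = "Clues: 'charged twice'\nReasoning: this is about a charge.\nLabel: billing"
print(parse_label(response, classes))
```

Putting the label last forces the model to write its clues and reasoning before committing to a class, and the parser rejects any label outside the list.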
5. Standardizing input is key for both fine-tuned models and LLMs.
The more text there is to classify, the less accurate the prediction.
Imagine you have a long-running chat thread between your bot and a user. At every step, you’re attempting to determine the user's intent. To improve prediction accuracy, we apply a preprocessing step that paraphrases the last user message using the previous context. This helps when classifying chat messages with multiple contexts, emails, long documents, and non-English messages.
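The paraphrasing step could be driven by a prompt like the one below. This is a sketch under assumptions: `build_rewrite_prompt`, the six-turn window, the instruction to answer in English, and the sample conversation are all hypothetical, and the actual LLM call is left out.

```python
from typing import List, Tuple


def build_rewrite_prompt(history: List[Tuple[str, str]], max_turns: int = 6) -> str:
    """Ask an LLM to restate the last user message so that it stands on its own."""
    if not history or history[-1][0] != "user":
        raise ValueError("history must end with a user message")
    recent = history[-max_turns:]  # keep only the most recent turns as context
    transcript = "\n".join(f"{role}: {text}" for role, text in recent[:-1])
    last_message = recent[-1][1]
    return (
        "Rewrite the final user message as one self-contained sentence in English, "
        "resolving pronouns and references using the conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final user message: {last_message}\n\n"
        "Rewritten message:"
    )


history = [
    ("user", "My order hasn't arrived yet."),
    ("bot", "Sorry about that. Would you like a refund or a replacement?"),
    ("user", "the second one"),
]
prompt = build_rewrite_prompt(history)
print(prompt)
```

The classifier then sees the rewritten sentence instead of "the second one", which carries no intent on its own.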
We built Gloo to help others solve text-classification problems in a more automated way. Gloo allows you to:
- Create an LLM classifier with prompt engineering that is guaranteed to output only your specified classes
- Quantifiably measure how prompt changes impact production (latency, accuracy, biases in class selection)
- Train + deploy a traditional BERT-based classifier based on your LLM data
- Build a classifier that combines both the trained model + LLM for any new classes you didn’t train on
** If you are interested in improving the accuracy, latency, or cost of your NLP text classifier, reach out to us at email@example.com and we'll get you started with a free trial. **