Large Language Models (LLMs) are a type of AI that mimics human intelligence by using statistical models to analyze large amounts of data and learn the patterns between words and phrases. Many companies are racing to implement these AI-driven solutions to improve efficiency, reduce workload, lower costs, and drive innovation that delivers business value. LLM customization is how companies take an off-the-shelf model and make it relevant to their use case. To select the right LLM integration approach for your application, it’s important to evaluate your use case’s needs and understand the nuanced differences between the options.
Thanks to the democratization of knowledge, it’s easier than ever to find information and opinions about AI and LLMs, but still hard to find qualified answers for your situation. In consulting conversations, our Head of Artificial Intelligence, Kevin Rohling, frequently encounters the misconception that fine-tuning a model is the only (or perhaps even the best) way for models to acquire new knowledge. Whether adding a co-pilot to your product or using an LLM to analyze a corpus of unstructured data sitting in a Box folder, your data and business context are important factors for choosing the right LLM approach. In most cases, your goals can be better achieved with strategies that are less operationally complex than fine-tuning, and significantly more robust to frequently changing datasets, while yielding more reliable and accurate results. In this post, we sat down with Kevin to dive into fine-tuning, learn how it works, explore its strengths and trade-offs, and discuss powerful alternatives that are often a better fit for many use cases.
What is Fine-Tuning?
“Fine-tuning a machine learning model” refers to the process of taking a pre-trained base model and further training it on a new dataset to improve its performance on a specific type of task. An example would be training a language model on The Stack, a popular dataset of code examples, to improve code generation. Another popular dataset is Flan 2, often used to give language models instruction following capabilities. When done with the right goals and datasets, fine-tuning can drastically improve model performance on targeted tasks. To see this in action, a quick look at the HuggingFace Open LLM Leaderboard reveals fine-tuned versions of Llama 2 models outperforming Meta’s original implementations, just days after Llama 2 became available. It’s easy to see that fine-tuning is valuable for certain use cases, but what exactly does fine-tuning entail?
How Do You Fine-Tune?
As any technologist knows, output quality is only as good as input quality. To fine-tune an LLM, Kevin recommends starting by identifying your goals expressed in quantifiable metrics, selecting your pre-trained model, and preparing your dataset. In more detail, your fine-tuning preparation should look something like this:
Identify your goals in metrics: Determine what capabilities you’re trying to optimize for (coding performance, instruction following abilities, joke telling, etc.) and express these in terms of measurable metrics that can be monitored during training.
Select your pre-trained model: Select the smallest model that will meet your application’s performance requirements, which may require some experimentation. Additionally, evaluate the licensing of any candidate model to make sure it is a fit for your product and business. The recently released Llama 2 series of models is a popular option given the permissive licensing, performance, and variety of model sizes.
Prepare your high-quality dataset: Odds are that if you are fine-tuning a model for your business, you will bring your own dataset. This is an often overlooked and underestimated part of the process. Care should be taken to curate as many high-quality examples as possible and remove unnecessary or incorrect data, as these will have a significant, negative impact on the model’s ability to learn. (A minimal sketch of loading and filtering a dataset follows this list.)
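As a rough illustration of that last step, here is a minimal sketch of dataset preparation using HuggingFace’s datasets library. The file name, field names, and length threshold are hypothetical placeholders for whatever your own data requires, not a prescription.

```python
from datasets import load_dataset

# Load your own examples from a JSONL file (file name and field names are hypothetical).
dataset = load_dataset("json", data_files="my_finetune_examples.jsonl", split="train")

# Drop records that are empty or suspiciously short; low-quality examples hurt learning.
def is_usable(example):
    prompt = (example.get("prompt") or "").strip()
    completion = (example.get("completion") or "").strip()
    return len(prompt) > 0 and len(completion) > 20  # threshold is arbitrary

dataset = dataset.filter(is_usable)

# Hold out a validation split so evaluation data never leaks into training.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), "training examples,", len(val_ds), "validation examples")
```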
With these foundations in place, the real fine-tuning training process begins, and generally proceeds in these steps:
Load the pre-trained model that you want to fine-tune.
Configure your hyperparameters, such as the number of training epochs, batch size, and learning rate. These settings determine how many times the model sees the dataset, how many examples it processes in each training step, and how large the updates to its weights are at each step.
Run your training loop. This is the code that actually executes the training process. HuggingFace’s Transformers provides a Trainer class that can simplify this (see the sketch after this list).
Evaluate the performance of the fine-tuned model on a validation dataset to assess its accuracy and quality. This is often done as part of your training loop at the end of each epoch. It’s very important to keep your validation dataset and training dataset separate and never mix them.
Iterate and optimize by fine-tuning the model multiple times, adjusting the hyperparameters and training data as needed to improve the model’s performance.
Save the fine-tuned model for future use and deployment in your applications.
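To make these steps concrete, here is a minimal sketch of a fine-tuning run using HuggingFace’s Trainer, reusing the train_ds and val_ds splits from the dataset-preparation sketch above. The model name and hyperparameter values are illustrative placeholders (any causal LM, such as a Llama 2 checkpoint, could be substituted), not a production recipe.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Illustrative model choice; swap in the pre-trained model you selected.
model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # some tokenizers (e.g. Llama's) lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# train_ds and val_ds are assumed to be datasets.Dataset objects with a "text" column.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_tokens = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)
val_tokens = val_ds.map(tokenize, batched=True, remove_columns=val_ds.column_names)

args = TrainingArguments(
    output_dir="finetune-output",
    num_train_epochs=3,             # how many times the model sees the dataset
    per_device_train_batch_size=4,  # examples processed per step, per device
    learning_rate=2e-5,             # how large each weight update is
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokens,
    eval_dataset=val_tokens,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()                               # run the training loop
print(trainer.evaluate())                     # score on the held-out validation set
trainer.save_model("finetune-output/final")   # save for future use and deployment
```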
What are the Benefits of Fine-Tuning?
The primary benefit of fine-tuning is the ability to improve an existing, pre-trained base model’s performance on a specific set of capabilities or impart new capabilities (e.g. fine-tuned Llama 2 models outperforming Meta’s original implementation). Additionally, whereas training a new model from scratch is financially intractable for most businesses, fine-tuning is much more cost-efficient.
What are the Disadvantages of Fine-Tuning?
There are several disadvantages to fine-tuning models. For starters, fine-tuning a model is relatively expensive compared to some alternatives, and it has a steep learning curve for teams without strong ML/AI skills. On top of the expense, fine-tuned models require re-training over time to integrate new data and maintain performance, which means a strategy must be in place for continuous learning to compensate for data and model drift. Additionally, fine-tuned models may experience Catastrophic Forgetting, a situation where the network forgets how to perform a previously learned task when it is trained on a new task, losing some of its useful capabilities. Another disadvantage relates to the pace of the field: new, more powerful models are released with extremely high frequency. Once you have chosen a pre-trained model and fine-tuned it, there is no way to fold the advancements of future models into your fine-tuned model; if you want those capabilities, you’ll have to repeat the fine-tuning process with the new models. Lastly, fine-tuning is unreliable for coaxing an LLM to learn factual information. Due to the tendency of LLMs to hallucinate (i.e. confidently give incorrect answers), fine-tuning a model on a corpus of data will not result in reliable responses across that dataset, which is a serious concern for businesses.
Considering the benefits and disadvantages of fine-tuning, it’s best not to use it to turn a model into a factual database. LLMs are best thought of as reasoning engines. To paraphrase Sam Altman: “If we think of LLMs as databases, they are the most unreliable and expensive databases we’ve ever built. However, if we think of them as reasoning engines, they are the most effective reasoning engines we’ve ever built.” (Weaviate.io)
What Resources are Required to Fine-Tune a Model?
Fine-tuning a model typically requires:
An off-the-shelf, pre-trained language model such as Llama 2, T5, or BERT.
Training or fine-tuning code (e.g. HuggingFace’s Trainer).
Access to one or more GPUs, depending on model size, with enough GPU memory to fit the model and data.
Time and budget for training and inference.
Optional tools like DeepSpeed for optimizing training and inference jobs.
Cleaned and prepared data for fine-tuning.
An environment set up with the necessary libraries and dependencies.
How Does Fine-Tuning Compare to Other Customization Approaches?
Fine-tuning isn’t the only option for customizing model outputs or integrating custom data, and it might not be the right one for your use case. Let’s explore some alternatives:
Prompt Engineering
Prompt Engineering is the process of providing detailed instructions or contextual data in the prompt sent to an AI model to increase the likelihood of receiving a desired output. Relative to fine-tuning, prompt engineering is significantly less operationally complex, and prompts can be modified and redeployed at any time without making any changes to the underlying model. Despite the relative simplicity of this strategy, it should still be approached with a data-driven mindset where the accuracy of various prompts is quantitatively evaluated to ensure the desired performance. Prompt Engineering isn’t without downsides. It cannot directly integrate large datasets, since everything must fit within the prompt and prompts are typically modified and deployed manually. Additionally, Prompt Engineering cannot generate new behaviors or capabilities that were not present in the underlying model’s training data.
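As a simple illustration, the sketch below expresses prompt engineering in code: the instructions and context live in a template rather than in the model’s weights, so they can be changed and redeployed at any time. The template wording and the call_llm function are hypothetical stand-ins for your own prompt and whichever LLM provider or local model you use.

```python
# Instructions and context are injected into the prompt, not trained into the model.
PROMPT_TEMPLATE = """You are a support assistant for a hypothetical company.
Answer using only the context below. If the answer is not there, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to whichever LLM API or local model you use.
    raise NotImplementedError

def answer(question: str, context: str) -> str:
    # Changing behavior means editing the template above and redeploying, nothing more.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return call_llm(prompt)
```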
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation is an approach to integrating large, unstructured datasets (e.g. documents) with LLMs that leverages semantic search and vector databases in combination with prompting mechanisms. While RAG is not a mechanism for generating new model capabilities, it is an extremely effective tool for integrating large, unstructured datasets. The biggest hindrance to RAG’s effectiveness is the limited context window of many models, which in some cases may prevent the model from receiving enough information to perform well. However, context windows are increasing quickly, with even some open-source models handling up to 32K tokens.
Want to try Retrieval-Augmented Generation (RAG) for yourself? Here are the steps, with a minimal code sketch after the list:
Use an Embedding Model to vectorize your corpus of data and store the results in a Vector Database.
At request/inference time, use the same Embedding Model to turn the user’s question into a vector and use it to search your Vector Database for semantic matches.
Take the top N matches from your semantic search, insert them into the LLM’s prompt along with the user’s question, and send the assembled prompt to the LLM.
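Here is a minimal end-to-end sketch of those three steps using the sentence-transformers library, with an in-memory array standing in for a real Vector Database. The embedding model, sample documents, and prompt wording are illustrative assumptions, and the final prompt would be sent to your LLM of choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embedding model choice is an assumption; any embedding model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: vectorize the corpus. A production system would store these vectors
# in a Vector Database; a numpy array stands in for one here.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, top_n: int = 2) -> list[str]:
    # Step 2: embed the question with the same model and search for semantic matches.
    q_vector = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vector  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_n]
    return [documents[i] for i in best]

def build_rag_prompt(question: str) -> str:
    # Step 3: put the top N matches into the prompt alongside the user's question.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("Can I get my money back after three weeks?"))
```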
Data Privacy: How Do the Three Options Compare?
Fine-tuning has the disadvantage that information is stored in the parameters of the model. This means that it is not possible to silo data based on user permissions. Additionally, research has shown that adversarial prompts can be crafted to extract training data from LLMs. As such, it should be assumed that any data used to train the model is available to any future user with prompt access to the model.
Prompt engineering has a much lower data security footprint as prompts can be siloed on a per-user basis. Care must be taken to ensure that any data included in a prompt is either non-sensitive or permissible for any user with access to the prompt.
Retrieval-augmented generation implementations are only as secure as the data access permissions present in the retrieval system. Care should be taken to ensure that underlying Vector Databases and prompt templates are configured with the appropriate privacy and data controls.
The Future: How Might LLM Advances Impact Fine-Tuning Approaches?
New options are on the horizon. Parameter Efficient Fine Tuning (PEFT) techniques such as LoRA and QLoRA have the potential to offer some of the benefits of fine-tuning with significantly more flexibility. This is because PEFT techniques freeze the weights of the underlying model during training and instead train a much smaller set of weights, called an Adapter. These adapters can be trained using a fraction of the resources required to fine-tune the whole model while still achieving comparable training performance. The resulting adapter is only 2-3% the size of the full model and can be quickly and easily “snapped” on and off of its larger counterpart to augment its capabilities at run time. This introduces opportunities for siloed or user-specific model capabilities. For applications requiring a number of customized LLM capabilities, maintaining and storing small, lightweight adapters is much less complex and expensive than operating a large number of full-sized fine-tuned models.
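As an illustration of how lightweight this can be, the sketch below attaches a LoRA adapter to a base model using HuggingFace’s peft library. The model name, rank, and target modules are illustrative choices, not recommendations, and training itself would proceed with your usual loop or the Trainer shown earlier.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; the same pattern applies to larger models.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)

# The base model's weights stay frozen; only the small adapter weights are trained.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a low single-digit percentage

# After training, save just the adapter, a small artifact that can be
# "snapped" onto the base model again at run time.
model.save_pretrained("my-lora-adapter")
```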
Strategic Business Alignment and ROI Considerations
Whether you choose fine-tuning, prompt engineering, or RAG, your approach should align with your organization’s strategic goals, resources, expertise, and ROI considerations. It’s not solely about technical capabilities but how these approaches align with business strategies, timelines, current processes, and market demands. Understanding the complexities of fine-tuning options is key to informed decision-making in product development. It’s critical to work with partners who understand the multivariate fine-tuning landscape and choose appropriate customization techniques that are not only technologically sound but also appropriate for your business processes and goals.
Conclusion
Fine-tuning large language models is swiftly becoming a pivotal, powerful tool in the arsenal of modern digital product development. As Kevin's insights reveal, the process is not a one-size-fits-all choice, but a nuanced decision that requires a blend of technological understanding, strategic alignment, and cost and complexity considerations. From the operationally heavy fine-tuning to the agile prompt engineering and retrieval-augmented generation options, each approach offers unique benefits and challenges. Embracing new technology without disrupting business, or building the plane while flying it, requires technical acumen and a deep understanding of the business context, something Presence takes pride in providing. At Presence, our expertise in full stack digital product development, and our experience delivering secure, stable, innovative enterprise projects have given us the insight that every digital project, including AI/ML projects, is most successful when innovative approaches are chosen, not for buzz, but because they are appropriate for the use case and business objective.