You can’t hit what you can’t see. A structured approach to quality for AI Products.
By Kevin Rohling
Welcome to 2024! I find that every new year is a fun time to review what I’ve learned and ask how I can apply those learnings to future product development. At Presence, we’ve been fortunate to partner with amazing clients on genuinely exciting projects, including several GenAI projects in 2023. Looking back, those projects pushed us toward a much more structured way of thinking about quality when building products with AI components.
The Evolution of QA Strategies in AI Products
As with all of the work we do, a project’s testing and QA strategy is best defined at the beginning, before any code gets written, for all the usual reasons (budgeting, resourcing, release timing, etc.). However, as AI systems gain more adoption, engineers and product teams (who are used to working with traditional software) are increasingly required to define quality requirements for what are essentially statistical systems.
Understanding the Unique Challenges of Quality Assurance in AI
To oversimplify a bit, we generally think about the quality of a software system in terms of Test Cases: “When I perform X Action under Y Conditions, does the expected post-condition Z occur?” Having faith that this assertion implies a properly functioning feature depends on the expectation of determinism. In other words, if I set up all the same pre-conditions in my production environment and execute the same steps, I expect that the same post-conditions will occur. We take this expectation so much for granted in traditional software that we don’t even think about it. When we do find a deviation, it’s almost always due to one of two things: 1) a genuine bug or 2) a set of pre-conditions we didn’t think about. The former generally gets filed as a ticket for the Developer to fix, and the latter results in a new Test Case that we now need to handle as part of our quality assessment. The idea that a Test Case will only pass some of the time undermines all of these assumptions!
Navigating Non-Determinism in GenAI Systems for Enhanced Product Quality
I will focus here on systems using only pre-trained models. If part of your system involves training models, the surface area for non-determinism increases. Even for “inference only” product features, non-determinism creeps into our product in several places:
How Model Parameters and User Inputs Affect AI Product Quality
Model Parameters: Different model architectures have different sets of inference parameters that influence the amount of randomness used to generate outputs. Temperature, for example, is a very common parameter in LLMs, while the choice of Guidance Scale is common for Diffusion Models. It’s important to note that “randomness” in a classical computer is actually pseudo-random, and the random values these parameters draw on can be pinned using what’s called a “seed” value. In other words, if you use the same “seed” value with the exact same inputs on the same model, you should get the exact same outputs (aka determinism!). OpenAI recently announced seed support as a new feature for their GPT-N models. As such, it’s actually possible to tune your generation parameters in combination with a seed value to end up with a deterministic system. But does this actually solve our problem?
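For reference, pinning the sampling behavior looks something like the sketch below, assuming the OpenAI Python SDK (v1+); the model name is a placeholder, the seed parameter is only supported on certain models, and OpenAI describes the result as mostly (not perfectly) deterministic:

```python
# A minimal sketch, assuming the OpenAI Python SDK v1+ and an API key in the
# OPENAI_API_KEY environment variable. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; use a model that supports `seed`
    messages=[{"role": "user", "content": "Give me a muffin recipe."}],
    temperature=0,          # minimize sampling randomness
    seed=42,                # pin the pseudo-random number generation
)

print(response.choices[0].message.content)
```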
User Inputs: In traditional software, the diversity of user inputs (and the resulting pre-conditions for a product feature) is of relatively low complexity. When a user inputs their email address into a field and hits “save,” the system generally isn’t expected to behave differently for different email addresses. When defining the sequence of steps in a Test Case, we don’t have to think about the diversity of all possible inputs for the textbox. In a product feature that leverages Generative AI, though, that isn’t necessarily the case. Instead, the system’s behavior may depend heavily on the exact text in the textbox. The space of all possible inputs to a text box is astronomical, and each possible input can result in different behavior and a different set of outputs. How many different ways can you structure a sentence to ask ChatGPT for a muffin recipe?
Strategies for Handling Model Variations
Model Variations: When building a GenAI product feature, the choice of model to use is a fundamental technical decision, but it’s also one that is likely to change quickly. Given the extreme pace of change of this technology, it’s important that the systems we’re building today work with newer, more capable models that may well be released before your product is. Every new model version will have a different set of weights (and potentially a different architecture), meaning its outputs will differ from those of the previous version. This is true even if you use the same “seed” value mentioned previously to pin the randomness in the model. It’s also important to note that if you’re using a hosted model (e.g., via OpenAI), the model’s weights could change at any time and without notice due to system enhancements on their end or new features. For more information about how OpenAI handles this, see the system_fingerprint field returned in their API responses.
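One practical habit (sketched below, again assuming the OpenAI Python SDK v1+) is logging the system_fingerprint returned with each response, so that if outputs drift even with a pinned seed, you can tell whether the backend configuration changed underneath you:

```python
# A sketch of recording system_fingerprint alongside each response so silent
# backend changes can be detected later. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder
    messages=[{"role": "user", "content": "Give me a muffin recipe."}],
    seed=42,
)

# If this fingerprint changes between runs with the same seed and inputs,
# the backend (and therefore the outputs) may have changed.
print("model:", response.model)
print("system_fingerprint:", response.system_fingerprint)
```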
Embracing Statistical Systems in AI: A Paradigm Shift
So, what to do? It would seem that non-determinism isn’t something we can reasonably avoid when building these systems. My hypothesis is: Embrace the randomness. I have found that many people are extremely put off by the idea of non-determinism in software, and there are good reasons for that. For many systems (e.g., banking, nuclear missiles), inserting non-determinism is a terrible idea. On the other hand, for a system that summarizes last week’s news for me, I think it’s something I can accept. Arguably, an important part of our role as technologists here is to ensure that tools like GenAI aren’t used in systems that could turn cities and bank accounts into vast wastelands. Personally, I find the idea of working with non-deterministic systems exciting and see it as a genuinely interesting challenge, especially given that it’s the statistical nature of these systems that gives them their power. In fact, I’ve been working with human beings my entire life, which I have found to be a huge source of non-determinism, and somehow I’ve gotten by.
A Structured Way of Thinking About Quality
Despite this new form of uncertainty in our software systems, we can still approach quality from a data-driven and quantitative mindset. In fact, the ideas here aren’t new, and they mirror the way Machine Learning engineers have been measuring the quality of their models for ages. I’ll break this down into four steps:
1. Start With User Stories
User Stories are a familiar starting point for most Product Managers and are often part of the documentation that gets produced during the Discovery phase of a new project. They usually go something like this:
As a User, when I click the Save button on the User Profile page, my profile information is successfully saved to the Database.
And the associated Test Case might be structured something like this (with a minimal automated version sketched just after it):
Actor: Application User
Pre-Conditions: User Profile page is open and the form is populated
Data: Valid user profile data
Steps: Click the ‘Save’ button
Post-Conditions: User profile data is in the Database
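In traditional software, that Test Case maps cleanly onto a deterministic, pass/fail automated test. As a purely hypothetical sketch (the in-memory “database” below stands in for whatever persistence layer your application really uses):

```python
# A hypothetical deterministic test: same inputs, same expected outcome, on
# every single run. The in-memory dict stands in for the real database.
DB: dict[str, dict] = {}


def save_profile(profile: dict) -> None:
    """Stand-in for the application's real 'Save' behavior."""
    DB[profile["email"]] = profile


def test_save_user_profile():
    profile = {"email": "jane@example.com", "name": "Jane"}  # valid profile data
    save_profile(profile)                                    # Step: click 'Save'
    assert DB["jane@example.com"] == profile                 # Post-condition holds every time
```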
We can do the same thing for a GenAI system:
As a User, when I ask the LLM for a muffin recipe, I am shown an easily understandable recipe for making muffins and a list of the required ingredients.
Actor: Application User
Pre-Conditions: LLM User Interface is open
Data: A prompt asking for a muffin recipe
Steps: Enter the user’s prompt and hit the ‘Send’ button
Post-Conditions: The user is shown an easily understandable recipe for making muffins and the required ingredients
2. Define Evaluation Metrics for Each User Story: A Balanced Approach
Evaluation Metrics are measurable, quantifiable values that describe the correctness of a system’s output. In traditional software this is akin to the idea of an assertion that generally evaluates to True or False. However, asserting the correctness of the outputs of a GenAI system is often more complex than just checking for data in a database.
In addition to being measurable and quantifiable, every Evaluation Metric must have a “Success Threshold”. This is a tough one, and it gets to the heart of the “works sometimes” nature of AI systems. Part of embracing the randomness is accepting that the system will behave the way you want some percentage of the time, but 100% is highly unlikely. It is up to the Product Owner to define what success looks like, and these thresholds become a gate to production deployments. This is very similar to how, in Computer Vision, a trained model is evaluated on a test dataset to determine its accuracy. The business defines what accuracy is considered acceptable, and new releases are blocked until that rate is achieved.
Here are a few we could use for this fictional Muffin Recipe LLM™ I’ve invented:
A. The system-produced text falls within the desired length bounds 98% of the time.
B. The system-produced text scores at least 70 on the Flesch Reading Ease scale 90% of the time.
C. The system-produced text includes a recipe for muffins 95% of the time.
D. The system-produced text includes a list of ingredients appropriate for the given recipe 90% of the time.
E. The system-produced text does not include irrelevant information 98% of the time.
Ok, nice! We can be fairly confident that if the output checks all of these boxes, then it works well enough to deploy to production. When defining Evaluation Metrics, it’s important to think through how each will be measured. The first two on our list, A and B, can be automated and easily measured. For the third one, C, we might be able to use an embeddings-based approach to check whether the generated text falls within the bounds of our known muffin recipes, but we would need to do some testing to make sure it works reliably. The last two, D and E, would be pretty tough to automate reliably, and we might need to use human reviewers. Each metric has its own measurement difficulty, and it’s up to the team to decide which ones are worth the implementation tradeoffs and when to implement them.
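To make the automatable metrics concrete, here is a minimal sketch of how A and B might be measured. It assumes the third-party textstat package for readability scoring, and the length bounds are arbitrary placeholders rather than recommendations:

```python
# Minimal checks for metrics A (length bounds) and B (readability), assuming
# the third-party `textstat` package. Bounds and thresholds are placeholders.
import textstat

MIN_CHARS, MAX_CHARS = 200, 2000  # placeholder length bounds


def within_length_bounds(text: str) -> bool:
    """Metric A: the generated text falls within the desired length bounds."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def is_easily_readable(text: str) -> bool:
    """Metric B: the generated text scores at least 70 on Flesch Reading Ease."""
    return textstat.flesch_reading_ease(text) >= 70
```

Metric C could plausibly be approximated by embedding the generated text and comparing it against embeddings of known muffin recipes with a similarity cutoff, but as noted above, that approach needs its own validation before you trust it.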
3. Build a (Living) Evaluation Dataset for AI Products
Your Evaluation Dataset is a living collection of inputs you will use to generate your Evaluation Metrics. It is living in the sense that it will grow over time, and bigger is better. The best inputs to use for evaluating the correctness of your application are inputs from your users and the absolute craziness they will come up with. If you recall from our earlier discussion about randomness, user inputs are the greatest source of non-determinism, and capturing this diversity in your dataset is important.
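As a sketch of what “living” might look like in practice, the dataset could be as simple as a JSONL file that records each input and where it came from, so it can grow every time a user surprises you (the file name and schema here are illustrative only):

```python
# An illustrative "living" evaluation dataset stored as JSONL, one input per
# line, tagged with its source so user-provided and synthetic inputs can be
# tracked separately.
import json
from pathlib import Path

DATASET_PATH = Path("eval_dataset.jsonl")  # placeholder location


def add_example(prompt: str, source: str) -> None:
    """Append a new input (e.g. source='user' or source='synthetic')."""
    with DATASET_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "source": source}) + "\n")


def load_dataset() -> list[dict]:
    """Load every recorded input for use in an evaluation run."""
    with DATASET_PATH.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```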
4. Leverage User Feedback and Data for Continuous Improvement
Two recommendations here:
Capture user feedback: Adding a thumbs-up/down button is a simple way to find problematic inputs your application is not handling well. The inputs behind thumbs-down votes get added to the Evaluation Dataset.
Consider Synthetic Data: Early in development, before you have a deployed system, it may be difficult to compile a large enough dataset of human-generated inputs to give you confidence in your Evaluation Metrics. Synthetic inputs can be generated by other models (e.g., ChatGPT) and can make your evaluation more robust. Use this approach carefully, though, as synthetic data may not faithfully represent the unique distribution of your users’ inputs and should ideally be phased out over time. A rough sketch of generating synthetic prompts follows below.
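Here is what that synthetic-data bootstrap might look like, again assuming the OpenAI Python SDK (v1+); the model name and instruction text are placeholders, and the generated prompts should be reviewed before they land in your Evaluation Dataset:

```python
# A sketch of generating synthetic evaluation prompts with an LLM. The model
# name and the instruction text are placeholders; review the output before
# adding it to your evaluation dataset.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{
        "role": "user",
        "content": "Write 10 different ways a user might ask a chatbot for a muffin recipe, one per line.",
    }],
)

synthetic_prompts = [
    line.strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
print(synthetic_prompts)
```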
Case Study: Crafting a Quality AI Muffin Recipe Generator
Ok, so let’s put this all together for my muffin recipe generator (with a sketch of an end-to-end evaluation run after the dataset):
User Story:
As a User, when I ask the LLM for a muffin recipe, I am shown an easily understandable recipe for making muffins along with a list of the required ingredients.
Test Case:
Actor: Application User
Pre-Conditions: LLM User Interface is open
Data: A prompt asking for a muffin recipe
Steps: Enter the user’s prompt and hit the ‘Send’ button
Post-Conditions: The user is shown an easily understandable recipe for making muffins and the required ingredients
Evaluation Metrics:
A. The system-produced text falls within the desired length bounds 98% of the time.
B. The system-produced text scores at least 70 on the Flesch Reading Ease scale 90% of the time.
C. The system-produced text includes a recipe for muffins 95% of the time.
D. The system-produced text includes a list of ingredients appropriate for the given recipe 90% of the time.
E. The system-produced text does not include irrelevant information 98% of the time.
Evaluation Dataset (thanks ChatGPT):
Could you guide me on a baking journey by sharing a unique muffin recipe?
I'm looking for a wholesome and nutritious muffin recipe. Do you have any suggestions?
I love exploring global cuisines. Do you have a muffin recipe from a specific country or culture?
Do you have a muffin recipe that's suitable for [gluten-free/vegan/keto] dietary preferences?
What's a good muffin recipe that incorporates seasonal ingredients for spring?
I'm craving something sweet! Could you recommend a decadent muffin recipe?
I'm baking with kids and need a simple, fun muffin recipe. Any ideas?
I enjoy the aromas of herbs and spices. Could you suggest a muffin recipe that uses aromatic ingredients?
I'm in the mood for savory flavors. Do you have a savory muffin recipe to share?
I'm fascinated by historical baking. Do you know of any traditional or vintage muffin recipes?
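Putting the pieces together, a bare-bones evaluation run might look something like the sketch below. Everything in it is illustrative: generate_recipe stands in for whatever model call your product actually makes, only the two easily automated metrics (A and B) are scored, and the remaining metrics would come from embedding checks or human review:

```python
# An illustrative end-to-end evaluation run for the fictional Muffin Recipe
# LLM. `generate_recipe` is a placeholder for the real model call; only the
# automatable metrics (A and B) are scored here.
import textstat

EVAL_PROMPTS = [
    "Could you guide me on a baking journey by sharing a unique muffin recipe?",
    "I'm looking for a wholesome and nutritious muffin recipe. Do you have any suggestions?",
    # ...the rest of the living Evaluation Dataset...
]

THRESHOLDS = {"length": 0.98, "readability": 0.90}  # Success Thresholds for A and B
MIN_CHARS, MAX_CHARS = 200, 2000                    # placeholder length bounds


def generate_recipe(prompt: str) -> str:
    """Placeholder for the actual model call (e.g. an OpenAI chat completion)."""
    raise NotImplementedError


def evaluate() -> bool:
    """Return True only if every automated metric meets its Success Threshold."""
    outputs = [generate_recipe(p) for p in EVAL_PROMPTS]
    pass_rates = {
        "length": sum(MIN_CHARS <= len(o) <= MAX_CHARS for o in outputs) / len(outputs),
        "readability": sum(textstat.flesch_reading_ease(o) >= 70 for o in outputs) / len(outputs),
    }
    for metric, rate in pass_rates.items():
        print(f"{metric}: {rate:.0%} (threshold {THRESHOLDS[metric]:.0%})")
    return all(pass_rates[m] >= t for m, t in THRESHOLDS.items())
```

If evaluate() comes back False, the release is blocked until the pass rates meet the thresholds the Product Owner signed off on, which is exactly the gate described in step 2.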
The Future of AI Products: Adapting to Change, Embracing Randomness
If you made it this far, you deserve an AI-generated muffin. Embrace the randomness and the muffin.
A muffin for you!