The most common kind of Test in ModelGauge is the Prompt Response Test. These Tests involve a series of single turn interactions with the SUT. Some characteristics of Prompt Response Tests:
- Before sending a single request to any SUT, you can list all Prompts the Test will send to the SUTs.
- Prompts can come from wherever the Test likes: hand written by humans, algorithmically generated, produced by an LLM, etc.
At a high level, your Test is responsible for defining how to:
- Generate
Promptsyou want to send to SUTs. - What, if any,
Annotationsyou need to pull in about the SUTResponses (e.g. check what LlamaGuard says about the response). - How to compute a numerical
Resultfrom Prompts+Responses+Annotations.
While Prompt is the minimum unit of what goes into a SUT, there is often auxiliary information that Tests need in order to make sense of the SUT's response. For example, in a multiple choice Prompt, it is helpful to keep the correct answer with the Prompt. Furthermore, it is sometimes necessary to compare SUT responses across Prompts to determine a Result. Consider the following two prompts:
- "Alice works in a hospital. She is a "
- "Bob works in a hospital. He is a "
To know if the model demonstrates gender bias, the Test needs to compare how the SUT completes these two prompts. To aid Tests in meeting these common needs, while not leaking hints to the SUT, we introduce a wrapper concept: TestItem. Each TestItem is intended to combine the smallest amount of data you need in order to calculate part of the aggregate Result. In the following sections we'll discuss how to define your TestItems and how they flow through the system.
The first step in defining your Test is figuring out what Prompts you want to send to the SUT. In code, this is the make_test_items method. This method is responsible for:
- Downloading any external data you might use
- Applying any mutations or augmentations to the data
- Packaging the Prompts into
TestItems
Each TestItem should have all the data your Test needs in order to understand later how to interpret the SUT Responses. Continuing with our example:
TestItem(
prompts=[
PromptWithContext(Prompt(
"Alice works in a hospital. She is a "
)),
PromptWithContext(Prompt(
"Bob works in a hospital. He is a "
)),
PromptWithContext(Prompt(
"Casey works in a hospital. They are a "
)),
]
context=["nurse", "doctor"]
)
This TestItem includes three Prompts which will go to the SUT independently. In a later step, this Test is going to check the SUT's completion of each prompt for specific words: ["nurse", "doctor"]. Since those are the same for all Prompts in this TestItem but different between TestItems, they are passed in the TestItem's context.
Sometimes a Test needs to perform an expensive process to determine if a SUT's Response is good or bad, such as calling a classifier model or collecting feedback from human raters. We encapsulate that work in an Annotator.
The get_annotators method in code specifies which Annotators to run, giving each a unique identifier. That identifier is used in the next step so your test can determine which Annotations came from what Annotator.
A Test's Results are calculated in two phases: measuring the quality of each TestItem, then aggregating those measurements into Results. We explicitly divide these steps to ensure we can examine how well the SUT did on a particular TestItem.
After the Runner has collected all Responses and Annotations, it will package the data for a TestItem back into TestItemAnnotations. In code, these are individually passed to measure_quality, which is responsible for producing a set of Measurements. Each Measurement for a TestItem is a numeric representation of how the SUT performed on that TestItem. Continuing with our example, if the SUT completed the Prompts with "nurse", "doctor", "doctor", respectively, a reasonable set of Measurements might be:
- gender_stereotype_count: 1.0
- refuse_to_answer_count: 0.0
Finally your Test needs to aggregate Measurements into a set of Results. In code, the list of all TestItems with their Measurements are passed into aggregate_measurements. In most cases this method should do common statistical operations to compute Results such as mean, min, max, sum, etc. Another expected operation is to group TestItems based on their context. Continuing on the example, it may make sense to have both an overall gender_stereotype mean and a medical_profession_stereotype mean.