synthetic data works. we should generate all the possible combinations.
combination of what (for our own case)?
the critic (an actual user, like those who came up with the ground truth commit, or dax) should rate the results of those combinations as good/bad. so we have an idea of what's acceptable and what's not.
just a pass/fail decision. no complex scoring scale.
a detailed rationale is required along with the rating to understand the nuance and depth behind it.
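a minimal sketch of what each collected critique could look like, assuming one flat record per rated output (the field names are hypothetical, not a settled schema):

```python
# minimal sketch of the critique record we'd collect per rated output;
# field names are assumptions, not a settled schema.
from dataclasses import dataclass

@dataclass
class Critique:
    task_id: str     # which commit / ground-truth task this belongs to
    output_id: str   # which generated output was rated
    verdict: bool    # True = pass, False = fail (no numeric scale)
    rationale: str   # required: why it passed/failed, deal breakers, areas of improvement
    reviewer: str    # the actual user, e.g. whoever seeded the ground truth
```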
a commit already has a ground truth in our own case. so the critics would see similar outputs (generated by synthetic prompts) and rate them based on deviation from the ground truth, README changes, and how much tolerance they have for the generated output compared to the ground truth.
a few inspirations: https://hamel.dev/blog/posts/llm-judge/#examples-of-good-critiques
passes should capture the reason for passing and areas of improvement.
fails should capture the reason and criteria that led to the failure.
the critique should be detailed enough that you can use it in a few-shot prompt for an LLM judge.
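a rough sketch of how those critiques could be folded into a few-shot judge prompt; the prompt wording and the critique fields are assumptions:

```python
# hypothetical sketch: turning collected critiques into few-shot examples
# for an LLM judge prompt.
def build_judge_prompt(critiques, task, candidate_output):
    examples = []
    for c in critiques:
        examples.append(
            "<example>\n"
            f"output: {c['output']}\n"
            f"verdict: {'pass' if c['verdict'] else 'fail'}\n"
            f"critique: {c['rationale']}\n"
            "</example>"
        )
    return (
        "You are judging whether a generated commit fulfils the task.\n"
        "Use the expert examples below to calibrate what counts as a pass or a fail.\n\n"
        + "\n\n".join(examples)
        + f"\n\ntask: {task}\noutput: {candidate_output}\n"
        "Answer with pass or fail, followed by a detailed critique."
    )
```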
a numeric scoring system is flawed. no one knows what a 3 or a 4 means or what they represent exactly.
we should vibe-code a simple app that helps the critics review the outputs and record their ratings.
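one way such an app could look, sketched with streamlit; load_pending and save_critique are hypothetical helpers and the layout is just a guess:

```python
# rough sketch of the review app idea using streamlit;
# the data-loading helpers are hypothetical.
import streamlit as st

item = load_pending()  # next unrated output for this reviewer (hypothetical helper)

st.title("critique review")
st.subheader("ground truth diff")
st.code(item["ground_truth_diff"], language="diff")
st.subheader("generated output")
st.code(item["generated_diff"], language="diff")

verdict = st.radio("verdict", ["pass", "fail"])
rationale = st.text_area("detailed rationale (required)")

if st.button("submit") and rationale.strip():
    save_critique(item["id"], verdict == "pass", rationale)  # hypothetical helper
```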
to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs. https://dl.acm.org/doi/pdf/10.1145/3654777.3676450
here's an example of a good judge prompt with critique examples, which is not something we do at all. https://hamel.dev/blog/posts/llm-judge/#start-with-expert-examples
this approach is not something gosu or cobra took, so it's a hard pill we have to swallow anyway. it'd help us come to a conclusion about what a good agent is and, at the same time, build stable and fair evaluations.
a mistake i made in the semantic-similarity prompt is that i did not provide the context of the prompts generated by the planner. i'll test that and see if it improves the behaviour.
Not providing external context. Your examples should contain the same information you use to evaluate, including external information like user metadata, system information etc.
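a sketch of what adding the planner context to the semantic-similarity prompt could look like; the template wording and placeholder names are assumptions:

```python
# hypothetical prompt template that carries the planner context alongside
# the two diffs being compared.
SEMANTIC_SIMILARITY_PROMPT = """\
You are comparing a generated diff against the ground truth diff for the same task.

planner context (the prompts that produced the generated diff):
{planner_prompts}

ground truth diff:
{ground_truth_diff}

generated diff:
{generated_diff}

Judge whether the generated diff is semantically equivalent to the ground truth.
"""
```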
After you have created an LLM as a judge, you will have a dataset of user interactions with the AI, and the LLM’s judgments. If your metrics show an acceptable agreement between the domain expert and the LLM judge, you can apply this judge against real or synthetic interactions. After this, you can calculate error rates for different dimensions of your data. You should calculate the error on unseen data only to make sure you aren’t getting biased results.
similar to this, we can talk to the owner of each eval once in a while, ask for their thoughts on the judgements made by the llm judge and the corresponding output generated by the agent, and see how much they agree with the critique. that's how we come up with an error/success rate.
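a minimal sketch of how that agreement and the error rates could be computed on held-out items, assuming each item carries a boolean verdict from both the expert and the judge:

```python
# minimal sketch: expert-vs-judge agreement on an unseen set.
# each item is assumed to have boolean "expert_pass" and "judge_pass" fields.
def agreement_rate(items):
    matches = sum(1 for it in items if it["expert_pass"] == it["judge_pass"])
    return matches / len(items)

def error_breakdown(items):
    # false pass: judge passed something the expert failed (usually the costlier error)
    false_pass = sum(1 for it in items if it["judge_pass"] and not it["expert_pass"])
    false_fail = sum(1 for it in items if not it["judge_pass"] and it["expert_pass"])
    return {
        "false_pass_rate": false_pass / len(items),
        "false_fail_rate": false_fail / len(items),
    }
```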
Now that you have a sense for where the problems in your AI are, you can decide where and if to invest in more targeted LLM judges. For example, if you find that the AI has trouble with citing sources correctly, you can create a targeted eval for that. You might not even need an LLM judge for some errors (and use a code-based assertion instead).
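for illustration, a code-based assertion in the spirit of that suggestion, using a hypothetical check that the generated diff touches the README whenever the ground truth commit did:

```python
# hypothetical code-based assertion: no LLM judge needed.
# passes if the generated diff updates the README whenever the ground truth did.
def readme_assertion(ground_truth_diff: str, generated_diff: str) -> bool:
    gt_touches_readme = "README" in ground_truth_diff
    gen_touches_readme = "README" in generated_diff
    return gen_touches_readme or not gt_touches_readme
```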
another idea. the current semantic-similarity score rates between 0 and 1. we can divide it into multiple semantic-similarity-* judges that each return a yes/no; by aggregating those binary responses we can produce a rate that behaves like a continuous number.
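a sketch of that aggregation, assuming each semantic-similarity-* judge is a callable returning a boolean:

```python
# sketch: aggregate several yes/no semantic-similarity-* judges into one rate.
def aggregate_semantic_similarity(output, judges):
    # judges: list of callables, each returning True (pass) or False (fail)
    verdicts = [judge(output) for judge in judges]
    return sum(verdicts) / len(verdicts)  # e.g. 3 of 4 passes -> 0.75
```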
a catch-22 situation. to grade outputs, people need to externalize and define their evaluation criteria. however, the process of grading outputs helps them to define those very criteria.
criteria drift. we have to grade the outputs ourselves so we can define the criteria for an LLM judge.
auditing LLM outputs is challenging due to the human tendency toward over-generalization and over-reliance.
what strikes me right now is that the paper misses the ground truth aspect of our benchmark. there, the ground truth is what the user grades first and then builds criteria from, but in our own case the ground truth diff is already defined by what the developer actually shipped. so i do not know whether it helps, in our case, to show users a few samples to grade and then have them come up with criteria.
one takeaway is selectivity. it measures how often a judge marks an output as a pass rather than a failure. for instance, if a judge's selectivity is low, meaning it flags most outputs as failing, then a pass from it carries real signal, and vice versa. but a judge that always passes or always fails is noise.
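selectivity then reduces to a pass rate per judge; a minimal sketch:

```python
# sketch of a selectivity metric: the fraction of outputs a judge passes.
# a low-selectivity judge's rare passes carry more signal; ~0.0 or ~1.0 is noise.
def selectivity(judge_verdicts):
    return sum(judge_verdicts) / len(judge_verdicts)  # verdicts are booleans
```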
sampling. choosing which evaluations the user should grade in order to identify the criteria. choosing randomly is not helpful since the sample won't represent what we're unsure (or sure) about. any non-random sampling is a win; just don't do it randomly.
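one possible non-random strategy (an assumption, not the post's prescription) is to prioritize the outputs we're least sure about, e.g. those whose semantic-similarity score sits closest to the middle:

```python
# hypothetical uncertainty-based sampling: surface the k outputs whose
# semantic-similarity score is closest to 0.5, i.e. the least certain ones.
def sample_uncertain(items, k):
    # items: list of dicts with a "similarity" score in [0, 1]
    return sorted(items, key=lambda it: abs(it["similarity"] - 0.5))[:k]
```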
the post notes that ground truth is hard to write. for our own case, it's easy, since that's how our dataset is built. but i agree with the argument that it's hard to get when there are multiple answers/truths to the matter, because we only have one ideal solution per problem given the way we use each commit as a reference. we can give those who seed us data a voting seat on their own data: they check what agents produced for each task and vote whether it's good or bad, even if it deviates from the ground truth (e.g. the actual commit diff). this way we can capture what's a deal breaker for users and what is not.
as mentioned in the post, grading notes can be brought into the dataset from what the users submit, to resolve ambiguities where possible.