CodeMode support for AI-driven Eval Generation via Pydantic Monty #5
fswair
announced in
Announcements
CodeMode: AI-Driven Eval Generation via Pydantic Monty
Problem
When an LLM generates evaluation specs (test cases with expected values), it guesses what the function should return. This is the single biggest source of eval failures: the LLM calculates `binary_search([1, 3, 5, 7, 9], 5)` in its head and writes `expected: 3` instead of `expected: 2`. The more complex the function, the worse the accuracy. Our measurements show ~58% first-pass accuracy on non-trivial functions. That means nearly half the generated test cases have wrong expected values.
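The arithmetic the model fumbles is easy to check directly: for this sorted list, a correct binary search must return the position of the element, which is one line of Python to verify:

```python
data = [1, 3, 5, 7, 9]

# Any correct binary_search(data, 5) must return the index of 5.
# The LLM guessed 3 from mental arithmetic; the real index is:
print(data.index(5))  # 2
```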
Solution: CodeMode
Instead of guessing, let the LLM run the function and observe the real outputs.
CodeMode is a two-phase pipeline:
Phase 1 — Exploration
The LLM writes small Python snippets that call the target function with various inputs. Each snippet is executed in a sandbox and the real outputs are collected.
The LLM can now see empirically: edge cases, exception boundaries, return types, off-by-one behavior — all from running real code, not mental arithmetic.
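As a sketch of what a Phase 1 exploration pass can look like (the `explore` helper and its probe list are illustrative, and `binary_search` is a standard implementation standing in for the target function):

```python
def explore(target_func):
    """Probe the target with edge-case inputs and record real behavior."""
    probes = [
        ([], 1),               # empty input
        ([1, 3, 5, 7, 9], 1),  # first element
        ([1, 3, 5, 7, 9], 9),  # last element
        ([1, 3, 5, 7, 9], 4),  # absent value
    ]
    findings = []
    for args in probes:
        try:
            findings.append((args, "ok", target_func(*args)))
        except Exception as exc:  # exception boundaries are data too
            findings.append((args, "raises", type(exc).__name__))
    return findings


def binary_search(arr, x):
    """Standard iterative binary search; stands in for the target."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == x:
            return mid
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


for finding in explore(binary_search):
    print(finding)
```

Every tuple printed is an observed fact about the function, so the expected values in Phase 2 need no mental arithmetic at all.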
Phase 2 — Spec Generation
The verified results are fed back to the LLM. It generates the YAML spec using exact outputs, not guesses:
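For instance, a generated spec might look like this (the field names are illustrative, not a fixed schema; the point is that each `expected` value comes from an actual run):

```yaml
function: binary_search
cases:
  - args: [[1, 3, 5, 7, 9], 5]
    expected: 2      # observed by running the function, not guessed
  - args: [[1, 3, 5, 7, 9], 4]
    expected: -1     # verified miss, not assumed
```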
Sandbox: Pydantic Monty
We use `pydantic-monty` as the execution backend: a Rust-based Python interpreter that provides sandboxed, pausable execution (`start → snapshot → resume`) with proper `asyncio` integration.
The key insight: Monty sandboxes the test orchestration but delegates `target_func(...)` calls back to the host. So the real function can use numpy, pandas, anything, while the exploration code itself can't escape the sandbox.
Executor Protocol
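A minimal sketch of the delegation idea, using plain `exec` to stand in for Monty (the helper name and the restricted-namespace trick are assumptions for illustration, not pydantic-monty's API):

```python
def run_exploration(snippet: str, target_func):
    """Run an exploration snippet in a stripped-down namespace.

    Builtins are removed so the snippet can't import modules or touch
    files, while target_func is injected so calls to it run on the host
    with full library access. This mimics the delegation model only;
    Monty's real sandbox is far stronger than an empty __builtins__.
    """
    results = []
    namespace = {
        "__builtins__": {},          # no __import__, no open()
        "target_func": target_func,  # delegated back to the host
        "results": results,
    }
    exec(snippet, namespace)
    return results


def square(n):
    return n * n  # host-side target; free to use numpy, pandas, etc.


outputs = run_exploration("results.append(target_func(-3))", square)
print(outputs)  # [9]
```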
Two backends:
- `MontyExecutor`: sandboxed, production-grade, uses `run_monty_async`
- `DefaultExecutor`: `exec()`-based fallback for development (no sandbox)

Both produce identical results. `get_executor("auto")` picks Monty when available.
How This Strengthens TDD
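A sketch of what the protocol and fallback could look like (only the names `Executor`, `MontyExecutor`, `DefaultExecutor`, `get_executor`, and `run_monty_async` come from the post; the `run` signature is an assumption, and the Monty branch is stubbed so the sketch runs without the package):

```python
from typing import Any, Protocol


class Executor(Protocol):
    def run(self, code: str, env: dict[str, Any]) -> dict[str, Any]:
        """Execute code with env as the starting namespace; return it."""
        ...


class DefaultExecutor:
    """exec()-based fallback for development; no sandboxing at all."""

    def run(self, code: str, env: dict[str, Any]) -> dict[str, Any]:
        namespace = dict(env)
        exec(code, namespace)
        return namespace


def get_executor(kind: str = "auto") -> Executor:
    """Return the Monty backend when importable, else the fallback."""
    if kind in ("auto", "monty"):
        try:
            import pydantic_monty  # noqa: F401
            # A real MontyExecutor would wrap run_monty_async here;
            # omitted so this sketch stays runnable without the package.
        except ImportError:
            pass
    return DefaultExecutor()


executor = get_executor("auto")
ns = executor.run("out = target_func(4)", {"target_func": lambda n: n * n})
print(ns["out"])  # 16
```

Because both backends satisfy the same protocol, the pipeline code never needs to know which one it got.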
vowel's TDD pipeline currently works like this:
The problem in step 2 is accuracy. CodeMode slots in between steps 1 and 2:
This changes the LLM's role from "calculate expected values in your head" to "design good test scenarios, the execution engine will give you the answers."
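The role change can be sketched as a tiny pipeline, with the model proposing scenarios and the execution engine supplying the answers (the function name and spec shape here are illustrative, not `CodeModeGenerator`'s real API):

```python
def codemode_pipeline(target_func, scenarios):
    """Two-phase sketch: run the function to observe real outputs
    (Phase 1), then build the spec from those observations (Phase 2).
    In the real pipeline an LLM designs the scenarios; here they're given.
    """
    # Phase 1: exploration -- every expected value is observed, not guessed
    observations = [(args, target_func(*args)) for args in scenarios]
    # Phase 2: spec generation from the verified results
    return {
        "function": target_func.__name__,
        "cases": [
            {"args": list(args), "expected": out} for args, out in observations
        ],
    }


def clamp(x, lo, hi):
    return max(lo, min(x, hi))


spec = codemode_pipeline(clamp, [(5, 0, 10), (-3, 0, 10), (42, 0, 10)])
print(spec["cases"][1]["expected"])  # 0 (clamped up from -3)
```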
Observed Results
For `binary_search` with Gemini Flash:
Status
- `Executor` protocol + `MontyExecutor` + `DefaultExecutor`
- `CodeModeGenerator` two-phase pipeline
- TDD pipeline integration (`TDDGenerator`)
- (`vowel-optimization`)
Install
Quick Example