CodeMode support for AI-driven Eval Generation via Pydantic Monty #5
fswair
announced in
Announcements
CodeMode: AI-Driven Eval Generation via Pydantic Monty
Problem
When an LLM generates evaluation specs (test cases with expected values), it guesses what the function should return. This is the single biggest source of eval failures: the LLM calculates `binary_search([1, 3, 5, 7, 9], 5)` in its head and writes `expected: 3` instead of `expected: 2`. The more complex the function, the worse the accuracy. Our measurements show ~58% first-pass accuracy on non-trivial functions. That means nearly half the generated test cases have wrong expected values.
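The arithmetic the model fumbles is easy to check directly: for this sorted list, a correct binary search must return the position of the element, which is one line of Python to verify:

```python
data = [1, 3, 5, 7, 9]

# Any correct binary_search(data, 5) must return the index of 5.
# The LLM guessed 3 from mental arithmetic; the real index is:
print(data.index(5))  # 2
```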
Solution: CodeMode
Instead of guessing, let the LLM run the function and observe the real outputs.
CodeMode is a two-phase pipeline:
Phase 1 — Exploration
The LLM writes small Python snippets that call the target function with various inputs. Each snippet is executed in a sandbox and the real outputs are collected.
The LLM can now see empirically: edge cases, exception boundaries, return types, off-by-one behavior — all from running real code, not mental arithmetic.
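As a sketch of what a Phase 1 exploration pass can look like (the `explore` helper and its probe list are illustrative, and `binary_search` is a standard implementation standing in for the target function):

```python
def explore(target_func):
    """Probe the target with edge-case inputs and record real behavior."""
    probes = [
        ([], 1),               # empty input
        ([1, 3, 5, 7, 9], 1),  # first element
        ([1, 3, 5, 7, 9], 9),  # last element
        ([1, 3, 5, 7, 9], 4),  # absent value
    ]
    findings = []
    for args in probes:
        try:
            findings.append((args, "ok", target_func(*args)))
        except Exception as exc:  # exception boundaries are data too
            findings.append((args, "raises", type(exc).__name__))
    return findings


def binary_search(arr, x):
    """Standard iterative binary search; stands in for the target."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == x:
            return mid
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


for finding in explore(binary_search):
    print(finding)
```

Every tuple printed is an observed fact about the function, so the expected values in Phase 2 need no mental arithmetic at all.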
Phase 2 — Spec Generation
The verified results are fed back to the LLM. It generates the YAML spec using exact outputs, not guesses:
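For instance, a generated spec might look like this (the field names are illustrative, not a fixed schema; the point is that each `expected` value comes from an actual run):

```yaml
function: binary_search
cases:
  - args: [[1, 3, 5, 7, 9], 5]
    expected: 2      # observed by running the function, not guessed
  - args: [[1, 3, 5, 7, 9], 4]
    expected: -1     # verified miss, not assumed
```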
Sandbox: Pydantic Monty
We use `pydantic-monty` as the execution backend: a Rust-based Python interpreter that provides sandboxed, pausable execution (`start → snapshot → resume`) with proper `asyncio` integration.
The key insight: Monty sandboxes the test orchestration but delegates `target_func(...)` calls back to the host. So the real function can use numpy, pandas, anything, while the exploration code itself can't escape the sandbox.
Executor Protocol
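A minimal sketch of the delegation idea, using plain `exec` to stand in for Monty (the helper name and the restricted-namespace trick are assumptions for illustration, not pydantic-monty's API):

```python
def run_exploration(snippet: str, target_func):
    """Run an exploration snippet in a stripped-down namespace.

    Builtins are removed so the snippet can't import modules or touch
    files, while target_func is injected so calls to it run on the host
    with full library access. This mimics the delegation model only;
    Monty's real sandbox is far stronger than an empty __builtins__.
    """
    results = []
    namespace = {
        "__builtins__": {},          # no __import__, no open()
        "target_func": target_func,  # delegated back to the host
        "results": results,
    }
    exec(snippet, namespace)
    return results


def square(n):
    return n * n  # host-side target; free to use numpy, pandas, etc.


outputs = run_exploration("results.append(target_func(-3))", square)
print(outputs)  # [9]
```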
Two backends:
- `MontyExecutor`: sandboxed, production-grade, uses `run_monty_async`
- `DefaultExecutor`: `exec()`-based fallback for development (no sandbox)

Both produce identical results. `get_executor("auto")` picks Monty when available.
How This Strengthens TDD
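A sketch of what the protocol and fallback could look like (only the names `Executor`, `MontyExecutor`, `DefaultExecutor`, `get_executor`, and `run_monty_async` come from the post; the `run` signature is an assumption, and the Monty branch is stubbed so the sketch runs without the package):

```python
from typing import Any, Protocol


class Executor(Protocol):
    def run(self, code: str, env: dict[str, Any]) -> dict[str, Any]:
        """Execute code with env as the starting namespace; return it."""
        ...


class DefaultExecutor:
    """exec()-based fallback for development; no sandboxing at all."""

    def run(self, code: str, env: dict[str, Any]) -> dict[str, Any]:
        namespace = dict(env)
        exec(code, namespace)
        return namespace


def get_executor(kind: str = "auto") -> Executor:
    """Return the Monty backend when importable, else the fallback."""
    if kind in ("auto", "monty"):
        try:
            import pydantic_monty  # noqa: F401
            # A real MontyExecutor would wrap run_monty_async here;
            # omitted so this sketch stays runnable without the package.
        except ImportError:
            pass
    return DefaultExecutor()


executor = get_executor("auto")
ns = executor.run("out = target_func(4)", {"target_func": lambda n: n * n})
print(ns["out"])  # 16
```

Because both backends satisfy the same protocol, the pipeline code never needs to know which one it got.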
vowel's TDD pipeline currently works like this:
The problem in step 2 is accuracy. CodeMode slots in between steps 1 and 2:
This changes the LLM's role from "calculate expected values in your head" to "design good test scenarios, the execution engine will give you the answers."
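The role change can be sketched as a tiny pipeline, with the model proposing scenarios and the execution engine supplying the answers (the function name and spec shape here are illustrative, not `CodeModeGenerator`'s real API):

```python
def codemode_pipeline(target_func, scenarios):
    """Two-phase sketch: run the function to observe real outputs
    (Phase 1), then build the spec from those observations (Phase 2).
    In the real pipeline an LLM designs the scenarios; here they're given.
    """
    # Phase 1: exploration -- every expected value is observed, not guessed
    observations = [(args, target_func(*args)) for args in scenarios]
    # Phase 2: spec generation from the verified results
    return {
        "function": target_func.__name__,
        "cases": [
            {"args": list(args), "expected": out} for args, out in observations
        ],
    }


def clamp(x, lo, hi):
    return max(lo, min(x, hi))


spec = codemode_pipeline(clamp, [(5, 0, 10), (-3, 0, 10), (42, 0, 10)])
print(spec["cases"][1]["expected"])  # 0 (clamped up from -3)
```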
Observed Results
For `binary_search` with Gemini Flash:
Status
- `Executor` protocol + `MontyExecutor` + `DefaultExecutor`
- `CodeModeGenerator` two-phase pipeline
- TDD pipeline integration (`TDDGenerator`)
- (`vowel-optimization`)
Install
Quick Example