
Implement pixie.wrap() API and eval process redesign #11

Merged
yiouli merged 8 commits into main from copilot/implement-specifications-redesign on Apr 6, 2026

Copilot AI commented Apr 6, 2026

Implements the wrap() data-oriented observation API and the corresponding eval runner, registry, and serialization changes specified in specs/wrap-api-and-eval-redesign.md.

Core wrap() API

New pixie.wrap() function with three behavior modes:

# No-op (default production) — zero overhead
result = pixie.wrap(fetch_profile(user_id), purpose="input", name="user_profile")

# Eval mode (registry active) — injects dependency, captures output
profile = pixie.wrap(fetch_profile, purpose="input", name="user_profile")  # returns injected value
response = pixie.wrap(llm_response, purpose="output", name="response")     # captured for evaluators

# Tracing mode (PIXIE_TRACING=1) — emits OTel event + writes to trace file

New files

  • pixie/instrumentation/wrap_registry.py: ContextVar-based input/capture registries with separate output and state registries; concurrent async eval runs get isolated state
  • pixie/instrumentation/wrap_serialization.py: jsonpickle encode/decode helpers (type-preserving, not plain JSON)
  • pixie/instrumentation/wrap.py: wrap() + WrapRegistryMissError / WrapTypeMismatchError
  • pixie/instrumentation/wrap_log.py: WrapLogEntry pydantic model + load_wrap_log_entries() and filter_by_purpose() utilities for loading JSONL trace data into typed objects
  • pixie/instrumentation/trace_writer.py: thread-safe JSONL writer for wrap events and LLM spans

Config changes

Two new env vars read into PixieConfig:

  • PIXIE_TRACING — enables tracing mode (1/true/yes/on)
  • PIXIE_TRACE_OUTPUT — path for JSONL trace output file
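
As an illustration of how these two variables might be parsed, here is a sketch that takes the field names `tracing_enabled` and `trace_output` from the commit message. The class name, the function name, and the case-insensitive matching of truthy values are assumptions, not the actual PixieConfig code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Accepted truthy spellings as listed above; case-insensitivity is assumed.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TracingConfigSketch:
    tracing_enabled: bool
    trace_output: Optional[str]


def read_tracing_config(env: Mapping[str, str] = os.environ) -> TracingConfigSketch:
    """Read the two tracing settings from an environment-like mapping."""
    raw = env.get("PIXIE_TRACING", "")
    return TracingConfigSketch(
        tracing_enabled=raw.strip().lower() in _TRUTHY,
        # An empty string is treated as "not set" in this sketch.
        trace_output=env.get("PIXIE_TRACE_OUTPUT") or None,
    )
```

Passing the mapping explicitly, for example `read_tracing_config({"PIXIE_TRACING": "yes"})`, keeps the helper easy to test without touching the real environment.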

Trace writer integration

enable_storage() creates a TraceFileWriter when both vars are set. wrap() and LLMSpanProcessor.on_end() write to it when active.
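
A thread-safe JSONL writer can be as small as a lock around an append. The sketch below is one way to do it under that assumption; `JsonlWriterSketch` is a hypothetical name and does not reproduce TraceFileWriter.

```python
import json
import tempfile
import threading
from pathlib import Path
from typing import Any, Dict


class JsonlWriterSketch:
    """Append one JSON object per line; a lock keeps lines from interleaving."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._lock = threading.Lock()

    def write(self, record: Dict[str, Any]) -> None:
        # Serialize outside the lock, write inside it.
        line = json.dumps(record, ensure_ascii=False)
        with self._lock:
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(line + "\n")


# Demo: four threads append 25 records each to the same file.
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.jsonl"
    writer = JsonlWriterSketch(str(trace_path))

    def worker(worker_id: int) -> None:
        for i in range(25):
            writer.write(
                {"type": "wrap", "name": f"w{worker_id}", "purpose": "state", "data": i}
            )

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every line must parse on its own, which fails if writes interleave.
    records = [
        json.loads(line)
        for line in trace_path.read_text(encoding="utf-8").splitlines()
    ]
```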

OTel event emission

_emit_wrap_event() adds events to the active span when one exists. When no recording span is active (the common case for wrap() calls outside OTel-instrumented code), it falls back to the Python logger, which is captured by OTel log exporters if configured.
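
The fallback logic can be sketched without importing OpenTelemetry by coding against the two span methods it relies on, `is_recording()` and `add_event()`, which the OTel Python Span API provides. In real code the span would come from `opentelemetry.trace.get_current_span()`, which returns a non-recording span when none is active. The event name, attribute keys, `RecordingSpanStub`, and the return value are hypothetical and exist only for illustration.

```python
import logging
from typing import Any, Dict, List, Optional, Protocol, Tuple

logger = logging.getLogger("pixie.wrap.sketch")  # hypothetical logger name


class SpanLike(Protocol):
    """The subset of the OTel Span interface this sketch depends on."""

    def is_recording(self) -> bool: ...

    def add_event(self, name: str, attributes: Dict[str, Any]) -> None: ...


def emit_wrap_event_sketch(
    span: Optional[SpanLike], name: str, purpose: str, payload: str
) -> str:
    """Attach the event to a recording span, else fall back to logging."""
    attributes = {"wrap.name": name, "wrap.purpose": purpose, "wrap.data": payload}
    if span is not None and span.is_recording():
        span.add_event("pixie.wrap", attributes)
        return "span"
    # No recording span: a configured log exporter can still pick this up.
    logger.info("pixie.wrap %s", attributes)
    return "log"


class RecordingSpanStub:
    """Test double that records events like an active span would."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, Any]]] = []

    def is_recording(self) -> bool:
        return True

    def add_event(self, name: str, attributes: Dict[str, Any]) -> None:
        self.events.append((name, attributes))
```

Calling `emit_wrap_event_sketch(None, "response", "output", '"hi"')` takes the logging path, which is the case the review comment was concerned about.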

Dataset format

Dataset eval_input is an array of wrap log entry objects matching the JSONL trace file format exactly. Each entry has type, name, purpose, and data fields:

{
  "eval_input": [
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {"id": "C001", "name": "Alice"}}
  ],
  "expected_output": "Hello Alice! How can I help you?"
}

The WrapLogEntry pydantic model in pixie/instrumentation/wrap_log.py provides the shared typed representation used by both the trace filter CLI and the dataset/test runner.
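
To make the shared representation concrete, the following sketch parses JSONL lines into typed entries and filters them by purpose. It uses a stdlib dataclass instead of pydantic so that it runs without dependencies; the names `WrapLogEntrySketch`, `parse_entries`, and `filter_by_purpose_sketch` are hypothetical, and skipping non-wrap records (here a made-up `llm_span` type) is an assumption about how mixed trace files are handled.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, List

VALID_PURPOSES = {"entry", "input", "output", "state"}


@dataclass(frozen=True)
class WrapLogEntrySketch:
    type: str
    name: str
    purpose: str
    data: Any


def parse_entries(lines: Iterable[str]) -> List[WrapLogEntrySketch]:
    """Parse JSONL lines, keeping only well-formed wrap records."""
    entries: List[WrapLogEntrySketch] = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        raw = json.loads(line)
        if raw.get("type") != "wrap":
            continue  # assumed: other record types share the file
        if raw["purpose"] not in VALID_PURPOSES:
            raise ValueError(f"unknown purpose: {raw['purpose']!r}")
        entries.append(
            WrapLogEntrySketch(raw["type"], raw["name"], raw["purpose"], raw["data"])
        )
    return entries


def filter_by_purpose_sketch(
    entries: Iterable[WrapLogEntrySketch], purposes: Iterable[str]
) -> List[WrapLogEntrySketch]:
    wanted = set(purposes)
    return [entry for entry in entries if entry.purpose in wanted]


sample_lines = [
    '{"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"}',
    "",
    '{"type": "llm_span", "model": "example"}',
    '{"type": "wrap", "name": "response", "purpose": "output", "data": "Hi"}',
]
parsed = parse_entries(sample_lines)
inputs_only = filter_by_purpose_sketch(parsed, ["entry", "input"])
```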

Evaluable model

  • eval_input holds the array of WrapLogEntry-shaped dicts
  • get_wrap_inputs() helper parses and validates the wrap log entry format
  • captured_output: dict[str, JsonValue] | None — values from wrap(purpose="output")
  • captured_state: dict[str, JsonValue] | None — values from wrap(purpose="state")

Test runner

_run_entry detects wrap-mode datasets via get_wrap_inputs(). In wrap mode, entry-purpose items are aggregated by name into a dict passed to the runnable, and input-purpose items populate the input registry. Separate output and state capture registries ensure eval_output reflects only purpose="output" data.
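
The split performed in wrap mode can be sketched as follows. `split_wrap_inputs` is a hypothetical helper, not part of `_run_entry`, and letting a later item overwrite an earlier one with the same name is an assumption.

```python
from typing import Any, Dict, List, Tuple


def split_wrap_inputs(
    eval_input: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (arguments for the runnable, contents of the input registry)."""
    entry_kwargs: Dict[str, Any] = {}
    input_registry: Dict[str, Any] = {}
    for item in eval_input:
        if item["purpose"] == "entry":
            entry_kwargs[item["name"]] = item["data"]
        elif item["purpose"] == "input":
            input_registry[item["name"]] = item["data"]
        # Other purposes are not inputs and are ignored here.
    return entry_kwargs, input_registry


eval_input = [
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {
        "type": "wrap",
        "name": "customer_profile",
        "purpose": "input",
        "data": {"id": "C001", "name": "Alice"},
    },
]
entry_kwargs, input_registry = split_wrap_inputs(eval_input)
```

For the dataset example from the description, the runnable would receive `{"user_message": "Hello"}` and the registry would hold the `customer_profile` object for injection.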

CLI: pixie trace filter

uv run pixie trace filter trace.jsonl --purpose entry,input

Filters a JSONL trace file using WrapLogEntry model for typed parsing; outputs matching entries as JSON.
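
For orientation, a comma-separated `--purpose` option like the one above can be handled with argparse as in this sketch. It does not reproduce the pixie CLI: the program name, `build_parser`, and `filter_lines` are hypothetical, and returning plain dicts rather than WrapLogEntry objects is a simplification.

```python
import argparse
import json
from typing import Any, Dict, Iterable, List


def _split_purposes(value: str) -> List[str]:
    # "entry,input" -> ["entry", "input"]
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="trace-filter-sketch")
    parser.add_argument("trace_file")
    parser.add_argument("--purpose", required=True, type=_split_purposes)
    return parser


def filter_lines(lines: Iterable[str], purposes: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep wrap records whose purpose is in the requested set."""
    wanted = set(purposes)
    kept: List[Dict[str, Any]] = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") == "wrap" and record.get("purpose") in wanted:
            kept.append(record)
    return kept


args = build_parser().parse_args(["trace.jsonl", "--purpose", "entry,input"])
matches = filter_lines(
    [
        '{"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"}',
        '{"type": "wrap", "name": "response", "purpose": "output", "data": "Hi"}',
    ],
    args.purpose,
)
```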

Manual test fixtures

  • tests/manual/chatbot.py — CLI chatbot with fake database lookup (tool calling), using wrap() for every purpose type (entry, input, output, state) on both data values and callable providers. The chat() entry point serves as the runnable for evaluation.
  • tests/manual/verify_wrap_e2e.py — End-to-end verification script that:
    1. Runs the chatbot with tracing enabled and trace file configured
    2. Validates trace log content (all purpose types, expected wrap names)
    3. Creates a dataset from filtered trace log (purpose=entry/input as eval_input, purpose=output for expected_output)
    4. Runs pixie test on the generated dataset
    5. Validates test results (all 3 conversations pass at 1.00 score)
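
Step 3 of the script, deriving a dataset item from a filtered trace, might look roughly like the sketch below. `build_dataset_item` is a hypothetical helper; taking the last output-purpose entry as `expected_output` is an assumption, and the remaining dataset fields (runnable, evaluators, description) are omitted.

```python
from typing import Any, Dict, List


def build_dataset_item(entries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn one conversation's wrap log entries into a dataset item."""
    # purpose=entry/input entries become eval_input, unchanged.
    eval_input = [e for e in entries if e["purpose"] in ("entry", "input")]
    # Assumed: the final output entry is the expected answer.
    outputs = [e["data"] for e in entries if e["purpose"] == "output"]
    return {
        "eval_input": eval_input,
        "expected_output": outputs[-1] if outputs else None,
    }


trace_entries = [
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {
        "type": "wrap",
        "name": "customer_profile",
        "purpose": "input",
        "data": {"id": "C001", "name": "Alice"},
    },
    {
        "type": "wrap",
        "name": "response",
        "purpose": "output",
        "data": "Hello Alice! How can I help you?",
    },
]
dataset_item = build_dataset_item(trace_entries)
```

The resulting item has the same shape as the dataset example in the description: two `eval_input` entries and the greeting as `expected_output`.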
Original prompt

implement according to spec

The user has attached the following file paths as relevant context:

  • specs/wrap-api-and-eval-redesign.md
  • .github/copilot-instructions.md
  • CLAUDE.md

Created from VS Code.

Copilot AI and others added 5 commits April 6, 2026 04:33
- Add pixie/instrumentation/wrap_registry.py: context-var registries
  for input injection (_input_registry) and output capture
  (_capture_registry) with set/get/clear/init helpers

- Add pixie/instrumentation/wrap_serialization.py: jsonpickle-based
  serialize_wrap_data() / deserialize_wrap_data() helpers

- Add pixie/instrumentation/wrap.py: wrap() function with three modes:
  - No-op: returns data unchanged (default, no overhead)
  - Tracing (PIXIE_TRACING=1): emits OTel span events
  - Eval (input registry active): injects inputs, captures outputs/state
  WrapRegistryMissError raised when input key missing in eval mode.

- Add pixie/instrumentation/trace_writer.py: thread-safe JSONL trace
  file writer (TraceFileWriter) for wrap events and LLM spans

- Update pixie/config.py: add trace_output and tracing_enabled fields
  to PixieConfig; read PIXIE_TRACE_OUTPUT and PIXIE_TRACING env vars

- Update pixie/instrumentation/__init__.py and pixie/__init__.py:
  export wrap, WrapRegistryMissError, WrapTypeMismatchError and all
  registry helper functions

- Add tests: test_wrap_registry, test_wrap_serialization, test_wrap,
  test_config_tracing (36 new tests, 676 total passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>
- Add module docstring to wrap_registry.py
- Fix 'behaviour' -> 'behavior' spelling in wrap.py module docstring
- Remove unused 'os' import in test_wrap.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement according to specifications for API and eval redesign" to "Implement pixie.wrap() API and eval process redesign" on Apr 6, 2026
Copilot AI requested a review from yiouli April 6, 2026 04:48

yiouli commented Apr 6, 2026

@copilot You are making the wrong assumption of the dataset format.

The new dataset format has eval_input (app_input) in exactly the same format as the trace log: an array of JSON objects, one per wrap-logged data object whose purpose is either entry or input.

There's no backward compatibility; your implementations for getting the entry-point input and the dependency input are both wrong.

Implementation-wise, because multiple places need to load the JSONL (wrap-logged objects), you need to create a common utility for loading that data into typed objects (probably a pydantic model).

Again, because we don't need backward compatibility, you should just update the Evaluable to hold list[LoggedData] directly as its eval input.

Also, because there's likely not going to be a parent span when wrap is called, you need to log the wrap events with the OTel event/logging API, which still works when there's no parent span. I'm not sure your current implementation actually logs the events when there's no parent span.

And finally, you need to revamp the manual test fixture under tests/manual/, it should include:

  • a basic CLI chatbot implementation with tool calling (fake database lookup); multiple wrap calls should be placed at reasonable points in the app and cover every purpose type, applied both to a data value and to a data provider function. The entry point of the CLI chatbot should be usable as the runnable for evaluation.

  • the same mock evaluators

  • a script that performs the following steps and verifies that each one works as expected:

    1. run the runnable with tracing enabled and a trace file configured
    2. load the trace log file and validate that its content is as expected
    3. create the dataset directly from the filtered trace log (array of JSON objects, purpose=entry/input) as the eval input, with expected_output derived from the filtered trace log (purpose=output), and configure the other fields properly (runnable, evaluators, description, etc.)
    4. call the pixie test command on the dataset
    5. verify the test result is as expected

The script should print a detailed error message if any of its validations fails. You need to use that script to verify that your code is working.

yiouli marked this pull request as ready for review April 6, 2026 05:23
Copilot AI and others added 2 commits April 6, 2026 05:33

Copilot AI commented Apr 6, 2026

@copilot You are making the wrong assumption of the dataset format.

The new dataset format would have the eval_input (app_input) in the exact same format of the trace logging - arra...

Addressed all the feedback in commits 95681f5 and 0d03304:

  1. Dataset format: eval_input is now an array of wrap log entry objects matching the trace file format exactly. Removed entry_input/dependency_input fields.

  2. Common utility: Created pixie/instrumentation/wrap_log.py with WrapLogEntry pydantic model + load_wrap_log_entries() and filter_by_purpose() — shared by trace filter CLI and dataset loader.

  3. Evaluable: Added get_wrap_inputs() to parse eval_input as list[WrapLogEntry]. Split capture registries into output/state so eval_output gets the correct data.

  4. OTel logging: _emit_wrap_event() falls back to Python logger (picked up by OTel log exporters if configured) when no active span exists.

  5. Manual test fixtures:

    • tests/manual/chatbot.py — CLI chatbot with fake DB lookup, wrap() on every purpose type (entry, input, output, state), both data values and callables
    • tests/manual/verify_wrap_e2e.py — 5-step verification: run with tracing → validate trace → create dataset from filtered trace → pixie test → validate results

All 3 conversations pass at 1.00 score. 676 unit tests pass, mypy/ruff clean.

yiouli merged commit 10b32f0 into main on Apr 6, 2026
yiouli deleted the copilot/implement-specifications-redesign branch April 6, 2026 15:28