Implement `pixie.wrap()` API and eval process redesign by Copilot · Pull Request #11 · yiouli/pixie-qa

Copilot · 2026-04-06T04:21:04Z

Implements the wrap() data-oriented observation API and the corresponding eval runner, registry, and serialization changes specified in specs/wrap-api-and-eval-redesign.md.

Core `wrap()` API

New pixie.wrap() function with three behavior modes:

# No-op (default production) — zero overhead
result = pixie.wrap(fetch_profile(user_id), purpose="input", name="user_profile")

# Eval mode (registry active) — injects dependency, captures output
profile = pixie.wrap(fetch_profile, purpose="input", name="user_profile")  # returns injected value
response = pixie.wrap(llm_response, purpose="output", name="response")     # captured for evaluators

# Tracing mode (PIXIE_TRACING=1) — emits OTel event + writes to trace file

New files

pixie/instrumentation/wrap_registry.py — ContextVar-based input/capture registries with separate output and state registries; concurrent async eval runs get isolated state
pixie/instrumentation/wrap_serialization.py — jsonpickle encode/decode helpers (type-preserving, not plain JSON)
pixie/instrumentation/wrap.py — wrap() + WrapRegistryMissError / WrapTypeMismatchError
pixie/instrumentation/wrap_log.py — WrapLogEntry pydantic model + load_wrap_log_entries() and filter_by_purpose() utilities for loading JSONL trace data into typed objects
pixie/instrumentation/trace_writer.py — thread-safe JSONL writer for wrap events and LLM spans
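The isolation claim for wrap_registry.py rests on how ContextVar interacts with asyncio tasks: each task runs in a copy of the context, so a value set in one task is invisible to its siblings. A small illustration, using a hypothetical `_registry` variable rather than the PR's actual registries:

```python
import asyncio
from contextvars import ContextVar

_registry: ContextVar[dict] = ContextVar("registry")  # hypothetical name


async def run_one(case_id: str) -> dict:
    # set() only affects the context of the task running this coroutine.
    _registry.set({"case": case_id})
    await asyncio.sleep(0)  # yield control so the two runs interleave
    return _registry.get()


async def main() -> list:
    # gather() wraps each coroutine in its own task, and each task
    # receives its own copy of the current context.
    return await asyncio.gather(run_one("a"), run_one("b"))


results = asyncio.run(main())
```

Each run reads back the dict it stored, even though both runs were in flight at the same time.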

Config changes

Two new env vars read into PixieConfig:

PIXIE_TRACING — enables tracing mode (1/true/yes/on)
PIXIE_TRACE_OUTPUT — path for JSONL trace output file
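A flag that accepts 1/true/yes/on can be parsed along these lines. This is a sketch only: `env_flag` is a hypothetical helper, and the case-insensitive, whitespace-trimming comparison is an assumption rather than something the PR states.

```python
import os
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: comparison ignores case and surrounding whitespace.
    return environ.get(name, "").strip().lower() in _TRUTHY


tracing_enabled = env_flag("PIXIE_TRACING")
trace_output = os.environ.get("PIXIE_TRACE_OUTPUT")  # None when unset
```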

Trace writer integration

enable_storage() creates a TraceFileWriter when both vars are set. wrap() and LLMSpanProcessor.on_end() write to it when active.
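Because both wrap() and the span processor may write from different threads, the writer has to serialize appends. A minimal sketch of that idea, with `JsonlWriter` as a hypothetical stand-in for TraceFileWriter (its real interface is not shown in this PR description):

```python
import json
import threading
from pathlib import Path


class JsonlWriter:  # hypothetical stand-in for TraceFileWriter
    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._lock = threading.Lock()

    def write(self, record: dict) -> None:
        # Serialize outside the lock; only the file append is guarded.
        line = json.dumps(record, default=str)
        with self._lock:
            with self._path.open("a", encoding="utf-8") as f:
                f.write(line + "\n")
```

Holding the lock for the whole append keeps each record on its own line even when several threads write at once.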

OTel event emission

_emit_wrap_event() adds events to the active span when one exists. When no recording span is active (the common case for wrap() calls outside OTel-instrumented code), it falls back to the Python logger, which is captured by OTel log exporters if configured.
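The fallback logic can be sketched without importing OpenTelemetry by passing the span in explicitly. In the OpenTelemetry Python API the span would come from `opentelemetry.trace.get_current_span()`, which returns a non-recording span when none is active. `emit_wrap_event`, `SpanLike`, the attribute keys, and the logger name below are all hypothetical.

```python
import logging
from typing import Optional, Protocol

logger = logging.getLogger("wrap_sketch")  # hypothetical logger name


class SpanLike(Protocol):
    def is_recording(self) -> bool: ...
    def add_event(self, name: str, attributes: Optional[dict] = None) -> None: ...


def emit_wrap_event(span: Optional[SpanLike], name: str, purpose: str) -> str:
    """Attach the event to a recording span, else fall back to logging."""
    if span is not None and span.is_recording():
        span.add_event("wrap", attributes={"wrap.name": name, "wrap.purpose": purpose})
        return "span"
    # "name" is a reserved LogRecord attribute, so the extra keys are prefixed.
    logger.info("wrap event", extra={"wrap_name": name, "wrap_purpose": purpose})
    return "log"
```

The return value exists only to make the chosen path observable in this sketch.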

Dataset format

Dataset eval_input is an array of wrap log entry objects matching the JSONL trace file format exactly. Each entry has type, name, purpose, and data fields:

{
  "eval_input": [
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {"id": "C001", "name": "Alice"}}
  ],
  "expected_output": "Hello Alice! How can I help you?"
}

The WrapLogEntry pydantic model in pixie/instrumentation/wrap_log.py provides the shared typed representation used by both the trace filter CLI and the dataset/test runner.
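As an illustration of the typed representation, the sketch below parses the example dataset item into entry objects. It uses a plain dataclass named `Entry` as a dependency-free stand-in for the pydantic WrapLogEntry model, and the validation rules (required fields, the four purpose values) are assumptions drawn from this description.

```python
import json
from dataclasses import dataclass
from typing import Any

_PURPOSES = {"entry", "input", "output", "state"}
_FIELDS = ("type", "name", "purpose", "data")


@dataclass(frozen=True)
class Entry:  # dataclass stand-in for the pydantic WrapLogEntry
    type: str
    name: str
    purpose: str
    data: Any


def parse_entry(raw: dict) -> Entry:
    missing = [f for f in _FIELDS if f not in raw]
    if missing:
        raise ValueError(f"wrap log entry is missing fields: {missing}")
    if raw["purpose"] not in _PURPOSES:
        raise ValueError(f"unknown purpose: {raw['purpose']!r}")
    return Entry(raw["type"], raw["name"], raw["purpose"], raw["data"])


dataset_item = json.loads("""
{
  "eval_input": [
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {"type": "wrap", "name": "customer_profile", "purpose": "input",
     "data": {"id": "C001", "name": "Alice"}}
  ],
  "expected_output": "Hello Alice! How can I help you?"
}
""")
entries = [parse_entry(e) for e in dataset_item["eval_input"]]
```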

Evaluable model

eval_input holds the array of WrapLogEntry-shaped dicts
get_wrap_inputs() helper parses and validates the wrap log entry format
captured_output: dict[str, JsonValue] | None — values from wrap(purpose="output")
captured_state: dict[str, JsonValue] | None — values from wrap(purpose="state")

Test runner

_run_entry detects wrap-mode datasets via get_wrap_inputs(). In wrap mode, entry-purpose items are aggregated by name into a dict passed to the runnable, and input-purpose items populate the input registry. Separate output and state capture registries ensure eval_output reflects only purpose="output" data.
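The split described above can be sketched as follows. `split_wrap_inputs` is a hypothetical helper, not the runner's actual code; it returns the dict passed to the runnable and the dict that would seed the input registry.

```python
from typing import Any, Dict, List, Tuple


def split_wrap_inputs(entries: List[dict]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    runnable_args: Dict[str, Any] = {}  # purpose="entry", keyed by name
    injected: Dict[str, Any] = {}       # purpose="input", for the registry
    for entry in entries:
        if entry["purpose"] == "entry":
            runnable_args[entry["name"]] = entry["data"]
        elif entry["purpose"] == "input":
            injected[entry["name"]] = entry["data"]
    return runnable_args, injected


runnable_args, injected = split_wrap_inputs([
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {"type": "wrap", "name": "customer_profile", "purpose": "input",
     "data": {"id": "C001", "name": "Alice"}},
])
```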

CLI: `pixie trace filter`

uv run pixie trace filter trace.jsonl --purpose entry,input

Filters a JSONL trace file using WrapLogEntry model for typed parsing; outputs matching entries as JSON.
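The filtering step amounts to parsing each JSONL line and keeping wrap records whose purpose is in the comma-separated list. A rough sketch, where `filter_trace_lines` is a hypothetical function and the `"llm_span"` type value in the sample is invented purely to show a non-wrap record being skipped:

```python
import json
from typing import Iterable, Iterator


def filter_trace_lines(lines: Iterable[str], purposes: str) -> Iterator[dict]:
    wanted = {p.strip() for p in purposes.split(",") if p.strip()}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        if record.get("type") == "wrap" and record.get("purpose") in wanted:
            yield record


sample = [
    '{"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"}',
    '{"type": "llm_span", "name": "chat"}',
    '{"type": "wrap", "name": "response", "purpose": "output", "data": "Hi"}',
]
matches = list(filter_trace_lines(sample, "entry,input"))
```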

Manual test fixtures

tests/manual/chatbot.py — CLI chatbot with fake database lookup (tool calling), using wrap() for every purpose type (entry, input, output, state) on both data values and callable providers. The chat() entry point serves as the runnable for evaluation.
tests/manual/verify_wrap_e2e.py — End-to-end verification script that:
1. Runs the chatbot with tracing enabled and trace file configured
2. Validates trace log content (all purpose types, expected wrap names)
3. Creates a dataset from filtered trace log (purpose=entry/input as eval_input, purpose=output for expected_output)
4. Runs pixie test on the generated dataset
5. Validates test results (all 3 conversations pass at 1.00 score)
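Step 3 of the script can be pictured with the sketch below. `build_dataset_item` is a hypothetical helper; taking the last purpose=output value as expected_output is an assumption, and the other dataset fields (runnable, evaluators, description) are left out.

```python
from typing import List


def build_dataset_item(trace: List[dict]) -> dict:
    wraps = [e for e in trace if e.get("type") == "wrap"]
    eval_input = [e for e in wraps if e.get("purpose") in ("entry", "input")]
    outputs = [e["data"] for e in wraps if e.get("purpose") == "output"]
    if not outputs:
        raise ValueError("trace contains no purpose=output entry")
    # Assumption: the last captured output is the expected output.
    return {"eval_input": eval_input, "expected_output": outputs[-1]}


item = build_dataset_item([
    {"type": "wrap", "name": "user_message", "purpose": "entry", "data": "Hello"},
    {"type": "wrap", "name": "customer_profile", "purpose": "input",
     "data": {"id": "C001", "name": "Alice"}},
    {"type": "wrap", "name": "response", "purpose": "output",
     "data": "Hello Alice! How can I help you?"},
])
```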

Original prompt

implement according to spec

The user has attached the following file paths as relevant context:

specs/wrap-api-and-eval-redesign.md

.github/copilot-instructions.md

CLAUDE.md

Created from VS Code.

- Add pixie/instrumentation/wrap_registry.py: context-var registries for input injection (_input_registry) and output capture (_capture_registry) with set/get/clear/init helpers
- Add pixie/instrumentation/wrap_serialization.py: jsonpickle-based serialize_wrap_data() / deserialize_wrap_data() helpers
- Add pixie/instrumentation/wrap.py: wrap() function with three modes:
  - No-op: returns data unchanged (default, no overhead)
  - Tracing (PIXIE_TRACING=1): emits OTel span events
  - Eval (input registry active): injects inputs, captures outputs/state
  WrapRegistryMissError raised when input key missing in eval mode.
- Add pixie/instrumentation/trace_writer.py: thread-safe JSONL trace file writer (TraceFileWriter) for wrap events and LLM spans
- Update pixie/config.py: add trace_output and tracing_enabled fields to PixieConfig; read PIXIE_TRACE_OUTPUT and PIXIE_TRACING env vars
- Update pixie/instrumentation/__init__.py and pixie/__init__.py: export wrap, WrapRegistryMissError, WrapTypeMismatchError and all registry helper functions
- Add tests: test_wrap_registry, test_wrap_serialization, test_wrap, test_config_tracing (36 new tests, 676 total passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

- Add module docstring to wrap_registry.py
- Fix 'behaviour' -> 'behavior' spelling in wrap.py module docstring
- Remove unused 'os' import in test_wrap.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

yiouli · 2026-04-06T05:23:23Z

@copilot You are making the wrong assumption about the dataset format.

The new dataset format has eval_input (app_input) in exactly the same format as the trace log: an array of JSON objects, one per wrap-logged data object whose purpose is either entry or input.

There's no backward compatibility; your implementations for getting the entry-point input and the dependency input are both wrong.

Implementation-wise, because multiple places need to load the JSONL (wrap-logged objects), you need to create a common utility for loading that data into typed objects (probably a pydantic model).

Again, because we don't need backward compatibility, you should just update the evaluable to hold list[LoggedData] directly as the eval input.

Also, because there is likely no parent span when wrap is called, you need to log the wrap events with the OTel event/logging API, which still works without a parent span. I'm not sure your current implementation would actually log the events when there's no parent span.

And finally, you need to revamp the manual test fixture under tests/manual/. It should include:

a basic CLI chatbot implementation with tool calling (fake database lookup); multiple wrap calls should be placed at reasonable points in the app, covering every purpose type and both a data value and a data provider function. The entry point of the CLI chatbot should be usable as the runnable for evaluation.
the same mock evaluators
a script that performs the following steps and verifies that each one works as expected:
1. run the runnable with tracing enabled and a trace file configured
2. load the trace log file and validate that its content is as expected
3. create the dataset directly from the filtered trace log (array of JSON objects, purpose=entry/input) as the eval input, derive expected_output from the filtered trace log (purpose=output), and configure the other fields properly (runnable, evaluators, description, etc.)
4. call the pixie test command on the dataset
5. verify that the test result is as expected

The script should print a detailed error message if any of its validations fails. You need to use that script to verify that your code is working.

Copilot · 2026-04-06T05:36:23Z

@copilot You are making the wrong assumption of the dataset format.

The new dataset format would have the eval_input (app_input) in the exact same format of the trace logging - arra...

Addressed all the feedback in commits 95681f5 and 0d03304:

Dataset format: eval_input is now an array of wrap log entry objects matching the trace file format exactly. Removed entry_input/dependency_input fields.
Common utility: Created pixie/instrumentation/wrap_log.py with WrapLogEntry pydantic model + load_wrap_log_entries() and filter_by_purpose() — shared by trace filter CLI and dataset loader.
Evaluable: Added get_wrap_inputs() to parse eval_input as list[WrapLogEntry]. Split capture registries into output/state so eval_output gets the correct data.
OTel logging: _emit_wrap_event() falls back to Python logger (picked up by OTel log exporters if configured) when no active span exists.
Manual test fixtures:
- tests/manual/chatbot.py — CLI chatbot with fake DB lookup, wrap() on every purpose type (entry, input, output, state), both data values and callables
- tests/manual/verify_wrap_e2e.py — 5-step verification: run with tracing → validate trace → create dataset from filtered trace → pixie test → validate results

All 3 conversations pass at 1.00 score. 676 unit tests pass, mypy/ruff clean.

Initial plan

5bbcf38

Copilot AI assigned Copilot and yiouli Apr 6, 2026

Copilot started work on behalf of yiouli April 6, 2026 04:21 View session

Copilot AI and others added 5 commits April 6, 2026 04:33

feat: implement wrap API, trace writer, dataset runner changes per spec

4385dc1

Agent-Logs-Url: https://github.com/yiouli/pixie-qa/sessions/b85ebaff-6dd1-48d0-9857-6ad238bd533a Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

fix: address code review comments on wrap API implementation

76ef8f7

Agent-Logs-Url: https://github.com/yiouli/pixie-qa/sessions/b85ebaff-6dd1-48d0-9857-6ad238bd533a Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

fix: final review comment fixes - error message, trace filter warning, wrap docstring

d0bd2d8

Agent-Logs-Url: https://github.com/yiouli/pixie-qa/sessions/b85ebaff-6dd1-48d0-9857-6ad238bd533a Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Implement according to specifications for API and eval redesign~~ Implement pixie.wrap() API and eval process redesign Apr 6, 2026

Copilot AI requested a review from yiouli April 6, 2026 04:48

Copilot finished work on behalf of yiouli April 6, 2026 04:48

yiouli marked this pull request as ready for review April 6, 2026 05:23

Copilot started work on behalf of yiouli April 6, 2026 05:23 View session

Copilot AI and others added 2 commits April 6, 2026 05:33

refactor: address reviewer feedback on wrap API dataset format and manual fixtures

95681f5

Agent-Logs-Url: https://github.com/yiouli/pixie-qa/sessions/828ad56d-c97d-4f82-894b-c03ea1360241 Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

fix: address code review feedback on wrap API

0d03304

Agent-Logs-Url: https://github.com/yiouli/pixie-qa/sessions/828ad56d-c97d-4f82-894b-c03ea1360241 Co-authored-by: yiouli <3508494+yiouli@users.noreply.github.com>

Copilot finished work on behalf of yiouli April 6, 2026 05:38

yiouli merged commit 10b32f0 into main Apr 6, 2026

yiouli deleted the copilot/implement-specifications-redesign branch April 6, 2026 15:28

yiouli merged 8 commits into main from copilot/implement-specifications-redesign
