Skip to content

feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399

Open
eric-tramel wants to merge 15 commits intomainfrom
feature/trace-directory-normalizers-v1
Open

feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399
eric-tramel wants to merge 15 commits intomainfrom
feature/trace-directory-normalizers-v1

Conversation

@eric-tramel
Copy link
Contributor

@eric-tramel eric-tramel commented Mar 11, 2026

Summary

Add a single AgentRolloutSeedSource and AgentRolloutSeedReader that normalize agent rollout traces from multiple formats (Claude Code, Codex) into a common row shape for use in prompts, expressions, and downstream curation workflows.

Built on top of the FileSystemSeedReader manifest/hydrate architecture from #421, with the manifest phase doing cheap file discovery and the hydrate phase doing per-file parsing with 1:many fanout.

Dependency

What This Adds

  • AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE) — config
  • AgentRolloutSeedReader — lazy manifest/hydrate reader
  • AgentRolloutFormat enum — claude_code, codex
  • Per-format handlers with is_handled_file() / parse_file() interface

Quick Start

Claude Code with the built-in default path:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CLAUDE_CODE,
)

Codex with the built-in default path:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CODEX,
)

You can override the path explicitly for any format:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/codex",
    format=dd.AgentRolloutFormat.CODEX,
)

What You Get

Each provider-specific rollout file is normalized into a shared seed row shape:

Field Meaning
trace_id Stable identifier for the normalized rollout row.
source_kind Rollout format/provider, such as claude_code or codex.
source_path Original file path on disk for the rollout artifact.
root_session_id Root session identifier for the rollout.
agent_id Agent identifier when the source format exposes one.
is_sidechain Whether the rollout represents a delegated/sub-agent branch.
cwd Working directory captured for the rollout.
project_path Project or repository path associated with the rollout.
git_branch Git branch captured from the rollout metadata, when available.
started_at Start timestamp for the rollout/session.
ended_at End timestamp for the rollout/session.
messages Normalized conversation/tool transcript payload.
message_count Number of normalized messages in the rollout.
tool_call_count Count of tool calls observed in the rollout.
final_assistant_message Final assistant message extracted from the rollout, when available.
source_meta Provider-specific metadata preserved alongside the normalized row.

What This Unlocks

Once attached as a seed dataset, rollout rows can drive prompt- and expression-based generation directly:

import data_designer.config as dd
from data_designer.interface import DataDesigner

config_builder = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="nvidia-super",
        model="nvidia/nemotron-3-super-120b-a12b",
        provider="nvidia",
    )
])

config_builder.with_seed_dataset(
    dd.AgentRolloutSeedSource(
        format=dd.AgentRolloutFormat.CODEX,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="trace_summary",
        model_alias="nvidia-super",
        prompt="""Summarize this rollout as a reusable engineering task.

source_kind: {{ source_kind }}
trace_id: {{ trace_id }}
cwd: {{ cwd }}
git_branch: {{ git_branch }}
message_count: {{ message_count }}
tool_call_count: {{ tool_call_count }}
final_assistant_message: {{ final_assistant_message }}
""",
    )
)

results = DataDesigner().preview(config_builder, num_records=5)

This PR also includes a generic rollout-distillation recipe that turns rollout rows into SFT candidates:

uv run agent_rollout_distillation.py --format claude_code --preview
uv run agent_rollout_distillation.py --format codex --shuffle --num-records 20

Defaults

  • AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE) defaults to ~/.claude/projects
  • AgentRolloutSeedSource(format=AgentRolloutFormat.CODEX) defaults to ~/.codex/sessions
  • Built-in rollout formats default file_pattern to *.jsonl
  • path and file_pattern stay None in serialized form when omitted (no baked-in machine-specific defaults)

Design

Architecture

  • Rollout ingestion is modeled as a built-in filesystem seed reader on top of feat: add built-in filesystem seed readers #421's manifest-first architecture
  • build_manifest() does cheap file discovery only — no file reads, no JSON parsing
  • hydrate_row() does per-file parsing with 1:many fanout (one file can produce many normalized rows)
  • get_seed_dataset_size() returns the file count (manifest row count), not the parsed record count
  • The HydratingSeedReaderBatchReader handles fanout transparently — num_records in DataDesigner.create() works correctly because the engine fetches batches until the target record count is met

Config

  • AgentRolloutSeedSource uses format: AgentRolloutFormat (a plain enum) instead of a nested format config hierarchy
  • Default path and file-pattern resolution happens at runtime via resolved_file_pattern and get_agent_rollout_format_defaults(), keeping serialized configs declarative

Module boundaries

  • Each format handler owns one parse_file() entrypoint (per-file, not per-directory)
  • The reader owns file discovery (get_matching_relative_paths + is_handled_file), manifest construction, shared context (Claude session index), and error normalization into SeedReaderError
  • Claude session index scanning respects the recursive setting on the source
  • Shared rollout normalization helpers live in agent_rollout_seed_parser.py
  • Provider-specific parsing lives in agent_rollout_format_handlers.py

Fault Tolerance

  • Empty matched files are skipped with warnings during hydration
  • Malformed matched files are skipped with warnings during hydration
  • Unhandled matched files are skipped with warnings during manifest construction
  • OSError during hydration is caught and wrapped as SeedReaderError

Docs and Recipe

  • Updates the seed-dataset concept docs to document AgentRolloutSeedSource and the format= API
  • Includes a generic agent_rollout_distillation.py recipe driven by --format and --trace-dir
  • The recipe derives trace_digest, emits a standalone sft_record, runs an sft_quality_judge_result, and computes recommended_for_sft

Test Plan

  • make check-all — all lint and format checks pass
  • make test — all tests pass (config + engine + interface)
  • Config tests: round-trip serialization, default path resolution, file pattern validation
  • Engine tests: manifest laziness (mocked file reads prove they happen in hydration, not manifest), file-count-based get_seed_dataset_size(), OSError wrapping, Claude session index recursive setting
  • Interface e2e tests: both formats (Claude Code, Codex), skip/malformed/unhandled files, all-files-invalid error
  • Recipe verified: uv run agent_rollout_distillation.py --format claude_code --num-records 2 --preview

🤖 Generated with Claude Code

@eric-tramel eric-tramel force-pushed the feature/directory-seed-transforms-v1 branch from 1e62c42 to e1aa97b Compare March 12, 2026 14:01
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 4ba6373 to 6c91a06 Compare March 12, 2026 14:22
@eric-tramel eric-tramel self-assigned this Mar 13, 2026
@eric-tramel eric-tramel added enhancement New feature or request labels Mar 13, 2026
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 6c91a06 to 8b0999a Compare March 16, 2026 16:00
@eric-tramel eric-tramel changed the title feat: add built-in trace directory normalizers feat: add built-in trace seed sources Mar 16, 2026
@eric-tramel eric-tramel changed the base branch from feature/directory-seed-transforms-v1 to feature/filesystem-seed-readers-v1 March 16, 2026 16:00
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from dd07c95 to 909ff76 Compare March 16, 2026 17:08
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from b2b95cd to 1913096 Compare March 18, 2026 02:10
@eric-tramel eric-tramel changed the base branch from feature/filesystem-seed-readers-v1 to main March 18, 2026 12:31
@eric-tramel eric-tramel changed the title feat: add built-in trace seed sources feat: add AgentRollout seed source and formats Mar 18, 2026
@eric-tramel eric-tramel marked this pull request as ready for review March 18, 2026 13:18
@eric-tramel eric-tramel requested a review from a team as a code owner March 18, 2026 13:18
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR introduces AgentRolloutSeedSource and AgentRolloutSeedReader, a lazy manifest/hydrate filesystem seed reader that normalises Claude Code and Codex agent rollout traces into a common row schema for use in prompts, expression columns, and SFT distillation pipelines. It builds cleanly on top of the FileSystemSeedReader base from #421 and ships with a full distillation recipe.

Key additions:

  • AgentRolloutFormat enum (claude_code, codex), AgentRolloutSeedSource config, and AgentRolloutSeedReader engine class
  • Per-format handlers (ClaudeCodeAgentRolloutFormatHandler, CodexAgentRolloutFormatHandler) with is_handled_file / parse_file interface
  • Shared normalisation helpers in agent_rollout_seed_parser.py (message role mapping, content-block coercion, session-index loading)
  • agent_rollout_distillation.py recipe for trace → SFT pipeline with LLM judge scoring
  • Docs updates for DirectorySeedSource, FileContentsSeedSource, and the new AgentRolloutSeedSource

Issues found:

  • The PR description prominently documents a third format value CHAT_COMPLETION_JSONL (including a Quick Start code snippet), but it is entirely absent from the AgentRolloutFormat enum, the handler registry, get_agent_rollout_format_defaults, and all tests. If deferred, the PR description should be updated.
  • validate_resolved_path_exists assigns self._runtime_path after validating the resolved path, but the inherited FileSystemSeedSource.model_post_init (which pydantic v2 calls after all validators) resets it to None when path is None. The runtime_path property handles this via lazy resolution so there is no runtime bug, but the assignment in the validator is dead work and the interaction is subtle.
  • test_claude_session_index_scanning_respects_recursive_false does not verify that recursive=False prevents scanning into subdirectories — both the recursive and non-recursive readers in the test find the same index file at the same level, producing identical results.

Confidence Score: 4/5

  • Safe to merge with minor issues; the main blocker is a PR description/code mismatch for CHAT_COMPLETION_JSONL that should be clarified before merge.
  • The core manifest/hydrate architecture is sound, the format handlers are well-structured, and test coverage is thorough for the two implemented formats. No runtime bugs were found. Score is 4 rather than 5 because of the CHAT_COMPLETION_JSONL discrepancy between the PR description and the code (which could land confusing docs if merged as-is), the subtle model_post_init interaction that leaves dead code in the validator, and the test gap for recursive=False isolation.
  • packages/data-designer-config/src/data_designer/config/seed_source.py (AgentRolloutFormat enum and validate_resolved_path_exists), packages/data-designer-engine/tests/engine/resources/test_seed_reader.py (recursive=False test coverage)

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_seed_parser.py New file: shared normalization helpers for all agent rollout formats. Well-structured with clear error handling, safe coerce_optional_str wrappers, and properly isolated helpers. No blocking issues.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py New file: per-format handlers for Claude Code and Codex. CHAT_COMPLETION_JSONL is mentioned in the PR description but has no corresponding handler here. Minor session_index None-guard redundancy in ClaudeCode handler.
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py Adds AgentRolloutSeedReader with correct manifest/hydrate split, lazy reader_context caching, and SeedReaderError wrapping for OSErrors. Logic is sound.
packages/data-designer-config/src/data_designer/config/seed_source.py Adds AgentRolloutSeedSource with optional path/file_pattern, model validator for resolved-path existence check, and runtime_path property with lazy default resolution. validate_resolved_path_exists redundantly re-validates an explicit path already validated by the field validator, and its _runtime_path assignment is overwritten by the inherited model_post_init.
packages/data-designer-engine/tests/engine/resources/test_seed_reader.py Good new coverage for manifest laziness, OSError wrapping, file-count sizing, and session-index scanning. The test_claude_session_index_scanning_respects_recursive_false test doesn't actually verify that recursive=False prevents scanning into subdirectories of root_path.
docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py Comprehensive SFT distillation recipe with judge scoring and partition support. parse_args() naming is already flagged in prior threads.
packages/data-designer-config/tests/config/test_seed_source.py Adds AgentRolloutSeedSource config tests: round-trip serialization, default path resolution, and file-pattern validation. Coverage looks solid.
packages/data-designer/tests/interface/test_data_designer.py Adds interface-level e2e tests. No issues observed in the diff.

Sequence Diagram

sequenceDiagram
    participant User
    participant AgentRolloutSeedSource
    participant AgentRolloutSeedReader
    participant FormatHandler
    participant Parser

    User->>AgentRolloutSeedSource: AgentRolloutSeedSource(format=CLAUDE_CODE)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)

    User->>AgentRolloutSeedReader: attach(source, resolver)
    User->>AgentRolloutSeedReader: get_seed_dataset_size()
    AgentRolloutSeedReader->>FormatHandler: get_format_handler()
    AgentRolloutSeedReader->>AgentRolloutSeedReader: build_manifest() — file discovery only, no reads
    AgentRolloutSeedReader-->>User: file count (manifest rows)

    User->>AgentRolloutSeedReader: create_batch_reader().read_next_batch()
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_reader_context() — loads Claude session index once
    loop for each manifest row
        AgentRolloutSeedReader->>FormatHandler: hydrate_row(manifest_row)
        FormatHandler->>Parser: parse_file(root_path, relative_path, reader_context)
        Parser->>Parser: load_jsonl_rows() — reads file
        Parser->>Parser: normalize messages (Claude/Codex specific)
        Parser->>Parser: build_agent_rollout_record()
        Parser-->>FormatHandler: list[NormalizedAgentRolloutRecord]
        FormatHandler-->>AgentRolloutSeedReader: list[dict] (1:many fanout)
    end
    AgentRolloutSeedReader-->>User: Arrow batch (record count >= num_records)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 196

Comment:
**`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary**

The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it:

```python
seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/chat-completions",
    format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)
```

However, the actual enum only defines two values:

```python
class AgentRolloutFormat(StrEnum):
    CLAUDE_CODE = "claude_code"
    CODEX = "codex"
```

`CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 231-241

Comment:
**Redundant path validation and overwritten `_runtime_path` assignment in model validator**

`validate_resolved_path_exists` calls `_validate_filesystem_seed_source_path(resolved_path)` for an explicit path that has already been validated by the `validate_path` field validator just above it. The double-check is harmless but unnecessary.

More subtly: the assignment `self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)` on line 239 is overwritten by the inherited `FileSystemSeedSource.model_post_init`, which pydantic v2 calls _after_ all model validators. When `self.path is None`, `model_post_init` resets `_runtime_path` back to `None`. The `runtime_path` property correctly compensates with lazy resolution, so there is no runtime bug — but the assignment in the validator is dead work for the default-path case.

Consider either removing the `_runtime_path` assignment from `validate_resolved_path_exists` (relying entirely on the property's lazy path), or overriding `model_post_init` in `AgentRolloutSeedSource` to set the cache correctly after validation completes:

```python
def model_post_init(self, __context: Any) -> None:
    default_path, _ = get_agent_rollout_format_defaults(self.format)
    resolved_path = self.path or default_path
    self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-engine/tests/engine/resources/test_seed_reader.py
Line: 946-980

Comment:
**Test doesn't actually verify that `recursive=False` prevents scanning into subdirectories**

The non-recursive reader is attached with `path=str(session_dir)`, so its `root_path` is already the leaf directory (`project-a/`). The `sessions-index.json` is located _directly_ in `session_dir`, not in a deeper subdirectory. Both `glob("sessions-index.json")` and `rglob("sessions-index.json")` find it at that level, so the recursive and non-recursive cases produce identical results and the assertion `list(non_recursive_df["project_path"]) == ["/from-nested-index"]` doesn't distinguish between the two modes.

To actually cover the `recursive=False` isolation contract, consider a setup where:
- `root_path = tmp_path` (not `session_dir`)
- `sessions-index.json` lives only in a subdirectory of `root_path`
- The non-recursive reader (pointed at `tmp_path`) should _not_ find the nested index and should fall back to `cwd`
- The recursive reader should find it

```python
# Non-recursive pointed at tmp_path should NOT pick up sessions-index.json
# nested inside project-a/ — project_path should fall back to cwd
reader_non_recursive.attach(
    AgentRolloutSeedSource(
        path=str(tmp_path),   # root is tmp_path, index is nested under project-a/
        format=AgentRolloutFormat.CLAUDE_CODE,
        file_pattern="*.jsonl",
        recursive=False,
    ),
    PlaintextResolver(),
)
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "refactor: remove cha..."

@eric-tramel eric-tramel changed the title feat: add AgentRollout seed source and formats feat: add AgentRollout seed source with lazy manifest/hydrate architecture Mar 19, 2026
eric-tramel and others added 5 commits March 19, 2026 09:25
…eader

Address all 6 findings from the architecture review of the agent rollout
seed reader:

1. (High) Preserve manifest/hydrate split: build_manifest() now does
   cheap file discovery only; hydrate_row() does per-file parsing with
   1:many fanout. Removes eager _normalized_records_by_locator cache.

2. (Medium) Simplify config surface: remove AgentRolloutFormatConfig
   hierarchy (5 classes), replace with format: AgentRolloutFormat enum.
   Serialized configs preserve None for path/file_pattern instead of
   baking in machine-specific defaults.

3. (Medium) Centralize Claude session index scanning in the reader via
   lazy AgentRolloutReaderContext. Respect recursive=False setting.

4. (Medium) Wrap OSError in hydrate_row() as SeedReaderError so file
   I/O errors don't leak past the seed-reader boundary.

5. (Medium) Make chat-completion file ingestion atomic — if any row
   fails to parse, the entire file is rejected.

6. (Low) Fix fallback trace_id from file_path.stem:line_number to
   relative_path:line_number to prevent collisions across same-stem
   files in different directories.

Adds targeted contract tests for manifest laziness, file-count-based
get_seed_dataset_size(), OSError wrapping, atomic file rejection,
trace_id collision prevention, and recursive session index scanning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eric-tramel eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 091ecff to 7d56607 Compare March 19, 2026 13:26
Comment on lines +411 to +456
choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat],
help="Built-in rollout format to read.",
)
parser.add_argument(
"--trace-dir",
type=Path,
default=None,
help=(
"Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to "
"~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` "
"requires an explicit path."
),
)
parser.add_argument("--model-alias", type=str, default="nvidia-super")
parser.add_argument("--num-records", type=int, default=5)
parser.add_argument("--artifact-path", type=str, default=None)
parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows")
parser.add_argument(
"--preview",
action="store_true",
help="Run the recipe in preview mode and keep the generated dataset in memory.",
)
parser.add_argument(
"--shuffle",
action="store_true",
help="Shuffle the normalized trace rows before sampling.",
)
parser.add_argument(
"--partition-index",
type=int,
default=None,
help="Optional partition index for large trace corpora.",
)
parser.add_argument(
"--num-partitions",
type=int,
default=None,
help="Optional total number of partitions for large trace corpora.",
)
return parser


def resolve_selection_strategy(
partition_index: int | None,
num_partitions: int | None,
) -> dd.PartitionBlock | None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 parse_args returns an ArgumentParser, not parsed arguments

The function is named parse_args but returns the ArgumentParser object itself. Callers must chain a second .parse_args() call (parse_args().parse_args()), which is semantically surprising. The return type annotation (-> ArgumentParser) is technically accurate, but the naming convention parse_args strongly implies the returned value is the parsed Namespace.

Consider either renaming the function to build_arg_parser / create_parser, or having it return the parsed args directly:

Suggested change
choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat],
help="Built-in rollout format to read.",
)
parser.add_argument(
"--trace-dir",
type=Path,
default=None,
help=(
"Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to "
"~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` "
"requires an explicit path."
),
)
parser.add_argument("--model-alias", type=str, default="nvidia-super")
parser.add_argument("--num-records", type=int, default=5)
parser.add_argument("--artifact-path", type=str, default=None)
parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows")
parser.add_argument(
"--preview",
action="store_true",
help="Run the recipe in preview mode and keep the generated dataset in memory.",
)
parser.add_argument(
"--shuffle",
action="store_true",
help="Shuffle the normalized trace rows before sampling.",
)
parser.add_argument(
"--partition-index",
type=int,
default=None,
help="Optional partition index for large trace corpora.",
)
parser.add_argument(
"--num-partitions",
type=int,
default=None,
help="Optional total number of partitions for large trace corpora.",
)
return parser
def resolve_selection_strategy(
partition_index: int | None,
num_partitions: int | None,
) -> dd.PartitionBlock | None:
def parse_args() -> ArgumentParser:
def create_parser() -> ArgumentParser:

And updating the call site in main():

args = create_parser().parse_args()
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py
Line: 411-456

Comment:
**`parse_args` returns an `ArgumentParser`, not parsed arguments**

The function is named `parse_args` but returns the `ArgumentParser` object itself. Callers must chain a second `.parse_args()` call (`parse_args().parse_args()`), which is semantically surprising. The return type annotation (`-> ArgumentParser`) is technically accurate, but the naming convention `parse_args` strongly implies the returned value is the parsed `Namespace`.

Consider either renaming the function to `build_arg_parser` / `create_parser`, or having it return the parsed args directly:

```suggestion
def parse_args() -> ArgumentParser:
def create_parser() -> ArgumentParser:
```

And updating the call site in `main()`:

```python
args = create_parser().parse_args()
```

How can I resolve this? If you propose a fix, please make it concise.

…stion

The chat-completion format is underspecified and adds ~170 LOC of
format-specific code plus test/doc overhead. Deferring it to a future PR
keeps this one focused on the two well-defined agent harness formats
(Claude Code and Codex).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
class AgentRolloutFormat(StrEnum):
CLAUDE_CODE = "claude_code"
CODEX = "codex"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 AgentRolloutFormat missing CHAT_COMPLETION_JSONL described in PR summary

The PR description explicitly lists chat_completion_jsonl as a supported AgentRolloutFormat value and provides a full Quick Start example for it:

seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/chat-completions",
    format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)

However, the actual enum only defines two values:

class AgentRolloutFormat(StrEnum):
    CLAUDE_CODE = "claude_code"
    CODEX = "codex"

CHAT_COMPLETION_JSONL is absent from the enum, get_agent_rollout_format_defaults, BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 196

Comment:
**`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary**

The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it:

```python
seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/chat-completions",
    format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)
```

However, the actual enum only defines two values:

```python
class AgentRolloutFormat(StrEnum):
    CLAUDE_CODE = "claude_code"
    CODEX = "codex"
```

`CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant