feat: add AgentRollout seed source with lazy manifest/hydrate architecture by eric-tramel · Pull Request #399 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-03-11T19:30:17Z

Summary

Add a single AgentRolloutSeedSource and AgentRolloutSeedReader that normalize agent rollout traces from multiple formats (Claude Code, Codex) into a common row shape for use in prompts, expressions, and downstream curation workflows.

Built on top of the FileSystemSeedReader manifest/hydrate architecture from #421, with the manifest phase doing cheap file discovery and the hydrate phase doing per-file parsing with 1:many fanout.

Dependency

Depends on feat: add built-in filesystem seed readers #421

What This Adds

AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE) — config
AgentRolloutSeedReader — lazy manifest/hydrate reader
AgentRolloutFormat enum — claude_code, codex
Per-format handlers with is_handled_file() / parse_file() interface

Quick Start

Claude Code with the built-in default path:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CLAUDE_CODE,
)

Codex with the built-in default path:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CODEX,
)

You can override the path explicitly for any format:

import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/codex",
    format=dd.AgentRolloutFormat.CODEX,
)

What You Get

Each provider-specific rollout file is normalized into a shared seed row shape:

Field	Meaning
`trace_id`	Stable identifier for the normalized rollout row.
`source_kind`	Rollout format/provider, such as `claude_code` or `codex`.
`source_path`	Original file path on disk for the rollout artifact.
`root_session_id`	Root session identifier for the rollout.
`agent_id`	Agent identifier when the source format exposes one.
`is_sidechain`	Whether the rollout represents a delegated/sub-agent branch.
`cwd`	Working directory captured for the rollout.
`project_path`	Project or repository path associated with the rollout.
`git_branch`	Git branch captured from the rollout metadata, when available.
`started_at`	Start timestamp for the rollout/session.
`ended_at`	End timestamp for the rollout/session.
`messages`	Normalized conversation/tool transcript payload.
`message_count`	Number of normalized messages in the rollout.
`tool_call_count`	Count of tool calls observed in the rollout.
`final_assistant_message`	Final assistant message extracted from the rollout, when available.
`source_meta`	Provider-specific metadata preserved alongside the normalized row.

What This Unlocks

Once attached as a seed dataset, rollout rows can drive prompt- and expression-based generation directly:

import data_designer.config as dd
from data_designer.interface import DataDesigner

config_builder = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="nvidia-super",
        model="nvidia/nemotron-3-super-120b-a12b",
        provider="nvidia",
    )
])

config_builder.with_seed_dataset(
    dd.AgentRolloutSeedSource(
        format=dd.AgentRolloutFormat.CODEX,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="trace_summary",
        model_alias="nvidia-super",
        prompt="""Summarize this rollout as a reusable engineering task.

source_kind: {{ source_kind }}
trace_id: {{ trace_id }}
cwd: {{ cwd }}
git_branch: {{ git_branch }}
message_count: {{ message_count }}
tool_call_count: {{ tool_call_count }}
final_assistant_message: {{ final_assistant_message }}
""",
    )
)

results = DataDesigner().preview(config_builder, num_records=5)

This PR also includes a generic rollout-distillation recipe that turns rollout rows into SFT candidates:

uv run agent_rollout_distillation.py --format claude_code --preview
uv run agent_rollout_distillation.py --format codex --shuffle --num-records 20

Defaults

AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE) defaults to ~/.claude/projects
AgentRolloutSeedSource(format=AgentRolloutFormat.CODEX) defaults to ~/.codex/sessions
Built-in rollout formats default file_pattern to *.jsonl
path and file_pattern stay None in serialized form when omitted (no baked-in machine-specific defaults)

Design

Architecture

Rollout ingestion is modeled as a built-in filesystem seed reader on top of feat: add built-in filesystem seed readers #421's manifest-first architecture
build_manifest() does cheap file discovery only — no file reads, no JSON parsing
hydrate_row() does per-file parsing with 1:many fanout (one file can produce many normalized rows)
get_seed_dataset_size() returns the file count (manifest row count), not the parsed record count
The HydratingSeedReaderBatchReader handles fanout transparently — num_records in DataDesigner.create() works correctly because the engine fetches batches until the target record count is met

Config

AgentRolloutSeedSource uses format: AgentRolloutFormat (a plain enum) instead of a nested format config hierarchy
Default path and file-pattern resolution happens at runtime via resolved_file_pattern and get_agent_rollout_format_defaults(), keeping serialized configs declarative

Module boundaries

Each format handler owns one parse_file() entrypoint (per-file, not per-directory)
The reader owns file discovery (get_matching_relative_paths + is_handled_file), manifest construction, shared context (Claude session index), and error normalization into SeedReaderError
Claude session index scanning respects the recursive setting on the source
Shared rollout normalization helpers live in agent_rollout_seed_parser.py
Provider-specific parsing lives in agent_rollout_format_handlers.py

Fault Tolerance

Empty matched files are skipped with warnings during hydration
Malformed matched files are skipped with warnings during hydration
Unhandled matched files are skipped with warnings during manifest construction
OSError during hydration is caught and wrapped as SeedReaderError

Docs and Recipe

Updates the seed-dataset concept docs to document AgentRolloutSeedSource and the format= API
Includes a generic agent_rollout_distillation.py recipe driven by --format and --trace-dir
The recipe derives trace_digest, emits a standalone sft_record, runs an sft_quality_judge_result, and computes recommended_for_sft

Test Plan

make check-all — all lint and format checks pass
make test — all tests pass (config + engine + interface)
Config tests: round-trip serialization, default path resolution, file pattern validation
Engine tests: manifest laziness (mocked file reads prove they happen in hydration, not manifest), file-count-based get_seed_dataset_size(), OSError wrapping, Claude session index recursive setting
Interface e2e tests: both formats (Claude Code, Codex), skip/malformed/unhandled files, all-files-invalid error
Recipe verified: uv run agent_rollout_distillation.py --format claude_code --num-records 2 --preview

🤖 Generated with Claude Code

greptile-apps · 2026-03-18T13:24:25Z

Greptile Summary

This PR introduces AgentRolloutSeedSource and AgentRolloutSeedReader, a lazy manifest/hydrate filesystem seed reader that normalises Claude Code and Codex agent rollout traces into a common row schema for use in prompts, expression columns, and SFT distillation pipelines. It builds cleanly on top of the FileSystemSeedReader base from #421 and ships with a full distillation recipe.

Key additions:

AgentRolloutFormat enum (claude_code, codex), AgentRolloutSeedSource config, and AgentRolloutSeedReader engine class
Per-format handlers (ClaudeCodeAgentRolloutFormatHandler, CodexAgentRolloutFormatHandler) with is_handled_file / parse_file interface
Shared normalisation helpers in agent_rollout_seed_parser.py (message role mapping, content-block coercion, session-index loading)
agent_rollout_distillation.py recipe for trace → SFT pipeline with LLM judge scoring
Docs updates for DirectorySeedSource, FileContentsSeedSource, and the new AgentRolloutSeedSource

Issues found:

The PR description prominently documents a third format value CHAT_COMPLETION_JSONL (including a Quick Start code snippet), but it is entirely absent from the AgentRolloutFormat enum, the handler registry, get_agent_rollout_format_defaults, and all tests. If deferred, the PR description should be updated.
validate_resolved_path_exists assigns self._runtime_path after validating the resolved path, but the inherited FileSystemSeedSource.model_post_init (which pydantic v2 calls after all validators) resets it to None when path is None. The runtime_path property handles this via lazy resolution so there is no runtime bug, but the assignment in the validator is dead work and the interaction is subtle.
test_claude_session_index_scanning_respects_recursive_false does not verify that recursive=False prevents scanning into subdirectories — both the recursive and non-recursive readers in the test find the same index file at the same level, producing identical results.

Confidence Score: 4/5

Safe to merge with minor issues; the main blocker is a PR description/code mismatch for CHAT_COMPLETION_JSONL that should be clarified before merge.
The core manifest/hydrate architecture is sound, the format handlers are well-structured, and test coverage is thorough for the two implemented formats. No runtime bugs were found. Score is 4 rather than 5 because of the CHAT_COMPLETION_JSONL discrepancy between the PR description and the code (which could land confusing docs if merged as-is), the subtle model_post_init interaction that leaves dead code in the validator, and the test gap for recursive=False isolation.
packages/data-designer-config/src/data_designer/config/seed_source.py (AgentRolloutFormat enum and validate_resolved_path_exists), packages/data-designer-engine/tests/engine/resources/test_seed_reader.py (recursive=False test coverage)

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_seed_parser.py	New file: shared normalization helpers for all agent rollout formats. Well-structured with clear error handling, safe `coerce_optional_str` wrappers, and properly isolated helpers. No blocking issues.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py	New file: per-format handlers for Claude Code and Codex. `CHAT_COMPLETION_JSONL` is mentioned in the PR description but has no corresponding handler here. Minor session_index None-guard redundancy in ClaudeCode handler.
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py	Adds AgentRolloutSeedReader with correct manifest/hydrate split, lazy reader_context caching, and SeedReaderError wrapping for OSErrors. Logic is sound.
packages/data-designer-config/src/data_designer/config/seed_source.py	Adds AgentRolloutSeedSource with optional path/file_pattern, model validator for resolved-path existence check, and runtime_path property with lazy default resolution. `validate_resolved_path_exists` redundantly re-validates an explicit path already validated by the field validator, and its `_runtime_path` assignment is overwritten by the inherited `model_post_init`.
packages/data-designer-engine/tests/engine/resources/test_seed_reader.py	Good new coverage for manifest laziness, OSError wrapping, file-count sizing, and session-index scanning. The `test_claude_session_index_scanning_respects_recursive_false` test doesn't actually verify that recursive=False prevents scanning into subdirectories of root_path.
docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py	Comprehensive SFT distillation recipe with judge scoring and partition support. `parse_args()` naming is already flagged in prior threads.
packages/data-designer-config/tests/config/test_seed_source.py	Adds AgentRolloutSeedSource config tests: round-trip serialization, default path resolution, and file-pattern validation. Coverage looks solid.
packages/data-designer/tests/interface/test_data_designer.py	Adds interface-level e2e tests. No issues observed in the diff.

Sequence Diagram

sequenceDiagram
    participant User
    participant AgentRolloutSeedSource
    participant AgentRolloutSeedReader
    participant FormatHandler
    participant Parser

    User->>AgentRolloutSeedSource: AgentRolloutSeedSource(format=CLAUDE_CODE)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model validator)
    AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)

    User->>AgentRolloutSeedReader: attach(source, resolver)
    User->>AgentRolloutSeedReader: get_seed_dataset_size()
    AgentRolloutSeedReader->>FormatHandler: get_format_handler()
    AgentRolloutSeedReader->>AgentRolloutSeedReader: build_manifest() — file discovery only, no reads
    AgentRolloutSeedReader-->>User: file count (manifest rows)

    User->>AgentRolloutSeedReader: create_batch_reader().read_next_batch()
    AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_reader_context() — loads Claude session index once
    loop for each manifest row
        AgentRolloutSeedReader->>FormatHandler: hydrate_row(manifest_row)
        FormatHandler->>Parser: parse_file(root_path, relative_path, reader_context)
        Parser->>Parser: load_jsonl_rows() — reads file
        Parser->>Parser: normalize messages (Claude/Codex specific)
        Parser->>Parser: build_agent_rollout_record()
        Parser-->>FormatHandler: list[NormalizedAgentRolloutRecord]
        FormatHandler-->>AgentRolloutSeedReader: list[dict] (1:many fanout)
    end
    AgentRolloutSeedReader-->>User: Arrow batch (record count >= num_records)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 196

Comment:
**`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary**

The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it:

```python
seed_source = dd.AgentRolloutSeedSource(
    path="trace-data/chat-completions",
    format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)
```

However, the actual enum only defines two values:

```python
class AgentRolloutFormat(StrEnum):
    CLAUDE_CODE = "claude_code"
    CODEX = "codex"
```

`CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 231-241

Comment:
**Redundant path validation and overwritten `_runtime_path` assignment in model validator**

`validate_resolved_path_exists` calls `_validate_filesystem_seed_source_path(resolved_path)` for an explicit path that has already been validated by the `validate_path` field validator just above it. The double-check is harmless but unnecessary.

More subtly: the assignment `self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)` on line 239 is overwritten by the inherited `FileSystemSeedSource.model_post_init`, which pydantic v2 calls _after_ all model validators. When `self.path is None`, `model_post_init` resets `_runtime_path` back to `None`. The `runtime_path` property correctly compensates with lazy resolution, so there is no runtime bug — but the assignment in the validator is dead work for the default-path case.

Consider either removing the `_runtime_path` assignment from `validate_resolved_path_exists` (relying entirely on the property's lazy path), or overriding `model_post_init` in `AgentRolloutSeedSource` to set the cache correctly after validation completes:

```python
def model_post_init(self, __context: Any) -> None:
    default_path, _ = get_agent_rollout_format_defaults(self.format)
    resolved_path = self.path or default_path
    self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-engine/tests/engine/resources/test_seed_reader.py
Line: 946-980

Comment:
**Test doesn't actually verify that `recursive=False` prevents scanning into subdirectories**

The non-recursive reader is attached with `path=str(session_dir)`, so its `root_path` is already the leaf directory (`project-a/`). The `sessions-index.json` is located _directly_ in `session_dir`, not in a deeper subdirectory. Both `glob("sessions-index.json")` and `rglob("sessions-index.json")` find it at that level, so the recursive and non-recursive cases produce identical results and the assertion `list(non_recursive_df["project_path"]) == ["/from-nested-index"]` doesn't distinguish between the two modes.

To actually cover the `recursive=False` isolation contract, consider a setup where:
- `root_path = tmp_path` (not `session_dir`)
- `sessions-index.json` lives only in a subdirectory of `root_path`
- The non-recursive reader (pointed at `tmp_path`) should _not_ find the nested index and should fall back to `cwd`
- The recursive reader should find it

```python
# Non-recursive pointed at tmp_path should NOT pick up sessions-index.json
# nested inside project-a/ — project_path should fall back to cwd
reader_non_recursive.attach(
    AgentRolloutSeedSource(
        path=str(tmp_path),   # root is tmp_path, index is nested under project-a/
        format=AgentRolloutFormat.CLAUDE_CODE,
        file_pattern="*.jsonl",
        recursive=False,
    ),
    PlaintextResolver(),
)
```

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: "refactor: remove cha..."}

packages/data-designer-config/tests/config/test_seed_source.py

...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py

…eader Address all 6 findings from the architecture review of the agent rollout seed reader: 1. (High) Preserve manifest/hydrate split: build_manifest() now does cheap file discovery only; hydrate_row() does per-file parsing with 1:many fanout. Removes eager _normalized_records_by_locator cache. 2. (Medium) Simplify config surface: remove AgentRolloutFormatConfig hierarchy (5 classes), replace with format: AgentRolloutFormat enum. Serialized configs preserve None for path/file_pattern instead of baking in machine-specific defaults. 3. (Medium) Centralize Claude session index scanning in the reader via lazy AgentRolloutReaderContext. Respect recursive=False setting. 4. (Medium) Wrap OSError in hydrate_row() as SeedReaderError so file I/O errors don't leak past the seed-reader boundary. 5. (Medium) Make chat-completion file ingestion atomic — if any row fails to parse, the entire file is rejected. 6. (Low) Fix fallback trace_id from file_path.stem:line_number to relative_path:line_number to prevent collisions across same-stem files in different directories. Adds targeted contract tests for manifest laziness, file-count-based get_seed_dataset_size(), OSError wrapping, atomic file rejection, trace_id collision prevention, and recursive session index scanning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-19T13:32:11Z

docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py

+        choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat],
+        help="Built-in rollout format to read.",
+    )
+    parser.add_argument(
+        "--trace-dir",
+        type=Path,
+        default=None,
+        help=(
+            "Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to "
+            "~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` "
+            "requires an explicit path."
+        ),
+    )
+    parser.add_argument("--model-alias", type=str, default="nvidia-super")
+    parser.add_argument("--num-records", type=int, default=5)
+    parser.add_argument("--artifact-path", type=str, default=None)
+    parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows")
+    parser.add_argument(
+        "--preview",
+        action="store_true",
+        help="Run the recipe in preview mode and keep the generated dataset in memory.",
+    )
+    parser.add_argument(
+        "--shuffle",
+        action="store_true",
+        help="Shuffle the normalized trace rows before sampling.",
+    )
+    parser.add_argument(
+        "--partition-index",
+        type=int,
+        default=None,
+        help="Optional partition index for large trace corpora.",
+    )
+    parser.add_argument(
+        "--num-partitions",
+        type=int,
+        default=None,
+        help="Optional total number of partitions for large trace corpora.",
+    )
+    return parser
+
+
+def resolve_selection_strategy(
+    partition_index: int | None,
+    num_partitions: int | None,
+) -> dd.PartitionBlock | None:


parse_args returns an ArgumentParser, not parsed arguments

The function is named parse_args but returns the ArgumentParser object itself. Callers must chain a second .parse_args() call (parse_args().parse_args()), which is semantically surprising. The return type annotation (-> ArgumentParser) is technically accurate, but the naming convention parse_args strongly implies the returned value is the parsed Namespace.

Consider either renaming the function to build_arg_parser / create_parser, or having it return the parsed args directly:

Suggested change

choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat],

help="Built-in rollout format to read.",

)

parser.add_argument(

"--trace-dir",

type=Path,

default=None,

help=(

"Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to "

"~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` "

"requires an explicit path."

),

)

parser.add_argument("--model-alias", type=str, default="nvidia-super")

parser.add_argument("--num-records", type=int, default=5)

parser.add_argument("--artifact-path", type=str, default=None)

parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows")

parser.add_argument(

"--preview",

action="store_true",

help="Run the recipe in preview mode and keep the generated dataset in memory.",

)

parser.add_argument(

"--shuffle",

action="store_true",

help="Shuffle the normalized trace rows before sampling.",

)

parser.add_argument(

"--partition-index",

type=int,

default=None,

help="Optional partition index for large trace corpora.",

)

parser.add_argument(

"--num-partitions",

type=int,

default=None,

help="Optional total number of partitions for large trace corpora.",

)

return parser

def resolve_selection_strategy(

partition_index: int | None,

num_partitions: int | None,

) -> dd.PartitionBlock | None:

def parse_args() -> ArgumentParser:

def create_parser() -> ArgumentParser:

And updating the call site in main():

args = create_parser().parse_args()

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py Line: 411-456 Comment: **`parse_args` returns an `ArgumentParser`, not parsed arguments** The function is named `parse_args` but returns the `ArgumentParser` object itself. Callers must chain a second `.parse_args()` call (`parse_args().parse_args()`), which is semantically surprising. The return type annotation (`-> ArgumentParser`) is technically accurate, but the naming convention `parse_args` strongly implies the returned value is the parsed `Namespace`. Consider either renaming the function to `build_arg_parser` / `create_parser`, or having it return the parsed args directly: ```suggestion def parse_args() -> ArgumentParser: def create_parser() -> ArgumentParser: ``` And updating the call site in `main()`: ```python args = create_parser().parse_args() ``` How can I resolve this? If you propose a fix, please make it concise.

…stion The chat-completion format is underspecified and adds ~170 LOC of format-specific code plus test/doc overhead. Deferring it to a future PR keeps this one focused on the two well-defined agent harness formats (Claude Code and Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-19T14:00:26Z

packages/data-designer-config/src/data_designer/config/seed_source.py

+class AgentRolloutFormat(StrEnum):
+    CLAUDE_CODE = "claude_code"
+    CODEX = "codex"
+


AgentRolloutFormat missing CHAT_COMPLETION_JSONL described in PR summary

The PR description explicitly lists chat_completion_jsonl as a supported AgentRolloutFormat value and provides a full Quick Start example for it:

seed_source = dd.AgentRolloutSeedSource( path="trace-data/chat-completions", format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL, )

However, the actual enum only defines two values:

class AgentRolloutFormat(StrEnum): CLAUDE_CODE = "claude_code" CODEX = "codex"

CHAT_COMPLETION_JSONL is absent from the enum, get_agent_rollout_format_defaults, BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.

Prompt To Fix With AI

This is a comment left during a code review. Path: packages/data-designer-config/src/data_designer/config/seed_source.py Line: 196 Comment: **`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary** The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it: ```python seed_source = dd.AgentRolloutSeedSource( path="trace-data/chat-completions", format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL, ) ``` However, the actual enum only defines two values: ```python class AgentRolloutFormat(StrEnum): CLAUDE_CODE = "claude_code" CODEX = "codex" ``` `CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion. How can I resolve this? If you propose a fix, please make it concise.

eric-tramel mentioned this pull request Mar 11, 2026

feat: directory seed transforms for agent trace ingestion #390

Closed

eric-tramel force-pushed the feature/directory-seed-transforms-v1 branch from 1e62c42 to e1aa97b Compare March 12, 2026 14:01

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 4ba6373 to 6c91a06 Compare March 12, 2026 14:22

eric-tramel self-assigned this Mar 13, 2026

eric-tramel added enhancement New feature or request labels Mar 13, 2026

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 6c91a06 to 8b0999a Compare March 16, 2026 16:00

eric-tramel changed the title ~~feat: add built-in trace directory normalizers~~ feat: add built-in trace seed sources Mar 16, 2026

eric-tramel changed the base branch from feature/directory-seed-transforms-v1 to feature/filesystem-seed-readers-v1 March 16, 2026 16:00

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from dd07c95 to 909ff76 Compare March 16, 2026 17:08

andreatgretel mentioned this pull request Mar 16, 2026

feat: add built-in filesystem seed readers #421

Merged

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch 2 times, most recently from b2b95cd to 1913096 Compare March 18, 2026 02:10

eric-tramel changed the base branch from feature/filesystem-seed-readers-v1 to main March 18, 2026 12:31

eric-tramel changed the title ~~feat: add built-in trace seed sources~~ feat: add AgentRollout seed source and formats Mar 18, 2026

eric-tramel marked this pull request as ready for review March 18, 2026 13:18

eric-tramel requested a review from a team as a code owner March 18, 2026 13:18

greptile-apps bot reviewed Mar 18, 2026

View reviewed changes

packages/data-designer-config/tests/config/test_seed_source.py Show resolved Hide resolved

...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py Outdated Show resolved Hide resolved

eric-tramel changed the title ~~feat: add AgentRollout seed source and formats~~ feat: add AgentRollout seed source with lazy manifest/hydrate architecture Mar 19, 2026

eric-tramel added 9 commits March 19, 2026 09:25

feat: add built-in filesystem seed readers

cf61d45

refactor: simplify filesystem seed reader plugin hooks

dbc4f8b

fix: preserve filesystem seed source path input

dc66eab

test: add filesystem seed reader e2e coverage

5927323

feat: add built-in trace seed sources

bb01f97

docs: turn Claude trace recipe into SFT curation pipeline

97d20f6

test: drop stale seed dataset batch reader fixture coverage

ae436c6

refactor: collapse trace seed sources into AgentRollout configs

30e4fd7

docs: align rollout docs with AgentRollout naming

d0abd73

eric-tramel and others added 5 commits March 19, 2026 09:25

refactor: shorten chat completion rollout config name

17cf82d

docs: generalize rollout distillation recipe

5825ba2

fix: restore filesystem seed reader fanout compatibility

d239fc2

fix: default rollout file pattern in reader

075c825

eric-tramel force-pushed the feature/trace-directory-normalizers-v1 branch from 091ecff to 7d56607 Compare March 19, 2026 13:26

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399

feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399
eric-tramel wants to merge 15 commits intomainfrom
feature/trace-directory-normalizers-v1

eric-tramel commented Mar 11, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 18, 2026 •

edited

Loading

Confidence Score: 4/5

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot Mar 19, 2026

Uh oh!

greptile-apps bot Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eric-tramel commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dependency

What This Adds

Quick Start

What You Get

What This Unlocks

Defaults

Design

Architecture

Config

Module boundaries

Fault Tolerance

Docs and Recipe

Test Plan

Uh oh!

greptile-apps bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

eric-tramel commented Mar 11, 2026 •

edited

Loading

greptile-apps bot commented Mar 18, 2026 •

edited

Loading