feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399
feat: add AgentRollout seed source with lazy manifest/hydrate architecture#399eric-tramel wants to merge 15 commits intomainfrom
Conversation
1e62c42 to
e1aa97b
Compare
4ba6373 to
6c91a06
Compare
6c91a06 to
8b0999a
Compare
dd07c95 to
909ff76
Compare
b2b95cd to
1913096
Compare
Greptile SummaryThis PR introduces Key additions:
Issues found:
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_seed_parser.py | New file: shared normalization helpers for all agent rollout formats. Well-structured with clear error handling, safe coerce_optional_str wrappers, and properly isolated helpers. No blocking issues. |
| packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py | New file: per-format handlers for Claude Code and Codex. CHAT_COMPLETION_JSONL is mentioned in the PR description but has no corresponding handler here. Minor session_index None-guard redundancy in ClaudeCode handler. |
| packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py | Adds AgentRolloutSeedReader with correct manifest/hydrate split, lazy reader_context caching, and SeedReaderError wrapping for OSErrors. Logic is sound. |
| packages/data-designer-config/src/data_designer/config/seed_source.py | Adds AgentRolloutSeedSource with optional path/file_pattern, model validator for resolved-path existence check, and runtime_path property with lazy default resolution. validate_resolved_path_exists redundantly re-validates an explicit path already validated by the field validator, and its _runtime_path assignment is overwritten by the inherited model_post_init. |
| packages/data-designer-engine/tests/engine/resources/test_seed_reader.py | Good new coverage for manifest laziness, OSError wrapping, file-count sizing, and session-index scanning. The test_claude_session_index_scanning_respects_recursive_false test doesn't actually verify that recursive=False prevents scanning into subdirectories of root_path. |
| docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py | Comprehensive SFT distillation recipe with judge scoring and partition support. parse_args() naming is already flagged in prior threads. |
| packages/data-designer-config/tests/config/test_seed_source.py | Adds AgentRolloutSeedSource config tests: round-trip serialization, default path resolution, and file-pattern validation. Coverage looks solid. |
| packages/data-designer/tests/interface/test_data_designer.py | Adds interface-level e2e tests. No issues observed in the diff. |
Sequence Diagram
sequenceDiagram
participant User
participant AgentRolloutSeedSource
participant AgentRolloutSeedReader
participant FormatHandler
participant Parser
User->>AgentRolloutSeedSource: AgentRolloutSeedSource(format=CLAUDE_CODE)
AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_path (field validator)
AgentRolloutSeedSource->>AgentRolloutSeedSource: validate_resolved_path_exists (model validator)
AgentRolloutSeedSource->>AgentRolloutSeedSource: model_post_init (resets _runtime_path if path=None)
User->>AgentRolloutSeedReader: attach(source, resolver)
User->>AgentRolloutSeedReader: get_seed_dataset_size()
AgentRolloutSeedReader->>FormatHandler: get_format_handler()
AgentRolloutSeedReader->>AgentRolloutSeedReader: build_manifest() — file discovery only, no reads
AgentRolloutSeedReader-->>User: file count (manifest rows)
User->>AgentRolloutSeedReader: create_batch_reader().read_next_batch()
AgentRolloutSeedReader->>AgentRolloutSeedReader: _get_reader_context() — loads Claude session index once
loop for each manifest row
AgentRolloutSeedReader->>FormatHandler: hydrate_row(manifest_row)
FormatHandler->>Parser: parse_file(root_path, relative_path, reader_context)
Parser->>Parser: load_jsonl_rows() — reads file
Parser->>Parser: normalize messages (Claude/Codex specific)
Parser->>Parser: build_agent_rollout_record()
Parser-->>FormatHandler: list[NormalizedAgentRolloutRecord]
FormatHandler-->>AgentRolloutSeedReader: list[dict] (1:many fanout)
end
AgentRolloutSeedReader-->>User: Arrow batch (record count >= num_records)
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 196
Comment:
**`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary**
The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it:
```python
seed_source = dd.AgentRolloutSeedSource(
path="trace-data/chat-completions",
format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)
```
However, the actual enum only defines two values:
```python
class AgentRolloutFormat(StrEnum):
CLAUDE_CODE = "claude_code"
CODEX = "codex"
```
`CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 231-241
Comment:
**Redundant path validation and overwritten `_runtime_path` assignment in model validator**
`validate_resolved_path_exists` calls `_validate_filesystem_seed_source_path(resolved_path)` for an explicit path that has already been validated by the `validate_path` field validator just above it. The double-check is harmless but unnecessary.
More subtly: the assignment `self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)` on line 239 is overwritten by the inherited `FileSystemSeedSource.model_post_init`, which pydantic v2 calls _after_ all model validators. When `self.path is None`, `model_post_init` resets `_runtime_path` back to `None`. The `runtime_path` property correctly compensates with lazy resolution, so there is no runtime bug — but the assignment in the validator is dead work for the default-path case.
Consider either removing the `_runtime_path` assignment from `validate_resolved_path_exists` (relying entirely on the property's lazy path), or overriding `model_post_init` in `AgentRolloutSeedSource` to set the cache correctly after validation completes:
```python
def model_post_init(self, __context: Any) -> None:
default_path, _ = get_agent_rollout_format_defaults(self.format)
resolved_path = self.path or default_path
self._runtime_path = _resolve_filesystem_runtime_path(resolved_path)
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/data-designer-engine/tests/engine/resources/test_seed_reader.py
Line: 946-980
Comment:
**Test doesn't actually verify that `recursive=False` prevents scanning into subdirectories**
The non-recursive reader is attached with `path=str(session_dir)`, so its `root_path` is already the leaf directory (`project-a/`). The `sessions-index.json` is located _directly_ in `session_dir`, not in a deeper subdirectory. Both `glob("sessions-index.json")` and `rglob("sessions-index.json")` find it at that level, so the recursive and non-recursive cases produce identical results and the assertion `list(non_recursive_df["project_path"]) == ["/from-nested-index"]` doesn't distinguish between the two modes.
To actually cover the `recursive=False` isolation contract, consider a setup where:
- `root_path = tmp_path` (not `session_dir`)
- `sessions-index.json` lives only in a subdirectory of `root_path`
- The non-recursive reader (pointed at `tmp_path`) should _not_ find the nested index and should fall back to `cwd`
- The recursive reader should find it
```python
# Non-recursive pointed at tmp_path should NOT pick up sessions-index.json
# nested inside project-a/ — project_path should fall back to cwd
reader_non_recursive.attach(
AgentRolloutSeedSource(
path=str(tmp_path), # root is tmp_path, index is nested under project-a/
format=AgentRolloutFormat.CLAUDE_CODE,
file_pattern="*.jsonl",
recursive=False,
),
PlaintextResolver(),
)
```
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "refactor: remove cha..."
...ges/data-designer-engine/src/data_designer/engine/resources/agent_rollout_format_handlers.py
Outdated
Show resolved
Hide resolved
…eader Address all 6 findings from the architecture review of the agent rollout seed reader: 1. (High) Preserve manifest/hydrate split: build_manifest() now does cheap file discovery only; hydrate_row() does per-file parsing with 1:many fanout. Removes eager _normalized_records_by_locator cache. 2. (Medium) Simplify config surface: remove AgentRolloutFormatConfig hierarchy (5 classes), replace with format: AgentRolloutFormat enum. Serialized configs preserve None for path/file_pattern instead of baking in machine-specific defaults. 3. (Medium) Centralize Claude session index scanning in the reader via lazy AgentRolloutReaderContext. Respect recursive=False setting. 4. (Medium) Wrap OSError in hydrate_row() as SeedReaderError so file I/O errors don't leak past the seed-reader boundary. 5. (Medium) Make chat-completion file ingestion atomic — if any row fails to parse, the entire file is rejected. 6. (Low) Fix fallback trace_id from file_path.stem:line_number to relative_path:line_number to prevent collisions across same-stem files in different directories. Adds targeted contract tests for manifest laziness, file-count-based get_seed_dataset_size(), OSError wrapping, atomic file rejection, trace_id collision prevention, and recursive session index scanning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
091ecff to
7d56607
Compare
| choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat], | ||
| help="Built-in rollout format to read.", | ||
| ) | ||
| parser.add_argument( | ||
| "--trace-dir", | ||
| type=Path, | ||
| default=None, | ||
| help=( | ||
| "Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to " | ||
| "~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` " | ||
| "requires an explicit path." | ||
| ), | ||
| ) | ||
| parser.add_argument("--model-alias", type=str, default="nvidia-super") | ||
| parser.add_argument("--num-records", type=int, default=5) | ||
| parser.add_argument("--artifact-path", type=str, default=None) | ||
| parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows") | ||
| parser.add_argument( | ||
| "--preview", | ||
| action="store_true", | ||
| help="Run the recipe in preview mode and keep the generated dataset in memory.", | ||
| ) | ||
| parser.add_argument( | ||
| "--shuffle", | ||
| action="store_true", | ||
| help="Shuffle the normalized trace rows before sampling.", | ||
| ) | ||
| parser.add_argument( | ||
| "--partition-index", | ||
| type=int, | ||
| default=None, | ||
| help="Optional partition index for large trace corpora.", | ||
| ) | ||
| parser.add_argument( | ||
| "--num-partitions", | ||
| type=int, | ||
| default=None, | ||
| help="Optional total number of partitions for large trace corpora.", | ||
| ) | ||
| return parser | ||
|
|
||
|
|
||
| def resolve_selection_strategy( | ||
| partition_index: int | None, | ||
| num_partitions: int | None, | ||
| ) -> dd.PartitionBlock | None: |
There was a problem hiding this comment.
parse_args returns an ArgumentParser, not parsed arguments
The function is named parse_args but returns the ArgumentParser object itself. Callers must chain a second .parse_args() call (parse_args().parse_args()), which is semantically surprising. The return type annotation (-> ArgumentParser) is technically accurate, but the naming convention parse_args strongly implies the returned value is the parsed Namespace.
Consider either renaming the function to build_arg_parser / create_parser, or having it return the parsed args directly:
| choices=[rollout_format.value for rollout_format in dd.AgentRolloutFormat], | |
| help="Built-in rollout format to read.", | |
| ) | |
| parser.add_argument( | |
| "--trace-dir", | |
| type=Path, | |
| default=None, | |
| help=( | |
| "Optional directory containing rollout JSONL files. When omitted, `claude_code` defaults to " | |
| "~/.claude/projects and `codex` defaults to ~/.codex/sessions. `chat_completion_jsonl` " | |
| "requires an explicit path." | |
| ), | |
| ) | |
| parser.add_argument("--model-alias", type=str, default="nvidia-super") | |
| parser.add_argument("--num-records", type=int, default=5) | |
| parser.add_argument("--artifact-path", type=str, default=None) | |
| parser.add_argument("--dataset-name", type=str, default="agent_rollout_trace_workflows") | |
| parser.add_argument( | |
| "--preview", | |
| action="store_true", | |
| help="Run the recipe in preview mode and keep the generated dataset in memory.", | |
| ) | |
| parser.add_argument( | |
| "--shuffle", | |
| action="store_true", | |
| help="Shuffle the normalized trace rows before sampling.", | |
| ) | |
| parser.add_argument( | |
| "--partition-index", | |
| type=int, | |
| default=None, | |
| help="Optional partition index for large trace corpora.", | |
| ) | |
| parser.add_argument( | |
| "--num-partitions", | |
| type=int, | |
| default=None, | |
| help="Optional total number of partitions for large trace corpora.", | |
| ) | |
| return parser | |
| def resolve_selection_strategy( | |
| partition_index: int | None, | |
| num_partitions: int | None, | |
| ) -> dd.PartitionBlock | None: | |
| def parse_args() -> ArgumentParser: | |
| def create_parser() -> ArgumentParser: |
And updating the call site in main():
args = create_parser().parse_args()
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/assets/recipes/trace_ingestion/agent_rollout_distillation.py
Line: 411-456
Comment:
**`parse_args` returns an `ArgumentParser`, not parsed arguments**
The function is named `parse_args` but returns the `ArgumentParser` object itself. Callers must chain a second `.parse_args()` call (`parse_args().parse_args()`), which is semantically surprising. The return type annotation (`-> ArgumentParser`) is technically accurate, but the naming convention `parse_args` strongly implies the returned value is the parsed `Namespace`.
Consider either renaming the function to `build_arg_parser` / `create_parser`, or having it return the parsed args directly:
```suggestion
def parse_args() -> ArgumentParser:
def create_parser() -> ArgumentParser:
```
And updating the call site in `main()`:
```python
args = create_parser().parse_args()
```
How can I resolve this? If you propose a fix, please make it concise.
…stion The chat-completion format is underspecified and adds ~170 LOC of format-specific code plus test/doc overhead. Deferring it to a future PR keeps this one focused on the two well-defined agent harness formats (Claude Code and Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| class AgentRolloutFormat(StrEnum): | ||
| CLAUDE_CODE = "claude_code" | ||
| CODEX = "codex" | ||
|
|
There was a problem hiding this comment.
AgentRolloutFormat missing CHAT_COMPLETION_JSONL described in PR summary
The PR description explicitly lists chat_completion_jsonl as a supported AgentRolloutFormat value and provides a full Quick Start example for it:
seed_source = dd.AgentRolloutSeedSource(
path="trace-data/chat-completions",
format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)However, the actual enum only defines two values:
class AgentRolloutFormat(StrEnum):
CLAUDE_CODE = "claude_code"
CODEX = "codex"CHAT_COMPLETION_JSONL is absent from the enum, get_agent_rollout_format_defaults, BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/seed_source.py
Line: 196
Comment:
**`AgentRolloutFormat` missing `CHAT_COMPLETION_JSONL` described in PR summary**
The PR description explicitly lists `chat_completion_jsonl` as a supported `AgentRolloutFormat` value and provides a full Quick Start example for it:
```python
seed_source = dd.AgentRolloutSeedSource(
path="trace-data/chat-completions",
format=dd.AgentRolloutFormat.CHAT_COMPLETION_JSONL,
)
```
However, the actual enum only defines two values:
```python
class AgentRolloutFormat(StrEnum):
CLAUDE_CODE = "claude_code"
CODEX = "codex"
```
`CHAT_COMPLETION_JSONL` is absent from the enum, `get_agent_rollout_format_defaults`, `BUILTIN_AGENT_ROLLOUT_FORMAT_HANDLERS`, and all test files. If the format was intentionally deferred from this PR, the PR description and any user-facing docs referencing it should be updated to avoid confusion.
How can I resolve this? If you propose a fix, please make it concise.
Summary
Add a single
AgentRolloutSeedSourceandAgentRolloutSeedReaderthat normalize agent rollout traces from multiple formats (Claude Code, Codex) into a common row shape for use in prompts, expressions, and downstream curation workflows.Built on top of the
FileSystemSeedReadermanifest/hydrate architecture from #421, with the manifest phase doing cheap file discovery and the hydrate phase doing per-file parsing with 1:many fanout.Dependency
What This Adds
AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE)— configAgentRolloutSeedReader— lazy manifest/hydrate readerAgentRolloutFormatenum —claude_code,codexis_handled_file()/parse_file()interfaceQuick Start
Claude Code with the built-in default path:
Codex with the built-in default path:
You can override the path explicitly for any format:
What You Get
Each provider-specific rollout file is normalized into a shared seed row shape:
trace_idsource_kindclaude_codeorcodex.source_pathroot_session_idagent_idis_sidechaincwdproject_pathgit_branchstarted_atended_atmessagesmessage_counttool_call_countfinal_assistant_messagesource_metaWhat This Unlocks
Once attached as a seed dataset, rollout rows can drive prompt- and expression-based generation directly:
This PR also includes a generic rollout-distillation recipe that turns rollout rows into SFT candidates:
Defaults
AgentRolloutSeedSource(format=AgentRolloutFormat.CLAUDE_CODE)defaults to~/.claude/projectsAgentRolloutSeedSource(format=AgentRolloutFormat.CODEX)defaults to~/.codex/sessionsfile_patternto*.jsonlpathandfile_patternstayNonein serialized form when omitted (no baked-in machine-specific defaults)Design
Architecture
build_manifest()does cheap file discovery only — no file reads, no JSON parsinghydrate_row()does per-file parsing with 1:many fanout (one file can produce many normalized rows)get_seed_dataset_size()returns the file count (manifest row count), not the parsed record countHydratingSeedReaderBatchReaderhandles fanout transparently —num_recordsinDataDesigner.create()works correctly because the engine fetches batches until the target record count is metConfig
AgentRolloutSeedSourceusesformat: AgentRolloutFormat(a plain enum) instead of a nested format config hierarchyresolved_file_patternandget_agent_rollout_format_defaults(), keeping serialized configs declarativeModule boundaries
parse_file()entrypoint (per-file, not per-directory)get_matching_relative_paths+is_handled_file), manifest construction, shared context (Claude session index), and error normalization intoSeedReaderErrorrecursivesetting on the sourceagent_rollout_seed_parser.pyagent_rollout_format_handlers.pyFault Tolerance
OSErrorduring hydration is caught and wrapped asSeedReaderErrorDocs and Recipe
AgentRolloutSeedSourceand theformat=APIagent_rollout_distillation.pyrecipe driven by--formatand--trace-dirtrace_digest, emits a standalonesft_record, runs ansft_quality_judge_result, and computesrecommended_for_sftTest Plan
make check-all— all lint and format checks passmake test— all tests pass (config + engine + interface)get_seed_dataset_size(), OSError wrapping, Claude session index recursive settinguv run agent_rollout_distillation.py --format claude_code --num-records 2 --preview🤖 Generated with Claude Code