Merged
Conversation
Contributor
Greptile SummaryThis PR introduces a new Key changes:
Issues found:
|
| Filename | Overview |
|---|---|
| skills/data-designer/SKILL.md | New main skill file defining the data-designer slash command. Outlines workflow selection (interactive vs. autopilot), rules, usage tips, troubleshooting, and an output template. The output template still unconditionally imports pydantic and includes the BaseModel example despite the closing note saying to only include it when needed — an inconsistency carried over from pre-review state. |
| skills/data-designer/workflows/interactive.md | New interactive workflow. Most previous review issues are addressed (large-record warning added, HTTP server removed), but step 7 retains the stale phrase "serve again" from when an HTTP server was part of the workflow, which could cause agent confusion. |
| skills/data-designer/workflows/autopilot.md | New autopilot workflow. Previous review issues (cd contamination, background server, missing large-record guard) have all been fixed. The file looks clean. |
| skills/data-designer/scripts/get_person_object_schema.py | Helper script to inspect locale-specific person schema fields. Logic is straightforward and handles missing locales cleanly. However, both imports reach into deep private internal module paths (data_designer.engine.sampling_gen.entities.*) that have no stability guarantees and will break silently if the package is reorganized. |
| skills/data-designer/references/person-sampling.md | New reference document explaining person sampler types, usage patterns, and the persona schema script. Content is accurate and well-structured. |
| skills/data-designer/references/seed-datasets.md | New reference document for seed datasets. Instructs the agent to read source code before guessing parameters and to verify dataset readability upfront. Concise and correct. |
| .claude/skills/new-sdg/SKILL.md | Adds metadata: internal: true and cleans up trailing whitespace. No functional changes to the skill instructions. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A(["/data-designer <description>"]) --> B{Mode?}
B -- "opinionated / you decide / just build it" --> C[Autopilot Workflow]
B -- default --> D[Interactive Workflow]
C --> C1[1. Learn: data-designer agent context]
C1 --> C2[2. Infer design decisions autonomously]
C2 --> C3[3. Plan columns / samplers / processors]
C3 --> C4[4. Build: write load_config_builder script]
C4 --> C5[5. Validate: data-designer validate]
C5 --> C6[6. Preview: data-designer preview --save-results\nShare file:// link]
C6 --> C7{Record count\nspecified?}
C7 -- "≤50" --> C8[Run data-designer create directly]
C7 -- ">50" --> C9[Warn + ask confirmation]
C7 -- none --> C10[Skip]
C8 & C9 & C10 --> C11[8. Present summary, ask for changes]
C11 -- changes requested --> C4
D --> D1[1. Learn: data-designer agent context]
D1 --> D2[2. Clarify: ask user questions]
D2 --> D3[3. Plan, present to user]
D3 --> D4[4. Build: write load_config_builder script]
D4 --> D5[5. Validate: data-designer validate]
D5 --> D6[6. Preview: data-designer preview --save-results\nShare file:// link]
D6 --> D7[7. Iterate: feedback loop]
D7 -- not satisfied --> D4
D7 -- satisfied --> D8[8. Finalize: give user create command]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: skills/data-designer/workflows/interactive.md
Line: 27
Comment:
**Stale "serve again" instruction**
The phrase "serve again" is a leftover from when this workflow included an HTTP server step. That server was removed (the preview step now only emits a `file://` link), so there is nothing to "serve." An agent following this instruction literally may attempt to restart an HTTP server that the workflow no longer starts.
The autopilot workflow uses "iterate" for the equivalent loop (step 8: "edit the script, re-validate, re-preview, and iterate"), which is the correct phrasing now that serving is gone.
```suggestion
7. **Iterate** — Ask the user for feedback. Edit the script, re-validate, re-preview, and repeat until they are satisfied.
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: skills/data-designer/scripts/get_person_object_schema.py
Line: 20-21
Comment:
**Imports from private internal module paths**
Both imports reach deep into the package internals:
- `data_designer.config.utils.constants` (internal `utils` subpackage)
- `data_designer.engine.sampling_gen.entities.dataset_based_person_fields` (five levels deep into an `engine` namespace)
Neither is part of a documented public API surface. If the `data-designer` package reorganizes these modules across a version bump (a common occurrence in pre-1.0 libraries), this script will fail with an `ImportError` rather than a meaningful error, and the breakage won't be caught until a user runs it.
Consider requesting that `MANAGED_ASSETS_PATH`, `PERSONA_FIELDS`, and `PII_FIELDS` be exposed via a stable public API (e.g., `data_designer.constants` or a `data_designer.sampling` namespace), or add a comment here that ties the script to a specific minimum version so consumers know when to revisit it.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "Merge branch 'main' ..."
nabinchha
previously approved these changes
Mar 18, 2026
Contributor
nabinchha
left a comment
There was a problem hiding this comment.
One question about moving/using references to docs
nabinchha
previously approved these changes
Mar 18, 2026
- Replace `cd` + bare http.server with `--directory` flag to keep CWD stable for subsequent steps - Add note to stop the background server after review - Add large-record-count warning to interactive finalize step
- Use fixed port 8741 with fallback to port 0 - Require verifying server startup from background task output - Clarify sandbox network error guidance: ask to retry without sandbox before telling user to run manually
2c92772 to
fbb11d6
Compare
Allow dropping internal/helper columns (e.g., sampled person objects) that exist solely to derive other columns, while still defaulting to keeping everything else.
Instead of defaulting to the first usable alias (which could be an embedding model), default to an alias with the appropriate generation_type for each column.
Make the check section explicitly state what to do when the needed locale is not installed: use person_from_faker instead.
Add get_person_object_schema.py script that prints PII and synthetic persona fields for a given locale's managed dataset. Update person-sampling.md to use this script instead of hardcoded field lists, and remove redundant param tables already available via agent context.
The locale install status is already printed by `data-designer agent context`, which the agent runs at the start of every workflow.
andreatgretel
approved these changes
Mar 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/data-designer) that guides agents through building synthetic datasets with the Data Designer libraryCloses #310