feat: add Data Designer skill #434

Merged
johnnygreco merged 28 commits into main from johnny-310-data-designer-got-skill
Mar 19, 2026

@johnnygreco
Contributor

Summary

  • Adds a Claude Code skill (/data-designer) that guides agents through building synthetic datasets with the Data Designer library
  • Includes interactive and autopilot workflow modes, plus reference docs for person sampling and seed datasets
  • Provides an output template, common pitfalls, and troubleshooting guidance

Closes #310

johnnygreco requested a review from a team as a code owner on March 18, 2026
johnnygreco changed the title from "feat: add Data Designer skill for Claude Code" to "feat: add Data Designer skill" on Mar 18, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR introduces a new data-designer Claude Code skill that guides agents through building synthetic datasets with the Data Designer library. It also marks all existing internal skills with metadata: internal: true and cleans up trailing whitespace in new-sdg/SKILL.md. The skill is organized into a main SKILL.md, two workflow files (interactive and autopilot), two reference documents (person sampling, seed datasets), and a helper Python script for inspecting locale-specific person schemas.

Key changes:

  • New skills/data-designer/ skill with interactive and autopilot workflow modes, usage tips, common pitfalls, and an output template
  • All existing .claude/skills/*/SKILL.md files receive metadata: internal: true to flag them as internal-only
  • skills/data-designer/scripts/get_person_object_schema.py — a helper script that prints available PII and synthetic persona fields for a given locale
  • Structured reference docs for person sampling and seed dataset configuration

Issues found:

  • interactive.md step 7 retains the phrase "serve again" — a stale leftover from when the workflow ran an HTTP server. That server was removed in 3ac458c per review feedback, but "serve again" was not cleaned up. An agent following this literally may attempt to start a server that the workflow no longer manages.
  • get_person_object_schema.py imports from deep private module paths (data_designer.engine.sampling_gen.entities.dataset_based_person_fields, data_designer.config.utils.constants) with no public API stability guarantees; a package reorganization would silently break the script for all users.

Confidence Score: 3/5

  • Safe to merge for functionality, but a stale instruction and a fragile internal-API dependency should be cleaned up first.
  • Previous review feedback was largely addressed (HTTP server removed, large-record guard added, CWD stability fixed), but two clean-up items remain: (1) "serve again" in interactive step 7 is a stale reference that can mislead agents, and (2) the helper script uses private internal package paths that have no stability guarantee across data-designer versions. The changes are documentation/skill files, so production risk is low, but the stale wording is a direct regression from the "remove the server part entirely" fix.
  • skills/data-designer/workflows/interactive.md (stale "serve again"), skills/data-designer/scripts/get_person_object_schema.py (internal API imports)

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/data-designer/SKILL.md | New main skill file defining the data-designer slash command. Outlines workflow selection (interactive vs. autopilot), rules, usage tips, troubleshooting, and an output template. The output template still unconditionally imports pydantic and includes the BaseModel example despite the closing note saying to only include it when needed, an inconsistency carried over from the pre-review state. |
| skills/data-designer/workflows/interactive.md | New interactive workflow. Most previous review issues are addressed (large-record warning added, HTTP server removed), but step 7 retains the stale phrase "serve again" from when an HTTP server was part of the workflow, which could cause agent confusion. |
| skills/data-designer/workflows/autopilot.md | New autopilot workflow. Previous review issues (cd contamination, background server, missing large-record guard) have all been fixed. The file looks clean. |
| skills/data-designer/scripts/get_person_object_schema.py | Helper script to inspect locale-specific person schema fields. Logic is straightforward and handles missing locales cleanly. However, both imports reach into deep private internal module paths (data_designer.engine.sampling_gen.entities.*) that have no stability guarantees and will break silently if the package is reorganized. |
| skills/data-designer/references/person-sampling.md | New reference document explaining person sampler types, usage patterns, and the persona schema script. Content is accurate and well-structured. |
| skills/data-designer/references/seed-datasets.md | New reference document for seed datasets. Instructs the agent to read source code before guessing parameters and to verify dataset readability upfront. Concise and correct. |
| .claude/skills/new-sdg/SKILL.md | Adds metadata: internal: true and cleans up trailing whitespace. No functional changes to the skill instructions. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A(["/data-designer <description>"]) --> B{Mode?}
    B -- "opinionated / you decide / just build it" --> C[Autopilot Workflow]
    B -- default --> D[Interactive Workflow]

    C --> C1[1. Learn: data-designer agent context]
    C1 --> C2[2. Infer design decisions autonomously]
    C2 --> C3[3. Plan columns / samplers / processors]
    C3 --> C4[4. Build: write load_config_builder script]
    C4 --> C5[5. Validate: data-designer validate]
    C5 --> C6[6. Preview: data-designer preview --save-results\nShare file:// link]
    C6 --> C7{Record count\nspecified?}
    C7 -- "≤50" --> C8[Run data-designer create directly]
    C7 -- ">50" --> C9[Warn + ask confirmation]
    C7 -- none --> C10[Skip]
    C8 & C9 & C10 --> C11[8. Present summary, ask for changes]
    C11 -- changes requested --> C4

    D --> D1[1. Learn: data-designer agent context]
    D1 --> D2[2. Clarify: ask user questions]
    D2 --> D3[3. Plan, present to user]
    D3 --> D4[4. Build: write load_config_builder script]
    D4 --> D5[5. Validate: data-designer validate]
    D5 --> D6[6. Preview: data-designer preview --save-results\nShare file:// link]
    D6 --> D7[7. Iterate: feedback loop]
    D7 -- not satisfied --> D4
    D7 -- satisfied --> D8[8. Finalize: give user create command]
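
The record-count branch in the autopilot path (nodes C7 to C10) can be read as a three-way decision. The sketch below is only an illustration of that branch; the function name `next_action` and the returned labels are hypothetical and do not come from the skill files, while the threshold of 50 is taken from the flowchart.

```python
from typing import Optional

# Threshold taken from the flowchart's "≤50" and ">50" edges.
DIRECT_RUN_LIMIT = 50


def next_action(record_count: Optional[int]) -> str:
    """Map a requested record count to the step that follows the preview.

    The return labels are hypothetical names for the flowchart nodes:
    "skip" (C10), "create" (C8), and "confirm" (C9).
    """
    if record_count is None:
        # No record count was specified, so generation is skipped.
        return "skip"
    if record_count <= DIRECT_RUN_LIMIT:
        # Small jobs run `data-designer create` directly.
        return "create"
    # Larger jobs warn the user and wait for confirmation first.
    return "confirm"
```

All three outcomes then converge on the summary step (C11), which is why the decision is kept separate from whatever runs the command.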
Prompt To Fix All With AI
This is a comment left during a code review.
Path: skills/data-designer/workflows/interactive.md
Line: 27

Comment:
**Stale "serve again" instruction**

The phrase "serve again" is a leftover from when this workflow included an HTTP server step. That server was removed (the preview step now only emits a `file://` link), so there is nothing to "serve." An agent following this instruction literally may attempt to restart an HTTP server that the workflow no longer starts.

The autopilot workflow uses "iterate" for the equivalent loop (step 8: "edit the script, re-validate, re-preview, and iterate"), which is the correct phrasing now that serving is gone.

```suggestion
7. **Iterate** — Ask the user for feedback. Edit the script, re-validate, re-preview, and repeat until they are satisfied.
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: skills/data-designer/scripts/get_person_object_schema.py
Line: 20-21

Comment:
**Imports from private internal module paths**

Both imports reach deep into the package internals:
- `data_designer.config.utils.constants` (internal `utils` subpackage)
- `data_designer.engine.sampling_gen.entities.dataset_based_person_fields` (five levels deep into an `engine` namespace)

Neither is part of a documented public API surface. If the `data-designer` package reorganizes these modules across a version bump (a common occurrence in pre-1.0 libraries), this script will fail with an `ImportError` rather than a meaningful error, and the breakage won't be caught until a user runs it.

Consider requesting that `MANAGED_ASSETS_PATH`, `PERSONA_FIELDS`, and `PII_FIELDS` be exposed via a stable public API (e.g., `data_designer.constants` or a `data_designer.sampling` namespace), or add a comment here that ties the script to a specific minimum version so consumers know when to revisit it.

How can I resolve this? If you propose a fix, please make it concise.
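
One way to soften the failure mode described in the comment above, short of a public API, is to wrap the internal imports so that a reorganization produces an actionable message. This is a minimal sketch, not the script's actual code: the helper `load_internal_names` is hypothetical, and which module defines which constant is an assumption.

```python
import importlib
from typing import Any, Dict, Iterable


def load_internal_names(module_path: str, names: Iterable[str]) -> Dict[str, Any]:
    """Import `names` from `module_path`, failing with a descriptive error.

    Hypothetical helper: it only wraps importlib so that a moved or renamed
    internal module is reported clearly instead of as a bare ImportError.
    """
    wanted = list(names)
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise RuntimeError(
            f"Could not import {module_path!r}. This is an internal module path "
            "and may have moved in the installed package version."
        ) from exc
    missing = [name for name in wanted if not hasattr(module, name)]
    if missing:
        raise RuntimeError(
            f"{module_path!r} no longer defines: {', '.join(missing)}"
        )
    return {name: getattr(module, name) for name in wanted}


# Possible usage in the script (the module-to-constant mapping is assumed):
# fields = load_internal_names(
#     "data_designer.engine.sampling_gen.entities.dataset_based_person_fields",
#     ["PERSONA_FIELDS", "PII_FIELDS"],
# )
```

This does not remove the dependency on internal paths; it only makes the breakage visible at the point of import with a message that names the module that moved.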

Last reviewed commit: "Merge branch 'main' ..."

nabinchha
nabinchha previously approved these changes Mar 18, 2026
Contributor

nabinchha left a comment

One question about moving/using references to docs

- Replace `cd` + bare http.server with `--directory` flag to keep CWD
  stable for subsequent steps
- Add note to stop the background server after review
- Add large-record-count warning to interactive finalize step
- Use fixed port 8741 with fallback to port 0
- Require verifying server startup from background task output
- Clarify sandbox network error guidance: ask to retry without
  sandbox before telling user to run manually
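
For context, the first two server-related bullets in this commit message (serve a directory without changing CWD, and prefer port 8741 with a fallback to port 0) could look roughly like the sketch below when written with Python's standard library. This is an illustration only: the function name `make_preview_server` is hypothetical, and per the review summary the server step was later removed from the workflow entirely.

```python
import functools
import http.server

# Fixed port named in the commit message.
PREFERRED_PORT = 8741


def make_preview_server(
    directory: str, port: int = PREFERRED_PORT
) -> http.server.ThreadingHTTPServer:
    """Create (but do not start) a static file server for `directory`.

    Passing `directory` to the handler avoids calling os.chdir, so the
    caller's working directory stays stable for later steps.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    try:
        return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    except OSError:
        # The preferred port is unavailable: port 0 lets the OS pick a free one.
        return http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
```

The actual port is available as `server.server_address[1]`; a caller would run `server.serve_forever()` in a background thread and call `server.shutdown()` and `server.server_close()` after review, which matches the "stop the background server" bullet.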
johnnygreco force-pushed the johnny-310-data-designer-got-skill branch from 2c92772 to fbb11d6 on March 18, 2026 20:33

Allow dropping internal/helper columns (e.g., sampled person objects)
that exist solely to derive other columns, while still defaulting to
keeping everything else.

Instead of defaulting to the first usable alias (which could be an
embedding model), default to an alias with the appropriate
generation_type for each column.

Make the check section explicitly state what to do when the needed
locale is not installed: use person_from_faker instead.

Add get_person_object_schema.py script that prints PII and synthetic
persona fields for a given locale's managed dataset. Update
person-sampling.md to use this script instead of hardcoded field lists,
and remove redundant param tables already available via agent context.

The locale install status is already printed by `data-designer agent context`,
which the agent runs at the start of every workflow.
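
The alias-selection change described in the commit messages above (pick an alias whose generation_type fits the column instead of the first usable alias) amounts to a filtered lookup. The sketch below assumes a hypothetical data shape: a list of dicts with `alias` and `generation_type` keys, and illustrative type values. None of these names are confirmed by the skill files beyond the term generation_type itself.

```python
from typing import Dict, List, Optional


def pick_default_alias(
    aliases: List[Dict[str, str]], needed_type: str
) -> Optional[str]:
    """Return the first alias whose generation_type matches `needed_type`.

    `aliases` is a hypothetical structure: each entry has an "alias" name
    and a "generation_type" label. Returns None when nothing matches.
    """
    for entry in aliases:
        if entry.get("generation_type") == needed_type:
            return entry["alias"]
    return None


# Illustrative values only; real alias names and types may differ.
example_aliases = [
    {"alias": "embedder", "generation_type": "embedding"},
    {"alias": "writer", "generation_type": "chat-completion"},
]
```

With this input, taking the first usable alias would select the embedding model for a text column, which is the situation the commit message describes avoiding.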
johnnygreco merged commit 96d1956 into main on Mar 19, 2026
47 checks passed

3 participants