feat: add Data Designer skill by johnnygreco · Pull Request #434 · NVIDIA-NeMo/DataDesigner

johnnygreco · 2026-03-18T05:41:20Z

Summary

- Adds a Claude Code skill (/data-designer) that guides agents through building synthetic datasets with the Data Designer library
- Includes interactive and autopilot workflow modes, plus reference docs for person sampling and seed datasets
- Provides an output template, common pitfalls, and troubleshooting guidance

Closes #310

greptile-apps · 2026-03-18T05:44:42Z

Greptile Summary

This PR introduces a new data-designer Claude Code skill that guides agents through building synthetic datasets with the Data Designer library. It also marks all existing internal skills with metadata: internal: true and cleans up trailing whitespace in new-sdg/SKILL.md. The skill is organized into a main SKILL.md, two workflow files (interactive and autopilot), two reference documents (person sampling, seed datasets), and a helper Python script for inspecting locale-specific person schemas.

Key changes:

- New skills/data-designer/ skill with interactive and autopilot workflow modes, usage tips, common pitfalls, and an output template
- All existing .claude/skills/*/SKILL.md files receive metadata: internal: true to flag them as internal-only
- skills/data-designer/scripts/get_person_object_schema.py: a helper script that prints available PII and synthetic persona fields for a given locale
- Structured reference docs for person sampling and seed dataset configuration

Issues found:

- interactive.md step 7 retains the phrase "serve again", a stale leftover from when the workflow ran an HTTP server. That server was removed in 3ac458c per review feedback, but "serve again" was not cleaned up. An agent following this literally may attempt to start a server that the workflow no longer manages.
- get_person_object_schema.py imports from deep private module paths (data_designer.engine.sampling_gen.entities.dataset_based_person_fields, data_designer.config.utils.constants) with no public API stability guarantees; a package reorganization would silently break the script for all users.

Confidence Score: 3/5

- Safe to merge for functionality, but a stale instruction and a fragile internal-API dependency should be cleaned up first.
- Previous review feedback was largely addressed (HTTP server removed, large-record guard added, CWD stability fixed), but two clean-up items remain: (1) "serve again" in interactive step 7 is a stale reference that can mislead agents, and (2) the helper script uses private internal package paths that have no stability guarantee across data-designer versions. The changes are documentation/skill files, so production risk is low, but the stale wording is a direct regression from the "remove the server part entirely" fix.
- skills/data-designer/workflows/interactive.md (stale "serve again"), skills/data-designer/scripts/get_person_object_schema.py (internal API imports)

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/data-designer/SKILL.md | New main skill file defining the data-designer slash command. Outlines workflow selection (interactive vs. autopilot), rules, usage tips, troubleshooting, and an output template. The output template still unconditionally imports pydantic and includes the BaseModel example despite the closing note saying to only include it when needed, an inconsistency carried over from pre-review state. |
| skills/data-designer/workflows/interactive.md | New interactive workflow. Most previous review issues are addressed (large-record warning added, HTTP server removed), but step 7 retains the stale phrase "serve again" from when an HTTP server was part of the workflow, which could cause agent confusion. |
| skills/data-designer/workflows/autopilot.md | New autopilot workflow. Previous review issues (cd contamination, background server, missing large-record guard) have all been fixed. The file looks clean. |
| skills/data-designer/scripts/get_person_object_schema.py | Helper script to inspect locale-specific person schema fields. Logic is straightforward and handles missing locales cleanly. However, both imports reach into deep private internal module paths (data_designer.engine.sampling_gen.entities.*) that have no stability guarantees and will break silently if the package is reorganized. |
| skills/data-designer/references/person-sampling.md | New reference document explaining person sampler types, usage patterns, and the persona schema script. Content is accurate and well-structured. |
| skills/data-designer/references/seed-datasets.md | New reference document for seed datasets. Instructs the agent to read source code before guessing parameters and to verify dataset readability upfront. Concise and correct. |
| .claude/skills/new-sdg/SKILL.md | Adds metadata: internal: true and cleans up trailing whitespace. No functional changes to the skill instructions. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A(["/data-designer &lt;description&gt;"]) --> B{Mode?}
    B -- "opinionated / you decide / just build it" --> C[Autopilot Workflow]
    B -- default --> D[Interactive Workflow]

    C --> C1[1. Learn: data-designer agent context]
    C1 --> C2[2. Infer design decisions autonomously]
    C2 --> C3[3. Plan columns / samplers / processors]
    C3 --> C4[4. Build: write load_config_builder script]
    C4 --> C5[5. Validate: data-designer validate]
    C5 --> C6[6. Preview: data-designer preview --save-results\nShare file:// link]
    C6 --> C7{Record count\nspecified?}
    C7 -- "≤50" --> C8[Run data-designer create directly]
    C7 -- ">50" --> C9[Warn + ask confirmation]
    C7 -- none --> C10[Skip]
    C8 & C9 & C10 --> C11[8. Present summary, ask for changes]
    C11 -- changes requested --> C4

    D --> D1[1. Learn: data-designer agent context]
    D1 --> D2[2. Clarify: ask user questions]
    D2 --> D3[3. Plan, present to user]
    D3 --> D4[4. Build: write load_config_builder script]
    D4 --> D5[5. Validate: data-designer validate]
    D5 --> D6[6. Preview: data-designer preview --save-results\nShare file:// link]
    D6 --> D7[7. Iterate: feedback loop]
    D7 -- not satisfied --> D4
    D7 -- satisfied --> D8[8. Finalize: give user create command]
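
The record-count branch in the autopilot path of the flowchart (C7 through C10) can be read as a small decision function. The sketch below is only an illustration of that branch: the function name, the returned labels, and the constant name are hypothetical, and the threshold of 50 is taken from the flowchart rather than from the skill files themselves.

```python
from typing import Optional

# Threshold read off the flowchart ("≤50" vs ">50"); the name is hypothetical.
DIRECT_RUN_LIMIT = 50


def create_step_action(record_count: Optional[int]) -> str:
    """Map the user's requested record count to the autopilot action.

    Returns one of three illustrative labels:
      "skip"    - no record count was specified, so nothing is created
      "run"     - small request, run `data-designer create` directly
      "confirm" - large request, warn and ask for confirmation first
    """
    if record_count is None:
        return "skip"
    if record_count <= DIRECT_RUN_LIMIT:
        return "run"
    return "confirm"
```

All three outcomes lead to the same next node in the flowchart (present a summary and ask for changes), so the function only decides what happens before that step.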

Prompt To Fix All With AI

This is a comment left during a code review.
Path: skills/data-designer/workflows/interactive.md
Line: 27

Comment:
**Stale "serve again" instruction**

The phrase "serve again" is a leftover from when this workflow included an HTTP server step. That server was removed (the preview step now only emits a `file://` link), so there is nothing to "serve." An agent following this instruction literally may attempt to restart an HTTP server that the workflow no longer starts.

The autopilot workflow uses "iterate" for the equivalent loop (step 8: "edit the script, re-validate, re-preview, and iterate"), which is the correct phrasing now that serving is gone.

```suggestion
7. **Iterate** — Ask the user for feedback. Edit the script, re-validate, re-preview, and repeat until they are satisfied.
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: skills/data-designer/scripts/get_person_object_schema.py
Line: 20-21

Comment:
**Imports from private internal module paths**

Both imports reach deep into the package internals:
- `data_designer.config.utils.constants` (internal `utils` subpackage)
- `data_designer.engine.sampling_gen.entities.dataset_based_person_fields` (five levels deep into an `engine` namespace)

Neither is part of a documented public API surface. If the `data-designer` package reorganizes these modules across a version bump (a common occurrence in pre-1.0 libraries), this script will fail with an `ImportError` rather than a meaningful error, and the breakage won't be caught until a user runs it.

Consider requesting that `MANAGED_ASSETS_PATH`, `PERSONA_FIELDS`, and `PII_FIELDS` be exposed via a stable public API (e.g., `data_designer.constants` or a `data_designer.sampling` namespace), or add a comment here that ties the script to a specific minimum version so consumers know when to revisit it.

How can I resolve this? If you propose a fix, please make it concise.
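
One possible way to soften the failure mode described in the comment above, short of a public API, is to wrap the internal imports so that a moved module produces an actionable message. This is a minimal sketch, not the script's actual code: `load_internal_names` is a hypothetical helper, and which constant lives in which module is an assumption, since the comment names the modules and the constants but not the mapping between them.

```python
import importlib
from typing import Any, Dict, Sequence


def load_internal_names(module_path: str, names: Sequence[str]) -> Dict[str, Any]:
    """Import `names` from `module_path`, failing with an actionable message."""
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise RuntimeError(
            f"Could not import {module_path!r}. This is an internal module path "
            "and may have moved in the installed data-designer version."
        ) from exc
    # The module may still exist after a refactor while the constants move out.
    missing = [name for name in names if not hasattr(module, name)]
    if missing:
        raise RuntimeError(f"{module_path!r} no longer defines: {', '.join(missing)}")
    return {name: getattr(module, name) for name in names}


# Intended use in the helper script (assumed mapping of constants to modules):
# fields = load_internal_names(
#     "data_designer.engine.sampling_gen.entities.dataset_based_person_fields",
#     ["PERSONA_FIELDS", "PII_FIELDS"],
# )
```

This does not make the imports stable; it only turns a bare `ImportError` into a message that tells the user why the script broke.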

Last reviewed commit: "Merge branch 'main' ..."


nabinchha

One question about moving/using references to docs


Add get_person_object_schema.py script that prints PII and synthetic persona fields for a given locale's managed dataset. Update person-sampling.md to use this script instead of hardcoded field lists, and remove redundant param tables already available via agent context.


johnnygreco requested a review from a team as a code owner March 18, 2026 05:41

johnnygreco changed the title ~~feat: add Data Designer skill for Claude Code~~ feat: add Data Designer skill Mar 18, 2026

greptile-apps bot reviewed Mar 18, 2026

nabinchha previously approved these changes Mar 18, 2026

skills/data-designer/references/person-sampling.md (resolved)

johnnygreco dismissed nabinchha’s stale review via 3ac458c March 18, 2026 16:36

greptile-apps bot reviewed Mar 18, 2026

skills/data-designer/SKILL.md (resolved)

greptile-apps bot reviewed Mar 18, 2026

.claude/skills/new-sdg/SKILL.md (resolved)

greptile-apps bot reviewed Mar 18, 2026

skills/data-designer/SKILL.md (resolved)

nabinchha previously approved these changes Mar 18, 2026

johnnygreco added 16 commits March 18, 2026 13:33

add skill

3a1c934

remove quotes from hint

1583639

add internal metadata to .claude skills

47bd085

address review feedback: fix typo, clarify config_root, use dynamic port

f4a04e5

address review feedback: use --directory flag, add server cleanup note

ba3327b

- Replace `cd` + bare http.server with `--directory` flag to keep CWD stable for subsequent steps - Add note to stop the background server after review - Add large-record-count warning to interactive finalize step

improve preview server reliability and sandbox error handling

aefd072

- Use fixed port 8741 with fallback to port 0 - Require verifying server startup from background task output - Clarify sandbox network error guidance: ask to retry without sandbox before telling user to run manually
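
Commit aefd072 describes a fixed-port-with-fallback scheme for the preview server, which a81afa8 later dropped in favor of a file:// link. For readers curious what that scheme amounts to, here is a minimal standard-library sketch. It assumes the server only needs to serve static preview files from one directory; `make_preview_server` is a hypothetical name, and the skill itself instructed the agent through prose rather than through code like this.

```python
import functools
import http.server


def make_preview_server(
    directory: str, preferred_port: int = 8741
) -> http.server.ThreadingHTTPServer:
    """Bind a static-file server, preferring a fixed port.

    Passing `directory` to the handler (instead of calling os.chdir) keeps the
    process's working directory unchanged, which is the point of the
    `--directory` change in ba3327b.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    try:
        return http.server.ThreadingHTTPServer(("127.0.0.1", preferred_port), handler)
    except OSError:
        # Preferred port unavailable: port 0 asks the OS for any free port.
        return http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
```

The caller reads the port actually bound from `server.server_port`, then calls `serve_forever()` and later `shutdown()` and `server_close()`, which corresponds to the "stop the background server after review" note in ba3327b.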

ensure venv creation before installing data-designer

42b8dd5

verify server via background task output, not curl probing

6031d84

replace HTTP server with file:// link for preview, add push_to_hub guidance

a81afa8

add --dataset-name to create command, remove push_to_hub notes

4068a76

update custom column example

d6878b8

remove schema transform pitfall which is about to be fixed

acff87f

tighten agent skill: remove redundancy, add missing interactive guidance

656c6a0

clarify interactive plan step: ask for changes before generating preview

0ae610c

improve structured question tool guidance in interactive workflow

666e743

merge structured question tool guidance with UX bullet point

fbb11d6

johnnygreco force-pushed the johnny-310-data-designer-got-skill branch from 2c92772 to fbb11d6 March 18, 2026 20:33

andreatgretel reviewed Mar 18, 2026

skills/data-designer/SKILL.md (outdated, resolved)

andreatgretel reviewed Mar 18, 2026

skills/data-designer/references/person-sampling.md (resolved)

andreatgretel reviewed Mar 18, 2026

skills/data-designer/SKILL.md (resolved)

andreatgretel reviewed Mar 18, 2026

skills/data-designer/workflows/interactive.md (outdated, resolved)

soften column-dropping rule to allow dropping helper columns

b81572e

Allow dropping internal/helper columns (e.g., sampled person objects) that exist solely to derive other columns, while still defaulting to keeping everything else.

johnnygreco dismissed nabinchha’s stale review via b81572e March 18, 2026 21:21

johnnygreco added 2 commits March 18, 2026 14:27

default model alias to appropriate generation_type per column

fe70161

Instead of defaulting to the first usable alias (which could be an embedding model), default to an alias with the appropriate generation_type for each column.
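
The selection rule in fe70161 can be illustrated as follows. This is a sketch under stated assumptions: the `ModelAlias` record, the function name, and the generation-type labels used in the usage note are all hypothetical stand-ins, not Data Designer's actual configuration objects.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelAlias:
    """Hypothetical stand-in for a configured model alias."""

    alias: str
    generation_type: str  # label values used below are assumed, not the library's


def default_alias_for(
    required_type: str, aliases: Sequence[ModelAlias]
) -> Optional[str]:
    """Return the first alias whose generation_type matches the column's need.

    Returns None when nothing matches, so the caller can tell the user that
    no suitable model alias is configured instead of silently picking one.
    """
    for model in aliases:
        if model.generation_type == required_type:
            return model.alias
    return None
```

With aliases listed as an embedding model followed by a text model, taking the first usable alias would pick the embedding model for a text column; matching on `generation_type` avoids that.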

clarify missing model aliases: suggest running data-designer config

1a1ed2f

greptile-apps bot reviewed Mar 18, 2026

skills/data-designer/SKILL.md (resolved)

skills/data-designer/workflows/interactive.md (resolved)

johnnygreco added 7 commits March 18, 2026 14:33

close the loop on persona dataset locale check

8f10ea9

Make the check section explicitly state what to do when the needed locale is not installed: use person_from_faker instead.

clarify script path is relative to skill directory

d3ec31b

minor wording tweak in person-sampling reference

d5dd642

remove redundant available locales section from person-sampling ref

866ac71

The locale install status is already printed by `data-designer agent context`, which the agent runs at the start of every workflow.

tweak

285dfff

pydantic is always included with data-designer

91bd843

greptile-apps bot reviewed Mar 19, 2026

skills/data-designer/references/person-sampling.md (resolved)

imports tweak

88f7c68

andreatgretel approved these changes Mar 19, 2026

Merge branch 'main' into johnny-310-data-designer-got-skill

811cf76

johnnygreco merged commit 96d1956 into main Mar 19, 2026
47 checks passed

greptile-apps bot mentioned this pull request Mar 19, 2026

docs: agent-assisted development plan for DataDesigner #428

Open

johnnygreco merged 28 commits into main from johnny-310-data-designer-got-skill


3 participants
