feat: add Data Designer skill #434
Merged
Changes from all commits (28 commits):
All 28 commits are by johnnygreco:

- `3a1c934` add skill
- `1583639` remove quotes from hint
- `47bd085` add internal metadata to .claude skills
- `f4a04e5` address review feedback: fix typo, clarify config_root, use dynamic port
- `ba3327b` address review feedback: use --directory flag, add server cleanup note
- `aefd072` improve preview server reliability and sandbox error handling
- `42b8dd5` ensure venv creation before installing data-designer
- `6031d84` verify server via background task output, not curl probing
- `a81afa8` replace HTTP server with file:// link for preview, add push_to_hub gu…
- `4068a76` add --dataset-name to create command, remove push_to_hub notes
- `d6878b8` update custom column example
- `acff87f` remove schema transform pitfall which is about to be fixed
- `656c6a0` tighten agent skill: remove redundancy, add missing interactive guidance
- `0ae610c` clarify interactive plan step: ask for changes before generating preview
- `666e743` improve structured question tool guidance in interactive workflow
- `fbb11d6` merge structured question tool guidance with UX bullet point
- `b81572e` soften column-dropping rule to allow dropping helper columns
- `fe70161` default model alias to appropriate generation_type per column
- `1a1ed2f` clarify missing model aliases: suggest running data-designer config
- `8f10ea9` close the loop on persona dataset locale check
- `05d4c24` add locale schema script and simplify person-sampling reference
- `d3ec31b` clarify script path is relative to skill directory
- `d5dd642` minor wording tweak in person-sampling reference
- `866ac71` remove redundant available locales section from person-sampling ref
- `285dfff` tweak
- `91bd843` pydantic is always included with data-designer
- `88f7c68` imports tweak
- `811cf76` Merge branch 'main' into johnny-310-data-designer-got-skill
**`SKILL.md`** (new file, 91 lines):

````markdown
---
name: data-designer
description: Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.
argument-hint: [describe the dataset you want to generate]
---

# Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

# Goal

Build a synthetic dataset using the Data Designer library that matches this description:

$ARGUMENTS

# Workflow

Use **Autopilot** mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use **Interactive** mode (default).

Read **only** the workflow file that matches the selected mode, then follow it:

- **Interactive** → read `workflows/interactive.md`
- **Autopilot** → read `workflows/autopilot.md`

# Rules

- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.

# Usage Tips and Common Pitfalls

- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`.
- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.

# Troubleshooting

- **`data-designer` command not found:** If no virtual environment exists, create one first (`python -m venv .venv && source .venv/bin/activate`), then install (`pip install data-designer`). If a virtual environment already exists, activate it and verify the package is installed.
- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.

# Output Template

Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#     "data-designer",  # always required
#     "pydantic",  # only if this script imports from pydantic
#     # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.
````
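The judge-score pitfall in the tips above is easiest to see with a plain dict that mirrors the structure `LLMJudgeColumnConfig` is described as producing (the column and score names here are made up for illustration):

```python
# Shape produced for a judge column named "quality" with a single score
# named "correctness" (names illustrative)
row = {
    "quality": {
        "correctness": {"reasoning": "Answer matches the source.", "score": 4},
    },
}

# Jinja2 resolves dotted lookups on dicts via item access, so the template
# {{ quality.correctness.score }} evaluates to:
score = row["quality"]["correctness"]["score"]
print(score)  # -> 4

# whereas {{ quality.correctness }} evaluates to the whole nested dict:
full = row["quality"]["correctness"]
print(full)  # reasoning + score, not just the number
```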
**`references/person-sampling.md`** (new file, 46 lines):

````markdown
# Person Sampling Reference

## Sampler types

Prefer `"person"` when the locale is downloaded — it provides census-grounded demographics and optional personality traits. Fall back to `"person_from_faker"` when the locale isn't available.

| `sampler_type` | Params class | When to use |
| --- | --- | --- |
| `"person"` | `PersonSamplerParams` | **Preferred.** Locale downloaded to `~/.data-designer/managed-assets/datasets/` by default. |
| `"person_from_faker"` | `PersonFromFakerSamplerParams` | Fallback when locale not downloaded. Basic names/addresses via Faker, not demographically accurate. |

## Usage

The sampled person column is a nested dict. You can keep it as-is in the final dataset, or set `drop=True` to remove it and extract only the fields you need via `ExpressionColumnConfig`:

```python
# Keep the full person dict in the output
config_builder.add_column(dd.SamplerColumnConfig(
    name="person", sampler_type="person",
    params=dd.PersonSamplerParams(locale="en_US"),
))

# Or drop it and extract specific fields
config_builder.add_column(dd.SamplerColumnConfig(
    name="person", sampler_type="person",
    params=dd.PersonSamplerParams(locale="en_US"), drop=True,
))
config_builder.add_column(dd.ExpressionColumnConfig(
    name="full_name",
    expr="{{ person.first_name }} {{ person.last_name }}", dtype="str",
))
```

Set `with_synthetic_personas=True` when the dataset benefits from personality traits, interests, cultural background, or detailed persona descriptions (e.g., for realistic user simulation or persona-driven prompting). This option is only available with `"person"` — `"person_from_faker"` does not support it.

## Person Object Schema

Fields vary by locale. Always run the following script to get the exact schema for the locale you are using (script path is relative to this skill's directory):

```bash
python scripts/get_person_object_schema.py <locale>
```

This prints the PII fields (always included) and synthetic persona fields (only included when `with_synthetic_personas=True`) available for that locale.
````
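The expression-column extraction above boils down to dotted access on a nested dict. A stdlib-only sketch of what `{{ person.first_name }} {{ person.last_name }}` renders (field names and values are illustrative — run the schema script for the real per-locale list):

```python
# A sampled person column is a nested dict; these fields are illustrative
person = {"first_name": "Ada", "last_name": "Lovelace", "city": "London"}

# The Jinja2 expression "{{ person.first_name }} {{ person.last_name }}"
# renders the same string as:
full_name = f"{person['first_name']} {person['last_name']}"
print(full_name)  # -> Ada Lovelace
```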
**`references/seed-datasets.md`** (new file, 14 lines):

```markdown
# Seed Datasets Reference

Seed datasets bootstrap synthetic data generation from existing data. Every column from the seed becomes a Jinja2 variable you can reference in prompts and expressions — the seed provides realism and domain specificity, and Data Designer adds volume and variation on top.

## Before configuring a seed source

1. **Read the source code.** Read `seed_source.py` under the config root directory printed by `data-designer agent context`. This file contains all seed source classes and their parameters. Do not guess types or parameters.
2. **Verify the dataset is readable and fetch column names.** Before wiring the seed into the config, confirm the file can be read and extract its column names. This catches bad paths and corrupt files, and gives you the exact column names available for downstream prompts.

## Notes

- The most common seed source is `LocalFileSeedSource` (local file on disk). Supported formats: `.parquet`, `.csv`, `.json`, `.jsonl`.
- Seed columns are automatically registered as `SeedDatasetColumnConfig` entries — you do **not** add them manually. Just reference them by name in downstream prompts and expressions.
```
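Step 2 of the checklist above (verify readability, fetch column names) can be sketched with the standard library alone. `seed_columns` is a hypothetical helper, not part of Data Designer, and only `.csv`/`.jsonl` are handled here:

```python
import csv
import json
import pathlib


def seed_columns(path: str) -> list[str]:
    """Return the seed file's column names, failing fast on bad paths/formats."""
    p = pathlib.Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    if p.suffix == ".csv":
        with p.open(newline="") as f:
            return next(csv.reader(f))  # header row
    if p.suffix == ".jsonl":
        with p.open() as f:
            return list(json.loads(f.readline()).keys())  # keys of first record
    raise ValueError(f"unhandled seed format: {p.suffix}")
```

Running this before building the config surfaces a bad path as a clear exception instead of a mid-generation failure, and the returned names are exactly what you can reference in downstream prompts.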
**`scripts/get_person_object_schema.py`** (new file, 48 lines):

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Inspect a locale's managed persona dataset and print its available fields.

Fields are split into two groups based on the with_synthetic_personas setting:
- PII fields: always included in person sampling
- SYNTHETIC PERSONA fields: only included when with_synthetic_personas=True

Usage: python get_person_object_schema.py <locale>
Example: python get_person_object_schema.py en_US
"""

from __future__ import annotations

import sys

import pyarrow.parquet as pq

from data_designer.config.utils.constants import MANAGED_ASSETS_PATH
from data_designer.engine.sampling_gen.entities.dataset_based_person_fields import PERSONA_FIELDS, PII_FIELDS


def main(locale: str) -> None:
    path = MANAGED_ASSETS_PATH / f"datasets/{locale}.parquet"
    if not path.exists():
        print(f"Error: locale '{locale}' does not exist (no dataset at {path})", file=sys.stderr)
        sys.exit(1)

    schema = {field.name: str(field.type) for field in pq.read_schema(path)}

    pii = {k: v for k, v in schema.items() if k in PII_FIELDS and v != "null"}
    persona = {k: v for k, v in schema.items() if k in PERSONA_FIELDS and v != "null"}

    print(f"=== {locale} PII fields (always included) ({len(pii)}) ===")
    for name, dtype in pii.items():
        print(f"  {name}: {dtype}")

    print(f"\n=== {locale} SYNTHETIC PERSONA fields (with_synthetic_personas=True) ({len(persona)}) ===")
    for name, dtype in persona.items():
        print(f"  {name}: {dtype}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <locale>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
```
**`workflows/autopilot.md`** (new file, 26 lines):

```markdown
# Autopilot Workflow

In this mode, make reasonable design decisions autonomously based on the dataset description. Do not ask clarifying questions — infer sensible defaults and move straight through to a working preview.

1. **Learn** — Run `data-designer agent context`.
   - If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding.
   - Inspect schemas for every column, sampler type, validator, and processor you plan to use.
   - Never guess types or parameters — read the relevant config files first.
   - Always read `base.py` for inherited fields shared by all config objects.
2. **Infer** — Based on the dataset description, make reasonable decisions for:
   - Axes of diversity and what should be well represented.
   - Which variables to randomize.
   - The schema of the final dataset.
   - The structure of any structured output columns.
   - Briefly state the key decisions you made so the user can course-correct if needed.
3. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed.
4. **Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md).
5. **Validate** — Run `data-designer validate <path>`. Address any warnings or errors and re-validate until it passes.
6. **Preview** — Run `data-designer preview <path> --save-results` to generate sample records as HTML files.
   - Note the sample records directory printed by the `data-designer preview` command.
   - Give the user a clickable link: `file://<sample-records-dir>/sample_records_browser.html`
7. **Create** — If the user specified a record count:
   - 50 or fewer: run `data-designer create <path> --num-records <N> --dataset-name <name>` directly.
   - More than 50: warn that generation can take a long time and ask for confirmation before running.
   - If no record count was specified, skip this step.
8. **Present** — Summarize what was built: columns, samplers used, key design choices. If the create command was run, share the results. Ask the user if they want any changes. If so, edit the script, re-validate, re-preview, and iterate.
```