Merged
Commits
28 commits
3a1c934
add skill
johnnygreco Mar 18, 2026
1583639
remove quotes from hint
johnnygreco Mar 18, 2026
47bd085
add internal metadata to .claude skills
johnnygreco Mar 18, 2026
f4a04e5
address review feedback: fix typo, clarify config_root, use dynamic port
johnnygreco Mar 18, 2026
ba3327b
address review feedback: use --directory flag, add server cleanup note
johnnygreco Mar 18, 2026
aefd072
improve preview server reliability and sandbox error handling
johnnygreco Mar 18, 2026
42b8dd5
ensure venv creation before installing data-designer
johnnygreco Mar 18, 2026
6031d84
verify server via background task output, not curl probing
johnnygreco Mar 18, 2026
a81afa8
replace HTTP server with file:// link for preview, add push_to_hub gu…
johnnygreco Mar 18, 2026
4068a76
add --dataset-name to create command, remove push_to_hub notes
johnnygreco Mar 18, 2026
d6878b8
update custom column example
johnnygreco Mar 18, 2026
acff87f
remove schema transform pitfall which is about to be fixed
johnnygreco Mar 18, 2026
656c6a0
tighten agent skill: remove redundancy, add missing interactive guidance
johnnygreco Mar 18, 2026
0ae610c
clarify interactive plan step: ask for changes before generating preview
johnnygreco Mar 18, 2026
666e743
improve structured question tool guidance in interactive workflow
johnnygreco Mar 18, 2026
fbb11d6
merge structured question tool guidance with UX bullet point
johnnygreco Mar 18, 2026
b81572e
soften column-dropping rule to allow dropping helper columns
johnnygreco Mar 18, 2026
fe70161
default model alias to appropriate generation_type per column
johnnygreco Mar 18, 2026
1a1ed2f
clarify missing model aliases: suggest running data-designer config
johnnygreco Mar 18, 2026
8f10ea9
close the loop on persona dataset locale check
johnnygreco Mar 18, 2026
05d4c24
add locale schema script and simplify person-sampling reference
johnnygreco Mar 18, 2026
d3ec31b
clarify script path is relative to skill directory
johnnygreco Mar 18, 2026
d5dd642
minor wording tweak in person-sampling reference
johnnygreco Mar 18, 2026
866ac71
remove redundant available locales section from person-sampling ref
johnnygreco Mar 18, 2026
285dfff
tweak
johnnygreco Mar 19, 2026
91bd843
pydantic is always included with data-designer
johnnygreco Mar 19, 2026
88f7c68
imports tweak
johnnygreco Mar 19, 2026
811cf76
Merge branch 'main' into johnny-310-data-designer-got-skill
johnnygreco Mar 19, 2026
2 changes: 2 additions & 0 deletions .claude/skills/commit/SKILL.md
@@ -3,6 +3,8 @@ name: commit
description: Commit current changes with a clear, descriptive message
argument-hint: [special instructions]
disable-model-invocation: true
metadata:
internal: true
---

# Commit Changes
2 changes: 2 additions & 0 deletions .claude/skills/create-pr/SKILL.md
@@ -3,6 +3,8 @@ name: create-pr
description: Create a GitHub PR with a well-formatted description including summary, categorized changes, and attention areas
argument-hint: [special instructions]
disable-model-invocation: true
metadata:
internal: true
---

# Create Pull Request
22 changes: 12 additions & 10 deletions .claude/skills/new-sdg/SKILL.md
@@ -3,6 +3,8 @@ name: new-sdg
description: Implement a new synthetic data generator using NeMo Data Designer by defining its configuration and executing a preview job.
argument-hint: <dataset-description>
disable-model-invocation: true
metadata:
internal: true
---

# Your Goal
@@ -18,13 +20,13 @@ Implement a new synthetic data generator using NeMo Data Designer to match the u
The user will provide you with some description, but it is likely that you
do not have enough information to precisely define what they want. It is hard
for a user to define everything up front. Ask follow-up questions to the user
using the AskUser tool to narrow down on precisely what they want.

Common things to make precise are:

- IMPORTANT: What the "axes of diversity" are -- e.g. what should be well represented and diverse in the resulting dataset.
- The kind and nature of any input data to the dataset.
- What variables should be randomized.
- The schema of the final dataset.
- The structure of any required structured output columns.
- What facets of the output dataset are important to capture.
@@ -40,22 +42,22 @@ Common things to make precise are:
> USER: Respond
> YOU: ...repeat...

Very often, the initial implementation will not conform precisely to what the user wants. You are to engage in an **iterative design loop** with the user. As shown
in the example below, you will construct a configuration, then review its outputs,
present those outputs to the user, and ask follow-up questions.

Depending on the user's responses, you will then edit the script, re-run it, present the user with the results, and ask follow-ups, and so on. When showing results to the user DO NOT SUMMARIZE content; it is *very important* that you show them the records as-is so they can make thoughtful decisions.

DO NOT disengage from this **iterative design loop** unless commanded by the user.


## Implementing a NeMo Data Designer Synthetic Data Generator

- You will be writing a new python script for execution.
- The script should be made in the current working directory, so `$(pwd)/script-name.py`.
- Implement the script as a stand-alone, `uv`-executable script (https://docs.astral.sh/uv/guides/scripts/#creating-a-python-script).
- The script should depend on the latest version of `data-designer`.
- Include other third-party dependencies only if the job requires it.
- Model aliases are required when defining LLM generation columns.
- Before implementing, make sure to use the Explore tool to understand the src/ and docs/.
- Review available model aliases and providers.
@@ -73,7 +75,7 @@ uv run --with data-designer data-designer config list

### Real World Seed Data

Depending on user requirements, you may need to access real-world datasets to serve as Seed datasets for your Data Designer SDG.
In these cases, you may use Web Search tools to search for datasets available on HuggingFace, and use the `datasets` python library
to load them. You will have to convert them to Pandas DataFrames in these cases.
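As a sketch of that flow (the dataset id below is a placeholder, and the `datasets` package must be added to the script's dependencies):

```python
# Load a Hugging Face dataset and convert it to a Pandas DataFrame
# so it can serve as a Data Designer seed. Requires the `datasets` package.
from datasets import load_dataset

ds = load_dataset("some-org/some-dataset", split="train")  # placeholder dataset id
seed_df = ds.to_pandas()

# Inspect size and columns before committing to this seed
print(len(seed_df), list(seed_df.columns))
```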

@@ -88,7 +90,7 @@ If you do use real-world data, pay attention to file sizes and avoid large file
# ]
# ///

# ... data designer config_builder implementation

def build_config() -> DataDesignerConfigBuilder:
    """Implements the definition of the synthetic data generator.
@@ -112,7 +114,7 @@ if __name__ == "__main__":
    preview.display_sample_record()

    # The raw data is located in this Pandas DataFrame object.
    # You can implement code to display some or all of this
    # to STDOUT so you can see the outputs and report to the user.
    preview.dataset
```
2 changes: 2 additions & 0 deletions .claude/skills/review-code/SKILL.md
@@ -3,6 +3,8 @@ name: review-code
description: Perform a thorough code review of the current branch or a GitHub PR by number.
argument-hint: [pr-number] [special instructions]
disable-model-invocation: true
metadata:
internal: true
---

# Review Code Changes
2 changes: 2 additions & 0 deletions .claude/skills/search-docs/SKILL.md
@@ -2,6 +2,8 @@
name: search-docs
description: Search local documentation in the docs/ folder for content related to a topic
argument-hint: <search-topic>
metadata:
internal: true
---

# Documentation Search
2 changes: 2 additions & 0 deletions .claude/skills/search-github/SKILL.md
@@ -2,6 +2,8 @@
name: search-github
description: Search GitHub issues, discussions, and PRs for content related to a topic
argument-hint: <search-topic>
metadata:
internal: true
---

# GitHub Search
2 changes: 2 additions & 0 deletions .claude/skills/update-pr/SKILL.md
@@ -3,6 +3,8 @@ name: update-pr
description: Update an existing GitHub PR description to reflect current changes after incorporating feedback
argument-hint: [special instructions]
disable-model-invocation: true
metadata:
internal: true
---

# Update Pull Request
91 changes: 91 additions & 0 deletions skills/data-designer/SKILL.md
@@ -0,0 +1,91 @@
---
name: data-designer
description: Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.
argument-hint: [describe the dataset you want to generate]
---

# Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

# Goal

Build a synthetic dataset using the Data Designer library that matches this description:

$ARGUMENTS

# Workflow

Use **Autopilot** mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use **Interactive** mode (default).

Read **only** the workflow file that matches the selected mode, then follow it:

- **Interactive** → read `workflows/interactive.md`
- **Autopilot** → read `workflows/autopilot.md`

# Rules

- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.

# Usage Tips and Common Pitfalls

- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`.
- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.
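A minimal sketch tying these tips together. Column names are illustrative, the `values` parameter name on `CategorySamplerParams` is an assumption (check the params class schema), and the judge-score expression assumes a judge column named `quality` with a `correctness` score is defined elsewhere in the config:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder()

# Sampler columns pair a sampler_type with its matching params class
config_builder.add_column(dd.SamplerColumnConfig(
    name="topic",
    sampler_type="category",
    params=dd.CategorySamplerParams(values=["billing", "shipping", "returns"]),  # param name assumed
))

# Jinja2 templating: {{ topic }} references a column; nested fields use dots.
# {{ quality.correctness.score }} pulls the numeric score, not the full dict.
config_builder.add_column(dd.ExpressionColumnConfig(
    name="correctness_score",
    expr="{{ quality.correctness.score }}",
    dtype="str",
))
```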

# Troubleshooting

- **`data-designer` command not found:** If no virtual environment exists, create one first (`python -m venv .venv && source .venv/bin/activate`), then install (`pip install data-designer`). If a virtual environment already exists, activate it and verify the package is installed.
- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.

# Output Template

Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#     "data-designer",  # always required
#     "pydantic",  # only if this script imports from pydantic
#     # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.
46 changes: 46 additions & 0 deletions skills/data-designer/references/person-sampling.md
@@ -0,0 +1,46 @@
# Person Sampling Reference

## Sampler types

Prefer `"person"` when the locale is downloaded — it provides census-grounded demographics and optional personality traits. Fall back to `"person_from_faker"` when the locale isn't available.


| `sampler_type` | Params class | When to use |
| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------- |
| `"person"` | `PersonSamplerParams` | **Preferred.** Locale downloaded to `~/.data-designer/managed-assets/datasets/` by default. |
| `"person_from_faker"` | `PersonFromFakerSamplerParams` | Fallback when locale not downloaded. Basic names/addresses via Faker, not demographically accurate. |


## Usage

The sampled person column is a nested dict. You can keep it as-is in the final dataset, or set `drop=True` to remove it and extract only the fields you need via `ExpressionColumnConfig`:

```python
# Keep the full person dict in the output
config_builder.add_column(dd.SamplerColumnConfig(
name="person", sampler_type="person",
params=dd.PersonSamplerParams(locale="en_US"),
))

# Or drop it and extract specific fields
config_builder.add_column(dd.SamplerColumnConfig(
name="person", sampler_type="person",
params=dd.PersonSamplerParams(locale="en_US"), drop=True,
))
config_builder.add_column(dd.ExpressionColumnConfig(
name="full_name",
expr="{{ person.first_name }} {{ person.last_name }}", dtype="str",
))
```

Set `with_synthetic_personas=True` when the dataset benefits from personality traits, interests, cultural background, or detailed persona descriptions (e.g., for realistic user simulation or persona-driven prompting). This option is only available with `"person"` — `"person_from_faker"` does not support it.
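As a sketch, this assumes the flag is passed through `PersonSamplerParams`; verify the exact field name against the params class schema:

```python
# Person sampler with synthetic persona traits enabled (flag placement assumed)
config_builder.add_column(dd.SamplerColumnConfig(
    name="person",
    sampler_type="person",
    params=dd.PersonSamplerParams(locale="en_US", with_synthetic_personas=True),
))
```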

## Person Object Schema

Fields vary by locale. Always run the following script to get the exact schema for the locale you are using (script path is relative to this skill's directory):

```bash
python scripts/get_person_object_schema.py <locale>
```

This prints the PII fields (always included) and synthetic persona fields (only included when `with_synthetic_personas=True`) available for that locale.
14 changes: 14 additions & 0 deletions skills/data-designer/references/seed-datasets.md
@@ -0,0 +1,14 @@
# Seed Datasets Reference

Seed datasets bootstrap synthetic data generation from existing data. Every column from the seed becomes a Jinja2 variable you can reference in prompts and expressions — the seed provides realism and domain specificity, and Data Designer adds volume and variation on top.

## Before configuring a seed source

1. **Read the source code.** Read `seed_source.py` under the config root directory printed by `data-designer agent context`. This file contains all seed source classes and their parameters. Do not guess types or parameters.

2. **Verify the dataset is readable and fetch column names.** Before wiring the seed into the config, confirm the file can be read and extract its column names. This catches bad paths and corrupt files, and gives you the exact column names available for downstream prompts.
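One way to run that check, sketched with Pandas (the helper name and extension handling are illustrative, not part of Data Designer):

```python
import pandas as pd


def inspect_seed(path: str) -> list[str]:
    """Read a candidate seed file and return its column names."""
    if path.endswith(".parquet"):
        df = pd.read_parquet(path)
    elif path.endswith(".csv"):
        df = pd.read_csv(path)
    elif path.endswith(".jsonl"):
        df = pd.read_json(path, lines=True)
    else:  # .json
        df = pd.read_json(path)
    print(f"{len(df)} rows; columns: {list(df.columns)}")
    return list(df.columns)
```

Running this before wiring the seed in surfaces bad paths, corrupt files, and the exact column names available for downstream prompts.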

## Notes

- The most common seed source is `LocalFileSeedSource` (local file on disk). Supported formats: `.parquet`, `.csv`, `.json`, `.jsonl`.
- Seed columns are automatically registered as `SeedDatasetColumnConfig` entries — you do **not** add them manually. Just reference them by name in downstream prompts and expressions.
48 changes: 48 additions & 0 deletions skills/data-designer/scripts/get_person_object_schema.py
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Inspect a locale's managed persona dataset and print its available fields.

Fields are split into two groups based on the with_synthetic_personas setting:
- PII fields: always included in person sampling
- SYNTHETIC PERSONA fields: only included when with_synthetic_personas=True

Usage: python get_person_object_schema.py <locale>
Example: python get_person_object_schema.py en_US
"""

from __future__ import annotations

import sys

import pyarrow.parquet as pq

from data_designer.config.utils.constants import MANAGED_ASSETS_PATH
from data_designer.engine.sampling_gen.entities.dataset_based_person_fields import PERSONA_FIELDS, PII_FIELDS


def main(locale: str) -> None:
    path = MANAGED_ASSETS_PATH / f"datasets/{locale}.parquet"
    if not path.exists():
        print(f"Error: locale '{locale}' does not exist (no dataset at {path})", file=sys.stderr)
        sys.exit(1)

    schema = {field.name: str(field.type) for field in pq.read_schema(path)}

    pii = {k: v for k, v in schema.items() if k in PII_FIELDS and v != "null"}
    persona = {k: v for k, v in schema.items() if k in PERSONA_FIELDS and v != "null"}

    print(f"=== {locale} PII fields (always included) ({len(pii)}) ===")
    for name, dtype in pii.items():
        print(f"  {name}: {dtype}")

    print(f"\n=== {locale} SYNTHETIC PERSONA fields (with_synthetic_personas=True) ({len(persona)}) ===")
    for name, dtype in persona.items():
        print(f"  {name}: {dtype}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <locale>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
26 changes: 26 additions & 0 deletions skills/data-designer/workflows/autopilot.md
@@ -0,0 +1,26 @@
# Autopilot Workflow

In this mode, make reasonable design decisions autonomously based on the dataset description. Do not ask clarifying questions — infer sensible defaults and move straight through to a working preview.

1. **Learn** — Run `data-designer agent context`.
- If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding.
- Inspect schemas for every column, sampler type, validator, and processor you plan to use.
- Never guess types or parameters — read the relevant config files first.
- Always read `base.py` for inherited fields shared by all config objects.
2. **Infer** — Based on the dataset description, make reasonable decisions for:
- Axes of diversity and what should be well represented.
- Which variables to randomize.
- The schema of the final dataset.
- The structure of any structured output columns.
- Briefly state the key decisions you made so the user can course-correct if needed.
3. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed.
4. **Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md).
5. **Validate** — Run `data-designer validate <path>`. Address any warnings or errors and re-validate until it passes.
6. **Preview** — Run `data-designer preview <path> --save-results` to generate sample records as HTML files.
- Note the sample records directory printed by the `data-designer preview` command.
- Give the user a clickable link: `file://<sample-records-dir>/sample_records_browser.html`
7. **Create** — If the user specified a record count:
- 50 or fewer: run `data-designer create <path> --num-records <N> --dataset-name <name>` directly.
- More than 50: warn that generation can take a long time and ask for confirmation before running.
- If no record count was specified, skip this step.
8. **Present** — Summarize what was built: columns, samplers used, key design choices. If the create command was run, share the results. Ask the user if they want any changes. If so, edit the script, re-validate, re-preview, and iterate.