diff --git a/.claude/skills/commit/SKILL.md b/.claude/skills/commit/SKILL.md index 8bb69a138..b2e1c15de 100644 --- a/.claude/skills/commit/SKILL.md +++ b/.claude/skills/commit/SKILL.md @@ -3,6 +3,8 @@ name: commit description: Commit current changes with a clear, descriptive message argument-hint: [special instructions] disable-model-invocation: true +metadata: + internal: true --- # Commit Changes diff --git a/.claude/skills/create-pr/SKILL.md b/.claude/skills/create-pr/SKILL.md index 85ca1b3a0..b3f4fe941 100644 --- a/.claude/skills/create-pr/SKILL.md +++ b/.claude/skills/create-pr/SKILL.md @@ -3,6 +3,8 @@ name: create-pr description: Create a GitHub PR with a well-formatted description including summary, categorized changes, and attention areas argument-hint: [special instructions] disable-model-invocation: true +metadata: + internal: true --- # Create Pull Request diff --git a/.claude/skills/new-sdg/SKILL.md b/.claude/skills/new-sdg/SKILL.md index ce889bffb..db9081931 100644 --- a/.claude/skills/new-sdg/SKILL.md +++ b/.claude/skills/new-sdg/SKILL.md @@ -3,6 +3,8 @@ name: new-sdg description: Implement a new synthetic data generator using NeMo Data Designer by defining its configuration and executing a preview job. argument-hint: disable-model-invocation: true +metadata: + internal: true --- # Your Goal @@ -18,13 +20,13 @@ Implement a new synthetic data generator using NeMo Data Designer to match the u The user will provide you with some description, but it is likely that you do not have enough information to precisely define what they want. It is hard for a user to define everything up front. Ask follow up questions to the user -using the AskUser tool to narrow down on precisely what they want. +using the AskUser tool to narrow down on precisely what they want. Common things to make precise are: - IMPORTANT: What the "axes of diversity" are -- e.g. what should be well represented and diverse in the resulting dataset. 
- The kind and nature of any input data to the dataset. -- What variables should be randomized. +- What variables should be randomized. - The schema of the final dataset. - The structure of any required structured output columns. - What facets of the output dataset are important to capture. @@ -40,22 +42,22 @@ Common things to make precise are: > USER: Respond > YOU: ...repeat... -Very often, the initial implementation will not conform precisely to what the user wants. You are to engage in an **iterative design loop** with the user. As shown +Very often, the initial implementation will not conform precisely to what the user wants. You are to engage in an **iterative design loop** with the user. As shown in the example below, you will construct a configuration, then review its outputs, -present those outputs to the user, and ask follow up questions. +present those outputs to the user, and ask follow up questions. Depending on the user responses, you will then edit the script, re-run it, present the user with the results, and ask follow-ups, and so on. When showing results to the user, DO NOT SUMMARIZE content; it is *very important* that you show them the records as-is so they can make thoughtful decisions. DO NOT disengage from this **iterative design loop** unless commanded by the user. -## Implementing a NeMo Data Designer Synthetic Data Generator +## Implementing a NeMo Data Designer Synthetic Data Generator - You will be writing a new python script for execution. - The script should be made in the current working directory, so `$(pwd)/script-name.py`. - Implement the script as a stand-alone, `uv`-executable script (https://docs.astral.sh/uv/guides/scripts/#creating-a-python-script). - The script should depend on the latest version of `data-designer`. -- Include other third-party dependencies only if the job requires it. +- Include other third-party dependencies only if the job requires it. - Model aliases are required when defining LLM generation columns.
- Before implementing, make sure to use the Explore tool to understand the src/ and docs/. - Review available model aliases and providers. @@ -73,7 +75,7 @@ uv run --with data-designer data-designer config list ### Real World Seed Data -Depending on user requirements, you may need to access real-world datasets to serve as Seed datasets for your Data Designer SDG. +Depending on user requirements, you may need to access real-world datasets to serve as Seed datasets for your Data Designer SDG. In these cases, you may use Web Search tools to search for datasets available on HuggingFace, and use the `datasets` python library to load them. You will have to convert them to Pandas DataFrames in these cases. @@ -88,7 +90,7 @@ If you do use real-world data, pay attention to file sizes and avoid large file # ] # /// -# ... data designer config_builder implementation +# ... data designer config_builder implementation def build_config() -> DataDesignerConfigBuilder: """Implements the definition of the synthetic data generator. @@ -112,7 +114,7 @@ if __name__ == "__main__": preview.display_sample_record() # The raw data is located in this Pandas DataFrame object. - # You can implenent code to display some or all of this + # You can implement code to display some or all of this # to STDOUT so you can see the outputs and report to the user. preview.dataset -``` \ No newline at end of file +``` diff --git a/.claude/skills/review-code/SKILL.md b/.claude/skills/review-code/SKILL.md index ee817f508..922afb0bf 100644 --- a/.claude/skills/review-code/SKILL.md +++ b/.claude/skills/review-code/SKILL.md @@ -3,6 +3,8 @@ name: review-code description: Perform a thorough code review of the current branch or a GitHub PR by number.
argument-hint: [pr-number] [special instructions] disable-model-invocation: true +metadata: + internal: true --- # Review Code Changes diff --git a/.claude/skills/search-docs/SKILL.md b/.claude/skills/search-docs/SKILL.md index f0898a46a..00989683b 100644 --- a/.claude/skills/search-docs/SKILL.md +++ b/.claude/skills/search-docs/SKILL.md @@ -2,6 +2,8 @@ name: search-docs description: Search local documentation in the docs/ folder for content related to a topic argument-hint: +metadata: + internal: true --- # Documentation Search diff --git a/.claude/skills/search-github/SKILL.md b/.claude/skills/search-github/SKILL.md index 9c00e422d..324d6c366 100644 --- a/.claude/skills/search-github/SKILL.md +++ b/.claude/skills/search-github/SKILL.md @@ -2,6 +2,8 @@ name: search-github description: Search GitHub issues, discussions, and PRs for content related to a topic argument-hint: +metadata: + internal: true --- # GitHub Search diff --git a/.claude/skills/update-pr/SKILL.md b/.claude/skills/update-pr/SKILL.md index 0f4b77752..69bf944f1 100644 --- a/.claude/skills/update-pr/SKILL.md +++ b/.claude/skills/update-pr/SKILL.md @@ -3,6 +3,8 @@ name: update-pr description: Update an existing GitHub PR description to reflect current changes after incorporating feedback argument-hint: [special instructions] disable-model-invocation: true +metadata: + internal: true --- # Update Pull Request diff --git a/skills/data-designer/SKILL.md b/skills/data-designer/SKILL.md new file mode 100644 index 000000000..ddee328a0 --- /dev/null +++ b/skills/data-designer/SKILL.md @@ -0,0 +1,91 @@ +--- +name: data-designer +description: Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline. +argument-hint: [describe the dataset you want to generate] +--- + +# Before You Start + +Do not explore the workspace first. The workflow's Learn step gives you everything you need. 
+ +# Goal + +Build a synthetic dataset using the Data Designer library that matches this description: + +$ARGUMENTS + +# Workflow + +Use **Autopilot** mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use **Interactive** mode (default). + +Read **only** the workflow file that matches the selected mode, then follow it: + +- **Interactive** → read `workflows/interactive.md` +- **Autopilot** → read `workflows/autopilot.md` + +# Rules + +- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column. +- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`. +- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`. +- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one. + +# Usage Tips and Common Pitfalls + +- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`. +- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`. +- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`. +- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. 
For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score. + +# Troubleshooting + +- **`data-designer` command not found:** If no virtual environment exists, create one first (`python -m venv .venv && source .venv/bin/activate`), then install (`pip install data-designer`). If a virtual environment already exists, activate it and verify the package is installed. +- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves. + +# Output Template + +Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies. 
+ +```python +# /// script +# dependencies = [ +# "data-designer", # always required +# "pydantic", # only if this script imports from pydantic +# # add additional dependencies here +# ] +# /// +import data_designer.config as dd +from pydantic import BaseModel, Field + + +# Use Pydantic models when the output needs to conform to a specific schema +class MyStructuredOutput(BaseModel): + field_one: str = Field(description="...") + field_two: int = Field(description="...") + + +# Use custom generators when built-in column types aren't enough +@dd.custom_column_generator( + required_columns=["col_a"], + side_effect_columns=["extra_col"], +) +def generator_function(row: dict) -> dict: + # add custom logic here that depends on "col_a" and update row in place + row["name_in_custom_column_config"] = "custom value" + row["extra_col"] = "extra value" + return row + + +def load_config_builder() -> dd.DataDesignerConfigBuilder: + config_builder = dd.DataDesignerConfigBuilder() + + # Seed dataset (only if the user explicitly mentions a seed dataset path) + # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet")) + + # config_builder.add_column(...) + # config_builder.add_processor(...) + + return config_builder +``` + +Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them. diff --git a/skills/data-designer/references/person-sampling.md b/skills/data-designer/references/person-sampling.md new file mode 100644 index 000000000..0410da761 --- /dev/null +++ b/skills/data-designer/references/person-sampling.md @@ -0,0 +1,46 @@ +# Person Sampling Reference + +## Sampler types + +Prefer `"person"` when the locale is downloaded — it provides census-grounded demographics and optional personality traits. Fall back to `"person_from_faker"` when the locale isn't available. 
+ + +| `sampler_type` | Params class | When to use | +| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------- | +| `"person"` | `PersonSamplerParams` | **Preferred.** Locale downloaded to `~/.data-designer/managed-assets/datasets/` by default. | +| `"person_from_faker"` | `PersonFromFakerSamplerParams` | Fallback when locale not downloaded. Basic names/addresses via Faker, not demographically accurate. | + + +## Usage + +The sampled person column is a nested dict. You can keep it as-is in the final dataset, or set `drop=True` to remove it and extract only the fields you need via `ExpressionColumnConfig`: + +```python +# Keep the full person dict in the output +config_builder.add_column(dd.SamplerColumnConfig( + name="person", sampler_type="person", + params=dd.PersonSamplerParams(locale="en_US"), +)) + +# Or drop it and extract specific fields +config_builder.add_column(dd.SamplerColumnConfig( + name="person", sampler_type="person", + params=dd.PersonSamplerParams(locale="en_US"), drop=True, +)) +config_builder.add_column(dd.ExpressionColumnConfig( + name="full_name", + expr="{{ person.first_name }} {{ person.last_name }}", dtype="str", +)) +``` + +Set `with_synthetic_personas=True` when the dataset benefits from personality traits, interests, cultural background, or detailed persona descriptions (e.g., for realistic user simulation or persona-driven prompting). This option is only available with `"person"` — `"person_from_faker"` does not support it. + +## Person Object Schema + +Fields vary by locale. Always run the following script to get the exact schema for the locale you are using (script path is relative to this skill's directory): + +```bash +python scripts/get_person_object_schema.py <locale> +```
diff --git a/skills/data-designer/references/seed-datasets.md b/skills/data-designer/references/seed-datasets.md new file mode 100644 index 000000000..86e96c745 --- /dev/null +++ b/skills/data-designer/references/seed-datasets.md @@ -0,0 +1,14 @@ +# Seed Datasets Reference + +Seed datasets bootstrap synthetic data generation from existing data. Every column from the seed becomes a Jinja2 variable you can reference in prompts and expressions — the seed provides realism and domain specificity, and Data Designer adds volume and variation on top. + +## Before configuring a seed source + +1. **Read the source code.** Read `seed_source.py` under the config root directory printed by `data-designer agent context`. This file contains all seed source classes and their parameters. Do not guess types or parameters. + +2. **Verify the dataset is readable and fetch column names.** Before wiring the seed into the config, confirm the file can be read and extract its column names. This catches bad paths and corrupt files, and gives you the exact column names available for downstream prompts. + +## Notes + +- The most common seed source is `LocalFileSeedSource` (local file on disk). Supported formats: `.parquet`, `.csv`, `.json`, `.jsonl`. +- Seed columns are automatically registered as `SeedDatasetColumnConfig` entries — you do **not** add them manually. Just reference them by name in downstream prompts and expressions. diff --git a/skills/data-designer/scripts/get_person_object_schema.py b/skills/data-designer/scripts/get_person_object_schema.py new file mode 100644 index 000000000..ed2b42029 --- /dev/null +++ b/skills/data-designer/scripts/get_person_object_schema.py @@ -0,0 +1,48 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Inspect a locale's managed persona dataset and print its available fields. 
+ +Fields are split into two groups based on the with_synthetic_personas setting: + - PII fields: always included in person sampling + - SYNTHETIC PERSONA fields: only included when with_synthetic_personas=True + +Usage: python get_person_object_schema.py <locale> +Example: python get_person_object_schema.py en_US +""" + +from __future__ import annotations + +import sys + +import pyarrow.parquet as pq + +from data_designer.config.utils.constants import MANAGED_ASSETS_PATH +from data_designer.engine.sampling_gen.entities.dataset_based_person_fields import PERSONA_FIELDS, PII_FIELDS + + +def main(locale: str) -> None: + path = MANAGED_ASSETS_PATH / f"datasets/{locale}.parquet" + if not path.exists(): + print(f"Error: locale '{locale}' does not exist (no dataset at {path})", file=sys.stderr) + sys.exit(1) + + schema = {field.name: str(field.type) for field in pq.read_schema(path)} + + pii = {k: v for k, v in schema.items() if k in PII_FIELDS and v != "null"} + persona = {k: v for k, v in schema.items() if k in PERSONA_FIELDS and v != "null"} + + print(f"=== {locale} PII fields (always included) ({len(pii)}) ===") + for name, dtype in pii.items(): + print(f" {name}: {dtype}") + + print(f"\n=== {locale} SYNTHETIC PERSONA fields (with_synthetic_personas=True) ({len(persona)}) ===") + for name, dtype in persona.items(): + print(f" {name}: {dtype}") + + +if __name__ == "__main__": + if len(sys.argv) != 2: + print(f"Usage: {sys.argv[0]} <locale>", file=sys.stderr) + sys.exit(1) + main(sys.argv[1]) diff --git a/skills/data-designer/workflows/autopilot.md b/skills/data-designer/workflows/autopilot.md new file mode 100644 index 000000000..4fd084898 --- /dev/null +++ b/skills/data-designer/workflows/autopilot.md @@ -0,0 +1,26 @@ +# Autopilot Workflow + +In this mode, make reasonable design decisions autonomously based on the dataset description. Do not ask clarifying questions — infer sensible defaults and move straight through to a working preview. + +1.
**Learn** — Run `data-designer agent context`. + - If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding. + - Inspect schemas for every column, sampler type, validator, and processor you plan to use. + - Never guess types or parameters — read the relevant config files first. + - Always read `base.py` for inherited fields shared by all config objects. +2. **Infer** — Based on the dataset description, make reasonable decisions for: + - Axes of diversity and what should be well represented. + - Which variables to randomize. + - The schema of the final dataset. + - The structure of any structured output columns. + - Briefly state the key decisions you made so the user can course-correct if needed. +3. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed. +4. **Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md). +5. **Validate** — Run `data-designer validate <script>`. Address any warnings or errors and re-validate until it passes. +6. **Preview** — Run `data-designer preview <script> --save-results` to generate sample records as HTML files. + - Note the sample records directory printed by the `data-designer preview` command. + - Give the user a clickable link: `file:///<sample-records-dir>/sample_records_browser.html` +7. **Create** — If the user specified a record count: + - 50 or fewer: run `data-designer create <script> --num-records <N> --dataset-name <name>` directly. + - More than 50: warn that generation can take a long time and ask for confirmation before running. + - If no record count was specified, skip this step. +8.
diff --git a/skills/data-designer/workflows/interactive.md b/skills/data-designer/workflows/interactive.md new file mode 100644 index 000000000..81d22c943 --- /dev/null +++ b/skills/data-designer/workflows/interactive.md @@ -0,0 +1,30 @@ +# Interactive Workflow + +This is an interactive, iterative design process. Do not disengage from the loop unless the user says they are satisfied. + +1. **Learn** — Run `data-designer agent context`. + - If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding. + - Inspect schemas for every column, sampler type, validator, and processor you plan to use. + - Never guess types or parameters — read the relevant config files first. + - Always read `base.py` for inherited fields shared by all config objects. +2. **Clarify** — Ask the user clarifying questions to narrow down precisely what they want. + - Optimize for a great user experience: prefer a structured question tool over plain text if one is available, batch related questions together, keep the set short, provide concrete options/examples/defaults where possible, and use structured inputs (single-select, multi-select, free text, etc.) when they make answering easier. + - If multiple model aliases are available, ask which one(s) to use (or default to an alias with the appropriate `generation_type` for each column). + - Common things to make precise: + - What the "axes of diversity" are — what should be well represented and diverse in the resulting dataset. + - The kind and nature of any input data. + - What variables should be randomized. + - The schema of the final dataset. + - The structure of any required structured output columns. + - What facets of the output dataset are important to capture. +3. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed. Present the plan to the user and ask if they want any changes before generating a preview. +4. 
**Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md). +5. **Validate** — Run `data-designer validate <script>`. Address any warnings or errors and re-validate until it passes. +6. **Preview** — Run `data-designer preview <script> --save-results` to generate sample records as HTML files. + - Note the sample records directory printed by the `data-designer preview` command. + - Give the user a clickable link: `file:///<sample-records-dir>/sample_records_browser.html` +7. **Iterate** — Ask the user for feedback. Edit the script, re-validate, re-preview, and present the new results. Repeat until they are satisfied. +8. **Finalize** — Once the user is happy, tell them they can run the following command to create the dataset: + - `data-designer create <script> --num-records <N> --dataset-name <name>`. + - Warn the user that generation can take a long time for large record counts (50+). + - Do not run this command yourself — it can take a long time for large datasets and the user should control when it runs.
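The LLM-judge score access rule in `skills/data-designer/SKILL.md` can be sanity-checked with plain Jinja2. This is a minimal sketch, assuming only the output shape the skill describes (each score name maps to a dict with `reasoning` and `score`); the `quality` and `correctness` names are the skill's own example names, not real library output:

```python
from jinja2 import Template

# Hypothetical row shaped as SKILL.md describes an LLMJudgeColumnConfig output:
# each score name maps to {"reasoning": str, "score": int}.
row = {"quality": {"correctness": {"reasoning": "Facts check out.", "score": 4}}}

# Dotted access reaches the numeric score.
print(Template("{{ quality.correctness.score }}").render(**row))  # -> 4

# Stopping one level short renders the whole dict, not the number.
print(Template("{{ quality.correctness }}").render(**row))
```

Note that Jinja2 renders strings; if a downstream expression needs a numeric comparison, apply the built-in `int` filter, e.g. `{{ quality.correctness.score | int }}`.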