From d41ae77c0a202abc761e6dc4914909027c845c47 Mon Sep 17 00:00:00 2001 From: mvansegbroeck Date: Mon, 9 Mar 2026 18:10:09 -0700 Subject: [PATCH] Add llms.txt and llms-full.txt for AI discoverability --- docs/llms-full.txt | 378 +++++++++++++++++++++++++++++++++++++++++++++ docs/llms.txt | 77 +++++++++ llms-full.txt | 378 +++++++++++++++++++++++++++++++++++++++++++++ llms.txt | 77 +++++++++ 4 files changed, 910 insertions(+) create mode 100644 docs/llms-full.txt create mode 100644 docs/llms.txt create mode 100644 llms-full.txt create mode 100644 llms.txt diff --git a/docs/llms-full.txt b/docs/llms-full.txt new file mode 100644 index 000000000..b0e66f207 --- /dev/null +++ b/docs/llms-full.txt @@ -0,0 +1,378 @@ +# NeMo Data Designer + +NeMo Data Designer is an open-source framework for generating high-quality synthetic datasets for LLM training, evaluation, and agent workflows. This document describes the official Python SDK (`pip install data-designer`). For NeMo Microservices deployment and Docker Compose quickstarts, refer to the NVIDIA docs. You can declaratively specify columns, relationships, and quality constraints; execution order, batching, parallelization, and validation are handled for you. Over 150B tokens have been generated with Data Designer to date; Python 3.10–3.13 are supported. + +Version: 0.5.2 (see PyPI for latest release) + +Data Designer extends simple LLM prompting with statistical samplers, dependency-aware generation (DAG-based), validators, LLM-as-a-judge scoring, multi-provider LLM support, structured outputs, and tool use via MCP, all defined in a configuration that separates what you want from how it is built.
+ +Repository: https://github.com/NVIDIA-NeMo/DataDesigner +Docs: https://nvidia-nemo.github.io/DataDesigner/latest/ +PyPI: https://pypi.org/project/data-designer/ +License: Apache 2.0 +Python: 3.10, 3.11, 3.12, 3.13 + +--- + +## When to use Data Designer + +Use Data Designer when you need: + +- Synthetic datasets with controlled statistical distributions and realistic field correlations. +- Multi-column datasets with dependencies (e.g., review text conditioned on product metadata or SQL queries on schema). +- Code datasets (Python, SQL, etc.) with linting/format checks and LLM-as-a-judge scoring. +- Synthetic eval sets for LLMs/agents, including chat transcripts, tool calls, and traces. +- Reproducible, configurable generation workflows with seed datasets and postprocessing steps. +- Demographically accurate synthetic personas for testing, simulation, or evaluation. + +Data Designer is *not*: + +- A general-purpose LLM framework (for that, see LangChain or LlamaIndex). +- A data labeling or annotation tool (see Label Studio or Prodigy). +- A data anonymization tool (see ARX, Presidio). +- A purely tabular GAN/VAE (see SDV, CTGAN for that). + +--- + +## Common use cases + +- Text-to-Python/code datasets with linted, validated solutions and per-sample quality scores. +- Text-to-SQL across multiple dialects with validators and execution checks. +- Product/support QA pairs, multi-turn conversations, and assistant eval sets. +- Retrieval QA over PDFs or docs using MCP tools for parsing. +- Synthetic tabular data with realistic inter-field correlations (people, customers, transactions) for testing/benchmarking. +- Synthetic eval sets for agent tool use and traces. + +--- + +## Quick start + +### Installation + +```bash +pip install data-designer +``` + +Or from source: + +```bash +git clone https://github.com/NVIDIA-NeMo/DataDesigner.git +cd DataDesigner +make install +``` + +### Set your API key + +Data Designer supports multiple LLM providers. 
Set one or more: + +```bash +export NVIDIA_API_KEY="your-key" # NVIDIA Build (build.nvidia.com) +export OPENAI_API_KEY="your-key" # OpenAI +export OPENROUTER_API_KEY="your-key" # OpenRouter +``` + +### Generate your first dataset + +The following shows the Python SDK (`data_designer` package). For NeMo Microservices, see their docs for `nemo_microservices.data_designer` usage. + +```python +import data_designer.config as dd +from data_designer.interface import DataDesigner + +# Initialize +data_designer = DataDesigner() +config_builder = dd.DataDesignerConfigBuilder() + +# Add a sampled column +config_builder.add_column( + dd.SamplerColumnConfig( + name="product_category", + sampler_type=dd.SamplerType.CATEGORY, + params=dd.CategorySamplerParams( + values=["Electronics", "Clothing", "Home & Kitchen", "Books"], + ), + ) +) + +# Add an LLM-generated column depending on the sampled column +config_builder.add_column( + dd.LLMTextColumnConfig( + name="review", + model_alias="nvidia-text", + prompt="Write a brief product review for a {{ product_category }} item you recently purchased.", + ) +) + +# Preview a sample +preview = data_designer.preview(config_builder=config_builder) +preview.display_sample_record() + +# Full dataset creation +results = data_designer.create(config_builder=config_builder, num_records=1000) +``` + +### CLI usage + +```bash +data-designer config providers # Configure model providers +data-designer config models # Set up model configs +data-designer config list # View current settings +data-designer preview # Generate preview from config file +data-designer create # Full dataset creation +data-designer validate # Validate configuration +data-designer download personas # Download Nemotron-Personas datasets +``` + +--- + +## Common patterns + +The following examples demonstrate the canonical Data Designer idioms using the real Python SDK API. 
They show Pydantic schema validation, code validation + judge scoring, SQL and other validators, and demographically controlled person sampling. + +### 1. Sampler + LLM text column + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.SamplerColumnConfig( + name="product_category", + sampler_type=dd.SamplerType.CATEGORY, + params=dd.CategorySamplerParams(values=["Electronics", "Books", "Clothing", "Home"]), + ) +) +builder.add_column( + dd.LLMTextColumnConfig( + name="review", + model_alias="nvidia-text", + prompt="Write a short product review for a {{ product_category }} item.", + ) +) +``` + +### 2. Structured output: LLMStructuredColumnConfig with Pydantic + +```python +from pydantic import BaseModel, Field +import data_designer.config as dd + +class ProductInfo(BaseModel): + name: str = Field(..., min_length=1, max_length=50) + brand: str + price: float + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMStructuredColumnConfig( + name="product_summary", + model_alias="nvidia-text", + prompt="Generate a JSON product summary with fields: name, brand, price.", + output_schema=ProductInfo, + ) +) +``` + +### 3. 
Code generation with validator and judge scoring + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMCodeColumnConfig( + name="solution_code", + code_lang=dd.CodeLang.PYTHON, + model_alias="nvidia-code", + prompt="Write a Python function that computes the nth Fibonacci number.", + ) +) +builder.add_column( + dd.ValidationColumnConfig( + name="code_lint_result", + validator_type=dd.ValidatorType.CODE, + source_column="solution_code", + params=dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON), + ) +) +builder.add_column( + dd.LLMJudgeColumnConfig( + name="code_quality", + model_alias="nvidia-text", + prompt="Rate the quality of this Python solution:\n\n{{ solution_code }}", + scores=[ + dd.Score(name="correctness", description="How correct is the solution?", min_score=1, max_score=5), + dd.Score(name="style", description="How readable and idiomatic is the code?", min_score=1, max_score=5), + dd.Score(name="efficiency", description="How efficient is the algorithm?", min_score=1, max_score=5), + ], + ) +) +``` + +### 4. Text-to-SQL generation with SQL validation + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMCodeColumnConfig( + name="query_sql", + code_lang=dd.CodeLang.SQL, + model_alias="nvidia-code", + prompt="Write a Postgres SQL query to select all orders from the last 7 days.", + ) +) +builder.add_column( + dd.ValidationColumnConfig( + name="sql_check", + validator_type=dd.ValidatorType.CODE, + source_column="query_sql", + params=dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL), + ) +) +``` + +### 5. 
Person sampling with demographic control (Nemotron-Personas) + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.SamplerColumnConfig( + name="person", + sampler_type=dd.SamplerType.PERSON, + params=dd.PersonSamplerParams( + locale="en_US", + info_types=[ + dd.InfoType.FIRST_NAME, + dd.InfoType.LAST_NAME, + dd.InfoType.AGE, + dd.InfoType.OCCUPATION, + dd.InfoType.EMAIL, + ], + ), + ) +) +# For quick Faker-based generation instead, use SamplerType.PERSON_FROM_FAKER +# with PersonFromFakerSamplerParams. Nemotron-Personas (above) provides +# demographically accurate distributions across 7 locales: en_US, en_IN, +# en_SG, hi_Deva_IN, hi_Latn_IN, ja_JP, pt_BR. +``` + +--- + +## Architecture + +Data Designer is a monorepo with three layers: + +Layer: Config +Package: data-designer-config +Purpose: User-facing configuration API (minimal dependencies) + +Layer: Engine +Package: data-designer-engine +Purpose: Execution engine (LLM integration, DAG management, validation, profiling) + +Layer: Interface +Package: data-designer +Purpose: Public API, CLI, entry point (depends on config + engine) + +### Key design patterns + +- Builder pattern: configure via DataDesignerConfigBuilder (add_column, add_constraint, with_seed_dataset, with_processors) +- DAG-based execution: column dependencies inferred via Jinja2 template refs like `{{ product_category }}` +- Registry/plugin: pluggable column generators, validators, profilers, processors, seed readers +- Strategy pattern: separate handlers (sampler, LLM, expression, seed), dispatched by column type + +### Execution flow + +1. Define columns/constraints with DataDesignerConfigBuilder. +2. Engine builds dependency DAG from column references. +3. Columns generated in topological order with batching/parallelization. +4. Validators run (can gate/score outputs). +5. Results collected with metadata, profiling, traces, artifacts. 
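
The dependency inference described in the execution flow can be sketched with the Python standard library alone. The snippet below is an illustrative approximation, not the engine's actual implementation: it extracts `{{ column }}` references from prompt templates with a regex and orders columns with `graphlib`. The column names and prompts are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column configs: one sampler column plus two LLM columns whose
# prompts reference earlier columns via Jinja2-style templates.
prompts = {
    "product_category": "",  # sampler column: no template references
    "review": "Write a review for a {{ product_category }} item.",
    "review_summary": "Summarize this review: {{ review }}",
}

def referenced_columns(template: str) -> set[str]:
    """Extract {{ name }} references from a Jinja2-style template."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))

# Map each column to the columns it depends on, then topologically sort
# so that every column is generated after its dependencies.
graph = {name: referenced_columns(t) for name, t in prompts.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)  # dependencies first: category, then review, then summary
```

Because the dependencies here form a chain, only one valid order exists; with independent columns the engine is free to batch and parallelize them.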
+ +--- + +## Column types (high level) + +Data Designer supports 13+ column types; common ones include: + +- SamplerColumnConfig: statistical samplers (UUID, category, Gaussian, Bernoulli, Poisson, datetime, person (Nemotron-Personas), person_from_faker (Faker), and more) +- LLMTextColumnConfig: free-form text (Jinja2 prompts, system prompts, traces) +- LLMCodeColumnConfig: code generation for Python, JS, Java, Go, Rust, SQL, etc. +- LLMStructuredColumnConfig: JSON-structured output, validated against Pydantic/JSON schema +- LLMJudgeColumnConfig: LLM-as-a-judge scoring (multi-dimensional rubric) +- ImageColumnConfig: diffusion/autoregressive image generation from prompt/context +- EmbeddingColumnConfig: vector embeddings from text columns +- ExpressionColumnConfig: derived columns via Jinja2 expressions +- ValidationColumnConfig: validators (code, SQL, HTTP, custom callables) +- SeedDatasetColumnConfig: seed-based generation from CSV/Parquet/JSON, Hugging Face datasets, or DataFrames +- CustomColumnConfig: user-defined via custom_column_generator decorator + +Full column concept docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/ + +--- + +## Models and providers + +- ModelProvider: defines the backend endpoint (NVIDIA Build, OpenAI, OpenRouter, or any LiteLLM-compatible server) +- ModelConfig: model name/alias, inference params (temperature, top_p, max_tokens, etc.) +- Distribution-based params let you sample temperature and other options to boost output diversity + +Default providers: NVIDIA Build (`NVIDIA_API_KEY`), OpenAI (`OPENAI_API_KEY`), OpenRouter (`OPENROUTER_API_KEY`). Any LiteLLM-compatible endpoint can be registered. + +The Python and CLI examples above use the standalone package; for Microservices, see the `nemo_microservices` docs.
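
As a rough illustration of distribution-based inference params, the stdlib sketch below samples per-call settings instead of fixing them. This is not the `ModelConfig` API; the parameter names and ranges here are invented for the sketch.

```python
import random

# Illustrative only: Data Designer configures this via ModelConfig
# distribution params; these ranges are made-up assumptions.
rng = random.Random(7)  # seeded for reproducibility

def sample_inference_params() -> dict:
    """Draw per-call inference settings instead of fixing them globally."""
    return {
        "temperature": round(rng.uniform(0.5, 1.0), 2),  # sampled each call
        "top_p": round(rng.uniform(0.9, 1.0), 3),        # sampled each call
        "max_tokens": 512,                      # fixed params can coexist
    }

# Each generation call gets slightly different decoding settings,
# which tends to increase diversity across generated records.
params = [sample_inference_params() for _ in range(3)]
for p in params:
    print(p)
```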
+ +More model docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/ + +--- + +## Constraints, validation, and MCP + +- Constraints: scalar and column inequality (e.g., salary > 0, max_salary > min_salary) +- Validators: code, SQL, remote, and local callables as DAG columns +- MCP & tool use: config for tool-calling by LLM columns (e.g., file reads, API queries) + +Docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/tool_use_and_mcp/ + +--- + +## Traces, processors, and results + +- Traces: none, last message, or full conversation (each as columns) +- Processors: post-gen transforms (e.g., drop intermediate columns, schema transforms) +- Results: preview/full create APIs return generated records, metadata, profiles, traces, artifacts + +--- + +## Plugins + +Plugin architecture for: + +- Column generators +- Validators +- Profilers +- Processors +- Seed readers + +Plugins discovered via Python entrypoints. See: +https://nvidia-nemo.github.io/DataDesigner/latest/plugins/overview/ +https://nvidia-nemo.github.io/DataDesigner/latest/plugins/available/ + +--- + +## Telemetry + +Data Designer collects anonymous telemetry (model names, token counts only; no user/device IDs). + +To disable: + +```bash +export NEMO_TELEMETRY_ENABLED=false +``` \ No newline at end of file diff --git a/docs/llms.txt b/docs/llms.txt new file mode 100644 index 000000000..73e3c7982 --- /dev/null +++ b/docs/llms.txt @@ -0,0 +1,77 @@ +# NeMo Data Designer + +NeMo Data Designer is an open-source framework for generating high-quality synthetic datasets for LLM training, evaluation, and agent workflows. This description refers to the Python package installed with `pip install data-designer`; see the NVIDIA NeMo microservices docs for the hosted deployment API. 
It combines statistical sampling, DAG-based dependency handling, validation, structured outputs, and tool-augmented generation so you can declaratively specify and reproducibly generate the data you want at scale. + +Use Data Designer when you need multi-column synthetic data where fields depend on each other (e.g., product reviews conditioned on product metadata, text-to-SQL pairs, code solutions with tests and lint results, multi-turn chat transcripts, or QA pairs grounded in documents and tool outputs). It is designed for constructing datasets for evaluation, fine-tuning, RAG and retrieval QA, tool/agent training, and regression testing. + +Install with `pip install data-designer`. Requires an API key for NVIDIA Build, OpenAI, or OpenRouter. + +## What it does (for agents and tools) + +- Generates synthetic tabular, text, code, chat, and image data with tunable statistical distributions and realistic correlations between columns. +- Uses a DAG-based engine to resolve column dependencies automatically from Jinja2-style references like `{{ product_category }}`. +- Supports validation via Python (Ruff), SQL (SQLFluff), remote HTTP validators, and custom callables, plus LLM-as-a-judge scoring columns. +- Captures traces of LLM calls (including message history) alongside outputs for debugging, inspection, and analysis. +- Integrates with the Model Context Protocol (MCP) so LLM-generated columns can call external tools (e.g., file readers, HTTP APIs) during generation. + +## Core concepts + +- Column types: sampler, LLM text, LLM code, LLM structured (JSON/Pydantic), LLM judge, image, embedding, expression, validation, seed-based, and custom generators. +- Seed datasets: bootstrap from CSV, Parquet, JSON, Hugging Face datasets, or pandas DataFrames. +- Validators: configure code, SQL, remote HTTP, and local callable validators as columns in the same configuration graph. 
+- Person sampling: generate demographically accurate synthetic personas (using Nemotron-Personas, 7+ locales) or Faker-based person data. +- Traces: opt in to capturing partial or full LLM message history as sidecar columns. +- Processors: apply post-generation transformations like dropping intermediate columns or renaming fields. + +## Models and providers + +- The package ships with default model providers for NVIDIA Build, OpenAI, and OpenRouter; any LiteLLM-compatible endpoint can also be configured as a custom provider. +- Model configuration is separate from dataset configuration: you define `ModelProvider` objects (URLs, API keys) and `ModelConfig` objects (model IDs, inference params). +- Inference parameters such as temperature, top_p, and max_tokens can be fixed or sampled from distributions to control diversity. + +## MCP and tool-augmented generation + +- MCP providers: configure local or remote MCP servers for tool discovery. +- Tool configs: choose which MCP tools are visible to a given LLM column. +- Safety and limits: restrict which tools can be called, how often, and with what arguments. + +## Common use cases people search for + +- Text-to-Python / text-to-code datasets with linted, validated solutions and per-sample quality scores. +- Text-to-SQL datasets across multiple SQL dialects with validators and execution checks. +- Product and support QA pairs, multi-turn chat conversations, and assistant evaluation sets. +- Retrieval QA over PDFs and other documents using MCP tools for retrieval and parsing. +- Synthetic tabular datasets with realistic correlations (e.g., people, customers, transactions) for testing and benchmarking. +- Synthetic eval sets for agents that need tool calls and traces. + +## Tutorials + +- The Basics: install, configure, and generate your first dataset with samplers and LLM columns. +- Structured outputs and Jinja expressions: JSON schema–validated generation and expression columns. 
+- Seeding with a dataset: generate synthetic variations from existing data. +- Images as context, image generation, and image-to-image editing: multimodal generation and editing workflows. + +Full tutorials: https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/ + +## Recipes + +- Text-to-Python and text-to-SQL code generation with validation. +- Product info QA and multi-turn chat dataset generation. +- Basic MCP and PDF QA recipes for tool-augmented generation. + +Recipes: https://nvidia-nemo.github.io/DataDesigner/latest/recipes/ + +## Code reference and plugins + +- Config builder, column configs, sampler parameters, models, validators, processors, MCP integration, and analysis utilities are all documented in the code reference. +- Plugin system for column generators, validators, profilers, processors, and seed readers. + +Code reference: https://nvidia-nemo.github.io/DataDesigner/latest/code_reference/ +Plugins: https://nvidia-nemo.github.io/DataDesigner/latest/plugins/available/ + +## Project links + +- Documentation: https://nvidia-nemo.github.io/DataDesigner/latest/ +- GitHub: https://github.com/NVIDIA-NeMo/DataDesigner +- PyPI: https://pypi.org/project/data-designer/ +- Deployment options: https://nvidia-nemo.github.io/DataDesigner/latest/concepts/deployment-options/ diff --git a/llms-full.txt b/llms-full.txt new file mode 100644 index 000000000..b0e66f207 --- /dev/null +++ b/llms-full.txt @@ -0,0 +1,378 @@ +# NeMo Data Designer + +NeMo Data Designer is an open-source framework for generating high-quality synthetic datasets for LLM training, evaluation, and agent workflows. This document describes the official Python SDK (`pip install data-designer`). For NeMo Microservices deployment and Docker Compose quickstarts, refer to the NVIDIA docs. You can declaratively specify columns, relationships, and quality constraints; execution order, batching, parallelization, and validation are handled for you. 
Over 150B tokens have been generated with Data Designer to date; Python 3.10–3.13 are supported. + +Version: 0.5.2 (see PyPI for latest release) + +Data Designer extends simple LLM prompting with statistical samplers, dependency-aware generation (DAG-based), validators, LLM-as-a-judge scoring, multi-provider LLM support, structured outputs, and tool use via MCP, all defined in a configuration that separates what you want from how it is built. + +Repository: https://github.com/NVIDIA-NeMo/DataDesigner +Docs: https://nvidia-nemo.github.io/DataDesigner/latest/ +PyPI: https://pypi.org/project/data-designer/ +License: Apache 2.0 +Python: 3.10, 3.11, 3.12, 3.13 + +--- + +## When to use Data Designer + +Use Data Designer when you need: + +- Synthetic datasets with controlled statistical distributions and realistic field correlations. +- Multi-column datasets with dependencies (e.g., review text conditioned on product metadata or SQL queries on schema). +- Code datasets (Python, SQL, etc.) with linting/format checks and LLM-as-a-judge scoring. +- Synthetic eval sets for LLMs/agents, including chat transcripts, tool calls, and traces. +- Reproducible, configurable generation workflows with seed datasets and postprocessing steps. +- Demographically accurate synthetic personas for testing, simulation, or evaluation. + +Data Designer is *not*: + +- A general-purpose LLM framework (for that, see LangChain or LlamaIndex). +- A data labeling or annotation tool (see Label Studio or Prodigy). +- A data anonymization tool (see ARX, Presidio). +- A purely tabular GAN/VAE (see SDV, CTGAN for that). + +--- + +## Common use cases + +- Text-to-Python/code datasets with linted, validated solutions and per-sample quality scores. +- Text-to-SQL across multiple dialects with validators and execution checks. +- Product/support QA pairs, multi-turn conversations, and assistant eval sets. +- Retrieval QA over PDFs or docs using MCP tools for parsing.
+- Synthetic tabular data with realistic inter-field correlations (people, customers, transactions) for testing/benchmarking. +- Synthetic eval sets for agent tool use and traces. + +--- + +## Quick start + +### Installation + +```bash +pip install data-designer +``` + +Or from source: + +```bash +git clone https://github.com/NVIDIA-NeMo/DataDesigner.git +cd DataDesigner +make install +``` + +### Set your API key + +Data Designer supports multiple LLM providers. Set one or more: + +```bash +export NVIDIA_API_KEY="your-key" # NVIDIA Build (build.nvidia.com) +export OPENAI_API_KEY="your-key" # OpenAI +export OPENROUTER_API_KEY="your-key" # OpenRouter +``` + +### Generate your first dataset + +The following shows the Python SDK (`data_designer` package). For NeMo Microservices, see their docs for `nemo_microservices.data_designer` usage. + +```python +import data_designer.config as dd +from data_designer.interface import DataDesigner + +# Initialize +data_designer = DataDesigner() +config_builder = dd.DataDesignerConfigBuilder() + +# Add a sampled column +config_builder.add_column( + dd.SamplerColumnConfig( + name="product_category", + sampler_type=dd.SamplerType.CATEGORY, + params=dd.CategorySamplerParams( + values=["Electronics", "Clothing", "Home & Kitchen", "Books"], + ), + ) +) + +# Add an LLM-generated column depending on the sampled column +config_builder.add_column( + dd.LLMTextColumnConfig( + name="review", + model_alias="nvidia-text", + prompt="Write a brief product review for a {{ product_category }} item you recently purchased.", + ) +) + +# Preview a sample +preview = data_designer.preview(config_builder=config_builder) +preview.display_sample_record() + +# Full dataset creation +results = data_designer.create(config_builder=config_builder, num_records=1000) +``` + +### CLI usage + +```bash +data-designer config providers # Configure model providers +data-designer config models # Set up model configs +data-designer config list # View current settings 
+data-designer preview # Generate preview from config file +data-designer create # Full dataset creation +data-designer validate # Validate configuration +data-designer download personas # Download Nemotron-Personas datasets +``` + +--- + +## Common patterns + +The following examples demonstrate the canonical Data Designer idioms using the real Python SDK API. They show Pydantic schema validation, code validation + judge scoring, SQL and other validators, and demographically controlled person sampling. + +### 1. Sampler + LLM text column + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.SamplerColumnConfig( + name="product_category", + sampler_type=dd.SamplerType.CATEGORY, + params=dd.CategorySamplerParams(values=["Electronics", "Books", "Clothing", "Home"]), + ) +) +builder.add_column( + dd.LLMTextColumnConfig( + name="review", + model_alias="nvidia-text", + prompt="Write a short product review for a {{ product_category }} item.", + ) +) +``` + +### 2. Structured output: LLMStructuredColumnConfig with Pydantic + +```python +from pydantic import BaseModel, Field +import data_designer.config as dd + +class ProductInfo(BaseModel): + name: str = Field(..., min_length=1, max_length=50) + brand: str + price: float + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMStructuredColumnConfig( + name="product_summary", + model_alias="nvidia-text", + prompt="Generate a JSON product summary with fields: name, brand, price.", + output_schema=ProductInfo, + ) +) +``` + +### 3. 
Code generation with validator and judge scoring + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMCodeColumnConfig( + name="solution_code", + code_lang=dd.CodeLang.PYTHON, + model_alias="nvidia-code", + prompt="Write a Python function that computes the nth Fibonacci number.", + ) +) +builder.add_column( + dd.ValidationColumnConfig( + name="code_lint_result", + validator_type=dd.ValidatorType.CODE, + source_column="solution_code", + params=dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON), + ) +) +builder.add_column( + dd.LLMJudgeColumnConfig( + name="code_quality", + model_alias="nvidia-text", + prompt="Rate the quality of this Python solution:\n\n{{ solution_code }}", + scores=[ + dd.Score(name="correctness", description="How correct is the solution?", min_score=1, max_score=5), + dd.Score(name="style", description="How readable and idiomatic is the code?", min_score=1, max_score=5), + dd.Score(name="efficiency", description="How efficient is the algorithm?", min_score=1, max_score=5), + ], + ) +) +``` + +### 4. Text-to-SQL generation with SQL validation + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.LLMCodeColumnConfig( + name="query_sql", + code_lang=dd.CodeLang.SQL, + model_alias="nvidia-code", + prompt="Write a Postgres SQL query to select all orders from the last 7 days.", + ) +) +builder.add_column( + dd.ValidationColumnConfig( + name="sql_check", + validator_type=dd.ValidatorType.CODE, + source_column="query_sql", + params=dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL), + ) +) +``` + +### 5. 
Person sampling with demographic control (Nemotron-Personas) + +```python +import data_designer.config as dd + +builder = dd.DataDesignerConfigBuilder() +builder.add_column( + dd.SamplerColumnConfig( + name="person", + sampler_type=dd.SamplerType.PERSON, + params=dd.PersonSamplerParams( + locale="en_US", + info_types=[ + dd.InfoType.FIRST_NAME, + dd.InfoType.LAST_NAME, + dd.InfoType.AGE, + dd.InfoType.OCCUPATION, + dd.InfoType.EMAIL, + ], + ), + ) +) +# For quick Faker-based generation instead, use SamplerType.PERSON_FROM_FAKER +# with PersonFromFakerSamplerParams. Nemotron-Personas (above) provides +# demographically accurate distributions across 7 locales: en_US, en_IN, +# en_SG, hi_Deva_IN, hi_Latn_IN, ja_JP, pt_BR. +``` + +--- + +## Architecture + +Data Designer is a monorepo with three layers: + +Layer: Config +Package: data-designer-config +Purpose: User-facing configuration API (minimal dependencies) + +Layer: Engine +Package: data-designer-engine +Purpose: Execution engine (LLM integration, DAG management, validation, profiling) + +Layer: Interface +Package: data-designer +Purpose: Public API, CLI, entry point (depends on config + engine) + +### Key design patterns + +- Builder pattern: configure via DataDesignerConfigBuilder (add_column, add_constraint, with_seed_dataset, with_processors) +- DAG-based execution: column dependencies inferred via Jinja2 template refs like `{{ product_category }}` +- Registry/plugin: pluggable column generators, validators, profilers, processors, seed readers +- Strategy pattern: separate handlers (sampler, LLM, expression, seed), dispatched by column type + +### Execution flow + +1. Define columns/constraints with DataDesignerConfigBuilder. +2. Engine builds dependency DAG from column references. +3. Columns generated in topological order with batching/parallelization. +4. Validators run (can gate/score outputs). +5. Results collected with metadata, profiling, traces, artifacts. 
+ +--- + +## Column types (high level) + +Data Designer supports 13+ column types; common ones include: + +- SamplerColumnConfig: statistical samplers (UUID, category, Gaussian, Bernoulli, Poisson, datetime, person (Nemotron-Personas), person_from_faker (Faker), and more) +- LLMTextColumnConfig: free-form text (Jinja2 prompts, system prompts, traces) +- LLMCodeColumnConfig: code generation for Python, JS, Java, Go, Rust, SQL, etc. +- LLMStructuredColumnConfig: JSON-structured output, validated against Pydantic/JSON schema +- LLMJudgeColumnConfig: LLM-as-a-judge scoring (multi-dimensional rubric) +- ImageColumnConfig: diffusion/autoregressive image generation from prompt/context +- EmbeddingColumnConfig: vector embeddings from text columns +- ExpressionColumnConfig: derived columns via Jinja2 expressions +- ValidationColumnConfig: validators (code, SQL, HTTP, custom callables) +- SeedDatasetColumnConfig: seed-based generation from CSV/Parquet/JSON, Hugging Face datasets, or DataFrames +- CustomColumnConfig: user-defined via custom_column_generator decorator + +Full column concept docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/ + +--- + +## Models and providers + +- ModelProvider: defines the backend endpoint (NVIDIA Build, OpenAI, OpenRouter, or any LiteLLM-compatible server) +- ModelConfig: model name/alias, inference params (temperature, top_p, max_tokens, etc.) +- Distribution-based params let you sample temperature and other options to boost output diversity + +Default providers: NVIDIA Build (`NVIDIA_API_KEY`), OpenAI (`OPENAI_API_KEY`), OpenRouter (`OPENROUTER_API_KEY`). Any LiteLLM-compatible endpoint can be registered. + +The Python and CLI examples above use the standalone package; for Microservices, see the `nemo_microservices` docs.
+ +More model docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/ + +--- + +## Constraints, validation, and MCP + +- Constraints: scalar and column inequality (e.g., salary > 0, max_salary > min_salary) +- Validators: code, SQL, remote, and local callables as DAG columns +- MCP & tool use: config for tool-calling by LLM columns (e.g., file reads, API queries) + +Docs: +https://nvidia-nemo.github.io/DataDesigner/latest/concepts/tool_use_and_mcp/ + +--- + +## Traces, processors, and results + +- Traces: none, last message, or full conversation (each as columns) +- Processors: post-gen transforms (e.g., drop intermediate columns, schema transforms) +- Results: preview/full create APIs return generated records, metadata, profiles, traces, artifacts + +--- + +## Plugins + +Plugin architecture for: + +- Column generators +- Validators +- Profilers +- Processors +- Seed readers + +Plugins discovered via Python entrypoints. See: +https://nvidia-nemo.github.io/DataDesigner/latest/plugins/overview/ +https://nvidia-nemo.github.io/DataDesigner/latest/plugins/available/ + +--- + +## Telemetry + +Data Designer collects anonymous telemetry (model names, token counts only; no user/device IDs). + +To disable: + +```bash +export NEMO_TELEMETRY_ENABLED=false +``` \ No newline at end of file diff --git a/llms.txt b/llms.txt new file mode 100644 index 000000000..73e3c7982 --- /dev/null +++ b/llms.txt @@ -0,0 +1,77 @@ +# NeMo Data Designer + +NeMo Data Designer is an open-source framework for generating high-quality synthetic datasets for LLM training, evaluation, and agent workflows. This description refers to the Python package installed with `pip install data-designer`; see the NVIDIA NeMo microservices docs for the hosted deployment API. It combines statistical sampling, DAG-based dependency handling, validation, structured outputs, and tool-augmented generation so you can declaratively specify and reproducibly generate the data you want at scale. 
+ +Use Data Designer when you need multi-column synthetic data where fields depend on each other (e.g., product reviews conditioned on product metadata, text-to-SQL pairs, code solutions with tests and lint results, multi-turn chat transcripts, or QA pairs grounded in documents and tool outputs). It is designed for constructing datasets for evaluation, fine-tuning, RAG and retrieval QA, tool/agent training, and regression testing. + +Install with `pip install data-designer`. Requires an API key for NVIDIA Build, OpenAI, or OpenRouter. + +## What it does (for agents and tools) + +- Generates synthetic tabular, text, code, chat, and image data with tunable statistical distributions and realistic correlations between columns. +- Uses a DAG-based engine to resolve column dependencies automatically from Jinja2-style references like `{{ product_category }}`. +- Supports validation via Python (Ruff), SQL (SQLFluff), remote HTTP validators, and custom callables, plus LLM-as-a-judge scoring columns. +- Captures traces of LLM calls (including message history) alongside outputs for debugging, inspection, and analysis. +- Integrates with the Model Context Protocol (MCP) so LLM-generated columns can call external tools (e.g., file readers, HTTP APIs) during generation. + +## Core concepts + +- Column types: sampler, LLM text, LLM code, LLM structured (JSON/Pydantic), LLM judge, image, embedding, expression, validation, seed-based, and custom generators. +- Seed datasets: bootstrap from CSV, Parquet, JSON, Hugging Face datasets, or pandas DataFrames. +- Validators: configure code, SQL, remote HTTP, and local callable validators as columns in the same configuration graph. +- Person sampling: generate demographically accurate synthetic personas (using Nemotron-Personas, 7+ locales) or Faker-based person data. +- Traces: opt in to capturing partial or full LLM message history as sidecar columns. 
+- Processors: apply post-generation transformations like dropping intermediate columns or renaming fields. + +## Models and providers + +- The package ships with default model providers for NVIDIA Build, OpenAI, and OpenRouter; any LiteLLM-compatible endpoint can also be configured as a custom provider. +- Model configuration is separate from dataset configuration: you define `ModelProvider` objects (URLs, API keys) and `ModelConfig` objects (model IDs, inference params). +- Inference parameters such as temperature, top_p, and max_tokens can be fixed or sampled from distributions to control diversity. + +## MCP and tool-augmented generation + +- MCP providers: configure local or remote MCP servers for tool discovery. +- Tool configs: choose which MCP tools are visible to a given LLM column. +- Safety and limits: restrict which tools can be called, how often, and with what arguments. + +## Common use cases people search for + +- Text-to-Python / text-to-code datasets with linted, validated solutions and per-sample quality scores. +- Text-to-SQL datasets across multiple SQL dialects with validators and execution checks. +- Product and support QA pairs, multi-turn chat conversations, and assistant evaluation sets. +- Retrieval QA over PDFs and other documents using MCP tools for retrieval and parsing. +- Synthetic tabular datasets with realistic correlations (e.g., people, customers, transactions) for testing and benchmarking. +- Synthetic eval sets for agents that need tool calls and traces. + +## Tutorials + +- The Basics: install, configure, and generate your first dataset with samplers and LLM columns. +- Structured outputs and Jinja expressions: JSON schema–validated generation and expression columns. +- Seeding with a dataset: generate synthetic variations from existing data. +- Images as context, image generation, and image-to-image editing: multimodal generation and editing workflows. 
+ +Full tutorials: https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/ + +## Recipes + +- Text-to-Python and text-to-SQL code generation with validation. +- Product info QA and multi-turn chat dataset generation. +- Basic MCP and PDF QA recipes for tool-augmented generation. + +Recipes: https://nvidia-nemo.github.io/DataDesigner/latest/recipes/ + +## Code reference and plugins + +- Config builder, column configs, sampler parameters, models, validators, processors, MCP integration, and analysis utilities are all documented in the code reference. +- Plugin system for column generators, validators, profilers, processors, and seed readers. + +Code reference: https://nvidia-nemo.github.io/DataDesigner/latest/code_reference/ +Plugins: https://nvidia-nemo.github.io/DataDesigner/latest/plugins/available/ + +## Project links + +- Documentation: https://nvidia-nemo.github.io/DataDesigner/latest/ +- GitHub: https://github.com/NVIDIA-NeMo/DataDesigner +- PyPI: https://pypi.org/project/data-designer/ +- Deployment options: https://nvidia-nemo.github.io/DataDesigner/latest/concepts/deployment-options/