docs: add FileSystemSeedReader authoring guide and Markdown recipe by eric-tramel · Pull Request #425 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-03-17T01:50:09Z

Summary

This PR documents how to author FileSystemSeedReader plugins on top of the 1:N hydration contract introduced in #424.

- add a dedicated FileSystemSeedReader plugin authoring guide
- add a new Plugin Development recipe with a runnable Markdown section seed reader scaffold
- add a focused tests_e2e smoke test for manifest-based selection with fanout hydration
- wire the new docs into plugin and recipe navigation

Why

#424 makes hydrate_row() capable of returning either one record or many records per manifest row. This follow-up PR explains that contract for plugin authors and gives them a concrete example that splits Markdown files into section rows.
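To make the one-or-many contract concrete, here is a minimal sketch of how a caller could normalize either return shape into a flat list. The helper `normalize_hydrated` and the two toy hydration functions are hypothetical illustrations and are not taken from the DataDesigner code base.

```python
from __future__ import annotations

from typing import Any

Record = dict[str, Any]


def normalize_hydrated(result: Record | list[Record]) -> list[Record]:
    """Return a list of records whether hydration produced one or many."""
    if isinstance(result, dict):
        return [result]
    return list(result)


def hydrate_one(manifest_row: Record) -> Record:
    # 1:1 case: a single record per manifest row.
    return {**manifest_row, "content": "whole file"}


def hydrate_many(manifest_row: Record) -> list[Record]:
    # 1:N case: several records derived from the same manifest row.
    return [{**manifest_row, "section_index": i} for i in range(3)]


single = normalize_hydrated(hydrate_one({"file_name": "a.md"}))
many = normalize_hydrated(hydrate_many({"file_name": "a.md"}))
```

Both results are plain lists afterwards, so downstream code does not need to care which shape a given reader returned.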

Blocked By

feat: support 1-to-many FileSystemSeedReader hydration #424

This PR is intentionally stacked on top of feat/filesystem-seed-reader-fanout so the review only contains the docs/example changes. After #424 merges, retarget this PR to main.

Testing

- `uv run ruff check docs/assets/recipes/plugin_development/markdown_seed_reader tests_e2e/src/data_designer_e2e_tests/plugins/markdown_seed_reader tests_e2e/tests/test_e2e.py`
- `uv run pytest tests/test_e2e.py -k markdown_section_seed_reader_plugin_fanout_respects_manifest_selection` (from `tests_e2e/`)
- `UV_CACHE_DIR=/tmp/uv-cache uv run --group docs mkdocs build --strict` (currently still fails on pre-existing repo-wide docs warnings unrelated to this branch)

greptile-apps · 2026-03-17T01:54:45Z

Greptile Summary

This PR adds documentation for authoring FileSystemSeedReader plugins on top of the 1:N hydration contract introduced in #424. It introduces a dedicated guide (docs/plugins/filesystem_seed_reader.md), a runnable single-file recipe (docs/assets/recipes/plugin_development/markdown_seed_reader.py) that splits Markdown files into per-section seed rows, and wires both into the MkDocs navigation.

Key changes:

- New FileSystemSeedReader authoring guide covering `build_manifest`, `hydrate_row`, manifest-based selection semantics, and packaging steps
- Self-contained Markdown section seed reader recipe demonstrating the 1:N fanout pattern with `DirectorySeedSource`
- `docs/plugins/example.md` and `docs/plugins/overview.md` updated to cross-link the new guide and reflect the three-plugin-type model
- New Plugin Development nav group added to `mkdocs.yml`
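For readers unfamiliar with the per-section splitting the recipe performs, the following is a rough sketch of one way to do it. It is not the recipe's implementation: the function name `split_markdown_sections`, the regular expression, and the default fallback header text are assumptions made for illustration. It also ignores `#` lines inside fenced code blocks, which a production splitter would need to handle.

```python
from __future__ import annotations

import re

# ATX-style headings only: one to six '#' characters followed by whitespace.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_sections(
    text: str, fallback_header: str = "(untitled)"
) -> list[dict[str, object]]:
    """Split Markdown text into one row per heading-delimited section."""
    # Start with a headerless chunk to collect any text before the first heading.
    chunks: list[tuple[str | None, list[str]]] = [(None, [])]
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            chunks.append((match.group(2).strip(), []))
        else:
            chunks[-1][1].append(line)

    rows: list[dict[str, object]] = []
    for header, body in chunks:
        content = "\n".join(body).strip()
        if header is None and not content:
            continue  # nothing precedes the first heading
        rows.append(
            {
                "section_index": len(rows),
                "section_header": header if header is not None else fallback_header,
                "section_content": content,
            }
        )
    return rows


sample_rows = split_markdown_sections("intro\n\n# A\nalpha\n## B\nbeta\n")
```

Note that a whitespace-only input yields an empty list in this sketch, which is the same edge case raised in the review comments below.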

Issues found:

- The guide's `output_columns` code snippet omits the `ClassVar[list[str]]` type annotation that the recipe file uses correctly; authors copying the snippet may get unexpected Pydantic behavior or type-checker warnings
- An empty or whitespace-only `.md` file causes `hydrate_row` to return `[]`; whether the framework silently drops those manifest rows or raises an error is undocumented and untested, which could lead to silent data loss
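The effect of the `ClassVar` annotation can be illustrated with the standard library alone. The sketch below uses `dataclasses` as a stand-in, an assumption made to keep the example dependency-free; it is neither Pydantic nor the DataDesigner base class, although dataclasses likewise exclude `ClassVar`-annotated attributes from their instance fields. Both class names are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar


@dataclass
class AnnotatedAsField:
    # A plain annotation makes this a per-instance field.
    output_columns: list[str] = field(
        default_factory=lambda: ["relative_path", "file_name"]
    )


@dataclass
class AnnotatedAsClassVar:
    # ClassVar marks this as class-level metadata, not an instance field.
    output_columns: ClassVar[list[str]] = ["relative_path", "file_name"]


field_names = [f.name for f in fields(AnnotatedAsField)]
classvar_names = [f.name for f in fields(AnnotatedAsClassVar)]
```

In this stand-in, `field_names` contains `output_columns` while `classvar_names` is empty, which mirrors the distinction the review comment is concerned about.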

Confidence Score: 4/5

- Safe to merge after addressing the ClassVar annotation inconsistency and clarifying 0-row hydration behavior.
- All seven changed files are documentation and a self-contained recipe script. The recipe logic is correct for the documented sample inputs, navigation wiring is accurate, and mkdocs snippet paths resolve correctly. Two minor issues (a missing ClassVar annotation in a guide snippet and an undocumented, untested empty-file edge case) prevent a perfect score but do not block the PR.
- Files needing attention: `docs/plugins/filesystem_seed_reader.md` (ClassVar annotation in code snippet) and `docs/assets/recipes/plugin_development/markdown_seed_reader.py` (empty-file hydration edge case)

Important Files Changed

| Filename | Overview |
| --- | --- |
| `docs/assets/recipes/plugin_development/markdown_seed_reader.py` | Self-contained recipe implementing MarkdownSectionDirectorySeedReader with 1:N hydration. Logic is sound for the documented sample files; minor concern about how the framework handles an empty-file manifest row that hydrates to zero records. |
| `docs/plugins/filesystem_seed_reader.md` | New authoring guide covering build_manifest/hydrate_row contract, manifest-based selection semantics, and packaging steps. The guide's output_columns code snippet omits the ClassVar annotation that the recipe file uses correctly, creating a subtle inconsistency for copy-pasting authors. |
| `docs/recipes/plugin_development/markdown_seed_reader.md` | Short recipe landing page that embeds the Python file via mkdocs snippets. Path reference and download link look correct. |
| `docs/plugins/overview.md` | Adds FileSystemSeedReader mention to the seed reader implementation list and updates the closing navigation links. Changes are accurate and consistent with the rest of the PR. |
| `docs/plugins/example.md` | Updates the supported plugin-type count from two to three and adds a cross-link to the new FileSystemSeedReader guide. Small, correct change. |
| `docs/recipes/cards.md` | Appends the Markdown Section Seed Reader card to the recipes gallery with correct relative links and download path. |
| `mkdocs.yml` | Adds a new Plugin Development nav section for the recipe and wires in the FileSystemSeedReader Plugins page under the Plugins nav group. Navigation is correctly ordered. |

Sequence Diagram

sequenceDiagram
    participant U as User / Recipe
    participant DD as DataDesigner
    participant R as MarkdownSectionDirectorySeedReader
    participant FS as FileSystem

    U->>DD: preview(config_builder, num_records=N)
    DD->>R: build_manifest(context)
    R->>FS: get_matching_relative_paths("*.md")
    FS-->>R: ["faq.md", "guide.md"]
    R-->>DD: manifest [{relative_path, file_name}, ...]

    note over DD: Apply IndexRange / shuffle on manifest rows

    loop for each selected manifest row
        DD->>R: hydrate_row(manifest_row, context)
        R->>FS: open(relative_path)
        FS-->>R: markdown text
        R->>R: extract_markdown_sections(markdown_text)
        R-->>DD: [section_row_1, section_row_2, ...]
    end

    DD->>DD: flatten hydrated rows, validate output_columns
    DD-->>U: preview.dataset (one row per section)
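The flow in the diagram can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions: `run_fanout`, its `start`/`end` slice bounds (standing in for an index-range selection), and the "every output column must be present" check are hypothetical, and the real framework may select and validate differently.

```python
from __future__ import annotations

from typing import Any, Callable

Record = dict[str, Any]


def run_fanout(
    manifest: list[Record],
    hydrate_row: Callable[[Record], list[Record]],
    output_columns: list[str],
    start: int = 0,
    end: int | None = None,
) -> list[Record]:
    """Select manifest rows first, then hydrate and flatten the results."""
    selected = manifest[start:end]  # selection applies to manifest rows, not sections
    hydrated: list[Record] = []
    for manifest_row in selected:
        for record in hydrate_row(manifest_row):
            missing = [c for c in output_columns if c not in record]
            if missing:
                raise ValueError(f"hydrated record is missing columns: {missing}")
            hydrated.append(record)
    return hydrated


# Toy in-memory "file system": file name -> section headers.
FILES = {"faq.md": ["Q1", "Q2"], "guide.md": ["Intro", "Setup", "Usage"]}
manifest = [{"relative_path": name, "file_name": name} for name in FILES]
columns = ["relative_path", "file_name", "section_index", "section_header"]


def hydrate(row: Record) -> list[Record]:
    headers = FILES[row["file_name"]]
    return [{**row, "section_index": i, "section_header": h} for i, h in enumerate(headers)]


all_rows = run_fanout(manifest, hydrate, columns)
guide_only = run_fanout(manifest, hydrate, columns, start=1, end=2)
```

Selecting one manifest row (`guide.md`) still produces three output rows here, which is the point of manifest-based selection with fanout: the selection counts files, not sections.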

Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/plugins/filesystem_seed_reader.md
Line: 37-44

Comment:
**`output_columns` missing `ClassVar` annotation in guide snippet**

The docs guide declares `output_columns` without a type annotation, while the actual recipe file (`docs/assets/recipes/plugin_development/markdown_seed_reader.py`, line 40) correctly uses `ClassVar[list[str]]`. Without `ClassVar`, type-checkers (mypy, pyright) will treat this as a regular instance variable rather than a class variable, and Pydantic may inadvertently include it as a model field. Keeping the guide snippet consistent with the recipe avoids confusion for authors who copy it.

```suggestion
    output_columns: ClassVar[list[str]] = [
        "relative_path",
        "file_name",
        "section_index",
        "section_header",
        "section_content",
    ]
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/assets/recipes/plugin_development/markdown_seed_reader.py
Line: 108-110

Comment:
**Empty/whitespace-only file produces zero hydrated rows**

When a matched `.md` file is empty or contains only whitespace, `extract_markdown_sections` returns `[]`, which causes `hydrate_row` to return an empty list. Whether the framework silently drops a manifest row that yields zero hydrated records — or raises an error — is not described in this PR, and no test covers the empty-file path.

If the framework does silently skip 0-row hydrations, that could result in unexpected data loss (e.g. a Markdown file quietly disappearing from the seed dataset). Consider either:
- documenting this behavior explicitly in the guide, or
- adding an early guard that returns a single "empty document" row when no sections are found (consistent with the `fallback_header` contract already used for headerless files).
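As a rough illustration of the second option, an early guard could look like the sketch below. The function name `hydrate_with_fallback`, the default header text, and the row shape are assumptions for this example and are not taken from the recipe.

```python
from __future__ import annotations

from typing import Any

Record = dict[str, Any]


def hydrate_with_fallback(
    manifest_row: Record,
    sections: list[Record],
    fallback_header: str = "(empty document)",
) -> list[Record]:
    """Guarantee at least one hydrated row per manifest row."""
    if not sections:
        # Hypothetical placeholder row so an empty file stays visible downstream.
        sections = [
            {
                "section_index": 0,
                "section_header": fallback_header,
                "section_content": "",
            }
        ]
    return [{**manifest_row, **section} for section in sections]


empty_rows = hydrate_with_fallback(
    {"relative_path": "empty.md", "file_name": "empty.md"}, []
)
```

With this guard, an empty file contributes one row with empty content rather than silently vanishing from the seed dataset.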

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Merge branch 'main' ..."


eric-tramel requested a review from a team as a code owner March 17, 2026 01:50

eric-tramel self-assigned this Mar 17, 2026

greptile-apps bot reviewed Mar 17, 2026

tests_e2e/src/data_designer_e2e_tests/plugins/markdown_seed_reader/config.py (outdated)

eric-tramel added 2 commits March 17, 2026 21:01

docs: add FileSystemSeedReader authoring guide and Markdown recipe

f120280

test: drop redundant markdown seed reader e2e coverage

3580fba

eric-tramel force-pushed the docs/filesystem-seed-reader-markdown-recipe branch from 09cf5d5 to 3580fba Compare March 18, 2026 01:14

eric-tramel added 2 commits March 17, 2026 21:30

docs: simplify markdown seed reader recipe

92c7101

docs: clarify markdown seed reader recipe docstrings

0d4628e

eric-tramel changed the base branch from feat/filesystem-seed-reader-fanout to main March 18, 2026 01:42

Merge branch 'main' into docs/filesystem-seed-reader-markdown-recipe

ded2ab2

eric-tramel wants to merge 5 commits into main from docs/filesystem-seed-reader-markdown-recipe
