docs: add FileSystemSeedReader authoring guide and Markdown recipe #425
Open
eric-tramel wants to merge 5 commits into main
Greptile Summary

This PR adds documentation for authoring `FileSystemSeedReader` plugins. The changed files, with any issues found, are summarized below.

| Filename | Overview |
|---|---|
| docs/assets/recipes/plugin_development/markdown_seed_reader.py | Self-contained recipe implementing MarkdownSectionDirectorySeedReader with 1:N hydration. Logic is sound for the documented sample files; minor concern about how the framework handles an empty-file manifest row that hydrates to zero records. |
| docs/plugins/filesystem_seed_reader.md | New authoring guide covering build_manifest/hydrate_row contract, manifest-based selection semantics, and packaging steps. The guide's output_columns code snippet omits the ClassVar annotation that the recipe file uses correctly, creating a subtle inconsistency for copy-pasting authors. |
| docs/recipes/plugin_development/markdown_seed_reader.md | Short recipe landing page that embeds the Python file via mkdocs snippets. Path reference and download link look correct. |
| docs/plugins/overview.md | Adds FileSystemSeedReader mention to the seed reader implementation list and updates the closing navigation links. Changes are accurate and consistent with the rest of the PR. |
| docs/plugins/example.md | Updates the supported plugin-type count from two to three and adds a cross-link to the new FileSystemSeedReader guide. Small, correct change. |
| docs/recipes/cards.md | Appends the Markdown Section Seed Reader card to the recipes gallery with correct relative links and download path. |
| mkdocs.yml | Adds a new Plugin Development nav section for the recipe and wires in the FileSystemSeedReader Plugins page under the Plugins nav group. Navigation is correctly ordered. |
Sequence Diagram

sequenceDiagram
    participant U as User / Recipe
    participant DD as DataDesigner
    participant R as MarkdownSectionDirectorySeedReader
    participant FS as FileSystem
    U->>DD: preview(config_builder, num_records=N)
    DD->>R: build_manifest(context)
    R->>FS: get_matching_relative_paths("*.md")
    FS-->>R: ["faq.md", "guide.md"]
    R-->>DD: manifest [{relative_path, file_name}, ...]
    note over DD: Apply IndexRange / shuffle on manifest rows
    loop for each selected manifest row
        DD->>R: hydrate_row(manifest_row, context)
        R->>FS: open(relative_path)
        FS-->>R: markdown text
        R->>R: extract_markdown_sections(markdown_text)
        R-->>DD: [section_row_1, section_row_2, ...]
    end
    DD->>DD: flatten hydrated rows, validate output_columns
    DD-->>U: preview.dataset (one row per section)
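
The flow in the diagram can be approximated in plain Python. This is a minimal sketch under stated assumptions, not DataDesigner's implementation: `build_manifest`, `hydrate_row`, and `run_preview` are standalone stand-ins for the real methods, the in-memory `FILES` dict replaces the filesystem, the section splitter only recognizes `# ` headers, and the index-range selection is modelled as an inclusive slice.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical in-memory stand-in for the directory the reader scans.
FILES = {
    "faq.md": "# Q1\nAnswer one\n# Q2\nAnswer two\n",
    "guide.md": "# Intro\nWelcome\n",
}


def build_manifest(files: dict[str, str]) -> list[Row]:
    """Return one cheap manifest row per matched file; contents are not read."""
    return [
        {"relative_path": path, "file_name": path.rsplit("/", 1)[-1]}
        for path in sorted(files)
        if path.endswith(".md")
    ]


def hydrate_row(manifest_row: Row, files: dict[str, str]) -> list[Row]:
    """Fan one manifest row out into one row per '# ' section (simplified)."""
    rows: list[Row] = []
    for line in files[manifest_row["relative_path"]].splitlines():
        if line.startswith("# "):
            rows.append({
                **manifest_row,
                "section_index": len(rows),
                "section_header": line[2:].strip(),
                "section_content": "",
            })
        elif rows:
            # Append body text to the most recently opened section.
            joined = rows[-1]["section_content"] + "\n" + line
            rows[-1]["section_content"] = joined.strip()
    return rows


def run_preview(files: dict[str, str], start: int, end: int) -> list[Row]:
    """Select manifest rows first, then hydrate and flatten the selection."""
    manifest = build_manifest(files)
    selected = manifest[start:end + 1]  # assumed inclusive index range
    return [row for m in selected for row in hydrate_row(m, files)]


rows = run_preview(FILES, 0, 0)
print([r["section_header"] for r in rows])  # ['Q1', 'Q2']
```

Because selection is applied to manifest rows before hydration, selecting a single manifest row (`faq.md`) still yields two output rows, one per section.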
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/plugins/filesystem_seed_reader.md
Line: 37-44
Comment:
**`output_columns` missing `ClassVar` annotation in guide snippet**
The docs guide declares `output_columns` without a type annotation, while the actual recipe file (`docs/assets/recipes/plugin_development/markdown_seed_reader.py`, line 40) correctly uses `ClassVar[list[str]]`. Without `ClassVar`, type-checkers (mypy, pyright) will treat this as a regular instance variable rather than a class variable, and Pydantic may inadvertently include it as a model field. Keeping the guide snippet consistent with the recipe avoids confusion for authors who copy it.
```suggestion
output_columns: ClassVar[list[str]] = [
    "relative_path",
    "file_name",
    "section_index",
    "section_header",
    "section_content",
]
```
How can I resolve this? If you propose a fix, please make it concise.
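
The effect of `ClassVar` can be seen without Pydantic. The sketch below uses the standard-library `dataclasses` module purely as an analogy: dataclasses, like Pydantic models, treat annotated class attributes as per-instance fields unless the annotation is `ClassVar`. The class names are hypothetical and unrelated to the real reader classes.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar


@dataclass
class WithClassVar:
    """Hypothetical reader: output_columns is shared class-level metadata."""

    name: str = "reader"
    output_columns: ClassVar[list[str]] = ["relative_path", "file_name"]


@dataclass
class WithoutClassVar:
    """Same attribute without ClassVar: it becomes a regular instance field."""

    name: str = "reader"
    output_columns: list[str] = field(
        default_factory=lambda: ["relative_path", "file_name"]
    )


# fields() lists only real instance fields; ClassVar attributes are excluded.
field_names_with = [f.name for f in fields(WithClassVar)]
field_names_without = [f.name for f in fields(WithoutClassVar)]

print(field_names_with)     # ['name']
print(field_names_without)  # ['name', 'output_columns']
```

With `ClassVar`, the column list stays on the class and is not part of the constructor or the field set, which is the behavior the suggestion above aims for.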
---
This is a comment left during a code review.
Path: docs/assets/recipes/plugin_development/markdown_seed_reader.py
Line: 108-110
Comment:
**Empty/whitespace-only file produces zero hydrated rows**
When a matched `.md` file is empty or contains only whitespace, `extract_markdown_sections` returns `[]`, which causes `hydrate_row` to return an empty list. Whether the framework silently drops a manifest row that yields zero hydrated records — or raises an error — is not described in this PR, and no test covers the empty-file path.
If the framework does silently skip 0-row hydrations, that could result in unexpected data loss (e.g. a Markdown file quietly disappearing from the seed dataset). Consider either:
- documenting this behavior explicitly in the guide, or
- adding an early guard that returns a single "empty document" row when no sections are found (consistent with the `fallback_header` contract already used for headerless files).
How can I resolve this? If you propose a fix, please make it concise.
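
The second option, an early guard, might look roughly like the following. This is a sketch under assumptions, not the recipe's code: the `extract_markdown_sections` shown here is a simplified hypothetical splitter, `hydrate_with_empty_guard` is an illustrative name, and using the file name as the fallback header is assumed rather than taken from the recipe.

```python
import re
from typing import Any

Row = dict[str, Any]

# Matches ATX headers such as "# Title" or "### Title".
_HEADER_RE = re.compile(r"^#{1,6}\s+(.*)$")


def extract_markdown_sections(markdown_text: str, fallback_header: str) -> list[Row]:
    """Hypothetical splitter: one section per header, fallback for headerless text."""
    sections: list[tuple[str, list[str]]] = []
    for line in markdown_text.splitlines():
        match = _HEADER_RE.match(line)
        if match:
            sections.append((match.group(1).strip(), []))
        elif sections:
            sections[-1][1].append(line)
        elif line.strip():
            # Text before any header opens a section under the fallback header.
            sections.append((fallback_header, [line]))
    return [
        {"section_header": header, "section_content": "\n".join(body).strip()}
        for header, body in sections
    ]


def hydrate_with_empty_guard(manifest_row: Row, markdown_text: str) -> list[Row]:
    """Always return at least one row so an empty file stays visible."""
    fallback_header = manifest_row["file_name"]  # assumption: file name as fallback
    sections = extract_markdown_sections(markdown_text, fallback_header)
    if not sections:
        # Guard: emit a single "empty document" row instead of an empty list.
        sections = [{"section_header": fallback_header, "section_content": ""}]
    return [
        {**manifest_row, "section_index": index, **section}
        for index, section in enumerate(sections)
    ]


empty = hydrate_with_empty_guard(
    {"relative_path": "empty.md", "file_name": "empty.md"}, "   \n\n"
)
print(len(empty))  # 1
```

The trade-off is that downstream consumers then see rows with empty `section_content` and may need to filter them; documenting the zero-row behavior instead avoids that but leaves the file absent from the dataset.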
Last reviewed commit: "Merge branch 'main' ..."
tests_e2e/src/data_designer_e2e_tests/plugins/markdown_seed_reader/config.py (outdated review thread, resolved)
09cf5d5 to 3580fba
Summary

This PR documents how to author `FileSystemSeedReader` plugins on top of the `1:N` hydration contract introduced in #424.

- `FileSystemSeedReader` plugin authoring guide
- `tests_e2e` smoke test for manifest-based selection with fanout hydration

Why

#424 makes `hydrate_row()` capable of returning either one record or many records per manifest row. This follow-up PR explains that contract for plugin authors and gives them a concrete example that splits Markdown files into section rows.

Blocked By

This PR is intentionally stacked on top of `feat/filesystem-seed-reader-fanout` so the review only contains the docs/example changes. After #424 merges, retarget this PR to `main`.

Testing

- `uv run ruff check docs/assets/recipes/plugin_development/markdown_seed_reader tests_e2e/src/data_designer_e2e_tests/plugins/markdown_seed_reader tests_e2e/tests/test_e2e.py`
- `uv run pytest tests/test_e2e.py -k markdown_section_seed_reader_plugin_fanout_respects_manifest_selection` (from `tests_e2e/`)
- `UV_CACHE_DIR=/tmp/uv-cache uv run --group docs mkdocs build --strict` (currently still fails on pre-existing repo-wide docs warnings unrelated to this branch)
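
The one-or-many return contract that this description attributes to `hydrate_row()` can be pictured with a small normalization step. This is a hedged sketch only: `normalize_hydrated` and `flatten` are hypothetical helper names, and how the framework actually handles the two return shapes is defined in #424, not here.

```python
from typing import Any, Union

Record = dict[str, Any]


def normalize_hydrated(result: Union[Record, list[Record]]) -> list[Record]:
    """Wrap a single record in a list; copy a list of records unchanged."""
    if isinstance(result, dict):
        return [result]
    return list(result)


def flatten(results: list[Union[Record, list[Record]]]) -> list[Record]:
    """Combine per-manifest-row results into one flat list of records."""
    return [record for result in results for record in normalize_hydrated(result)]


flat = flatten([{"a": 1}, [{"a": 2}, {"a": 3}]])
print(flat)  # [{'a': 1}, {'a': 2}, {'a': 3}]
```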