Merged
31 changes: 21 additions & 10 deletions CLAUDE.md
@@ -69,10 +69,11 @@ Options:
 
 - `--output-file`/`-o` (path) — write output to a file instead of stdout
 - `--merge-file`/`-M` (path) — deep-merge a YAML file into the generated
-  schema; values from the file win on conflict; no field filtering applied
+  schema; values from the file win on conflict; the result is validated
+  against the LinkML meta schema
 - `--overlay-file`/`-O` (path) — shallow-merge a YAML file into the
-  generated schema; only `SchemaDefinition` fields are applied; unknown
-  keys are skipped with a warning
+  generated schema; the result is validated against the LinkML meta
+  schema
 - `--log-level`/`-l` (default: WARNING)
 
 ## Architecture
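The `-M` deep-merge semantics described above ("values from the file win on conflict", with nested mappings merged recursively) can be sketched with a hand-rolled helper. This is an illustration of the behavior only — the actual implementation delegates to the `deepmerge` package's `always_merger` — and the schema dicts are invented for the example.

```python
# Hand-rolled sketch of the -M deep-merge semantics; the real code uses
# deepmerge's always_merger, but the conflict rule is the same.
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into mappings
        else:
            merged[key] = value  # merge-file value replaces the generated one
    return merged

generated = {"name": "gen", "classes": {"A": {"description": "generated"}}}
merge_file = {"classes": {"A": {"description": "curated"}, "B": {}}}
result = deep_merge(generated, merge_file)
# result["classes"] holds both "A" (description now "curated") and "B";
# untouched top-level keys such as "name" survive.
```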
@@ -91,13 +92,20 @@ Options:
   resolution context, field name, `FieldInfo`, and owning model
 - `resolve_ref_schema()` — resolves `definition-ref` and `definitions`
   schema types to concrete schemas
+- `canonicalize_schema_yml(yml)` — round-trips a YAML string through
+  `SchemaDefinition` for canonical key ordering, then validates the
+  result against the LinkML meta schema via `linkml.validator`
+  (raises `InvalidLinkMLSchemaError` on unknown fields or wrong-type
+  values); the meta-schema validator is lazily initialized and cached
+  via `_get_meta_schema_validator()`
 - `apply_schema_overlay(schema_yml, overlay_file)` — shallow-merges a
-  YAML file into a schema YAML string; restricts keys to
-  `SchemaDefinition` fields
+  YAML file into a schema YAML string; no field filtering; calls
+  `canonicalize_schema_yml` to reorder keys and validate the result
 - `apply_yaml_deep_merge(schema_yml, merge_file)` — deep-merges a YAML
-  file into a schema YAML string using `deepmerge`; no field filtering
-- `remove_schema_key_duplication(yml)` — strips redundant `name`/`text`
-  fields from serialized LinkML YAML
+  file into a schema YAML string using `deepmerge`; calls
+  `canonicalize_schema_yml` to reorder keys and validate the result
+- `remove_schema_key_duplication(yml)` — strips redundant `name`/`text`/
+  `prefix_prefix` fields from serialized LinkML YAML
 - `add_section_breaks(yml)` — inserts blank lines before top-level
   sections
 
@@ -109,8 +117,8 @@ Options:
 
 3. **`cli/`** — Typer-based CLI wrapping `translate_defs`; `cli/__init__.py`
    defines the `app` and `main` command. After translation the pipeline is:
-   dump YAML → `remove_schema_key_duplication` → optional `-M` deep merge
-   → optional `-O` overlay → `add_section_breaks` → output.
+   dump YAML → optional `-M` deep merge → optional `-O` overlay →
+   `remove_schema_key_duplication` → `add_section_breaks` → output.
 
 4. **`exceptions.py`** — Custom exceptions:
    - `NameCollisionError` — duplicate class/enum names across modules
@@ -120,6 +128,9 @@ Options:
      via slot_usage
    - `YAMLContentError` — YAML file content is not what is expected (e.g.,
      not a mapping)
+   - `InvalidLinkMLSchemaError` — schema does not conform to the LinkML
+     meta schema (unknown fields, wrong-type values, etc.); raised by
+     `canonicalize_schema_yml`
 
 ### Key Design Patterns
 
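The `remove_schema_key_duplication` pass described in the CLAUDE.md changes above can be illustrated on plain dicts. This is a minimal sketch of the assumed YAML shape — the real function also handles slots, enums, slot_usage entries, and permissible values, and the example schema is invented.

```python
# Minimal sketch: the mapping key already names each entry, so the inner
# "name"/"prefix_prefix" fields the YAML dumper emits are redundant.
schema = {
    "prefixes": {
        "dcterms": {
            "prefix_prefix": "dcterms",  # duplicates the mapping key
            "prefix_reference": "http://purl.org/dc/terms/",
        },
    },
    "classes": {
        "Person": {"name": "Person", "description": "A person"},
    },
}

for prefix in schema["prefixes"].values():
    prefix.pop("prefix_prefix", None)  # drop the field duplicated by the key
for cls in schema["classes"].values():
    cls.pop("name", None)  # ditto for class names
```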
4 changes: 2 additions & 2 deletions README.md
@@ -16,6 +16,6 @@ pydantic2linkml -o o.yml -l INFO dandischema.models
 | Flag | Description |
 |------|-------------|
 | `-o` / `--output-file` | Write output to a file (default: stdout) |
-| `-M` / `--merge-file` | Deep-merge a YAML file into the generated schema. Values from the file win on conflict; no field filtering is applied. |
-| `-O` / `--overlay-file` | Shallow-merge a YAML file into the generated schema. Only `SchemaDefinition` fields are applied; unknown keys are skipped with a warning. |
+| `-M` / `--merge-file` | Deep-merge a YAML file into the generated schema. Values from the file win on conflict; the result is validated against the LinkML meta schema. |
+| `-O` / `--overlay-file` | Shallow-merge a YAML file into the generated schema. The result is validated against the LinkML meta schema. |
 | `-l` / `--log-level` | Log level (default: `WARNING`) |
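The difference between `-O` (shallow) and `-M` (deep) is worth spelling out: a shallow merge replaces each top-level key wholesale, with no recursion into nested mappings. A minimal sketch with invented dicts, not the tool's actual output:

```python
# Shallow overlay semantics: top-level values are replaced whole.
generated = {"name": "gen", "classes": {"A": {}, "B": {}}}
overlay = {"classes": {"C": {}}}

shallow = {**generated, **overlay}  # equivalent to dict.update
# shallow["classes"] is now just {"C": {}} — classes A and B are gone,
# whereas a deep merge would have kept them alongside C.
```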
24 changes: 15 additions & 9 deletions src/pydantic2linkml/cli/__init__.py
@@ -8,7 +8,7 @@
 from pydantic import ValidationError
 
 from pydantic2linkml.cli.tools import LogLevel
-from pydantic2linkml.exceptions import YAMLContentError
+from pydantic2linkml.exceptions import InvalidLinkMLSchemaError, YAMLContentError
 from pydantic2linkml.gen_linkml import translate_defs
 from pydantic2linkml.tools import (
     add_section_breaks,
@@ -31,9 +31,7 @@ def main(
             "-M",
             help="A YAML file whose contents are deep-merged into the generated "
             "schema. Values from this file win on conflict. The result is "
-            "always a valid YAML file but may not be a valid LinkML schema — "
-            "it is the user's responsibility to supply a merge file that "
-            "produces a valid schema.",
+            "validated against the LinkML meta schema.",
         ),
     ] = None,
     overlay_file: Annotated[
@@ -43,10 +41,7 @@ def main(
             "-O",
             help="An overlay file specifying a partial schema to be applied on top of "
             "the generated schema. The overlay is merged into the serialized YAML "
-            "output, so the result is always a valid YAML file but may not be a "
-            "valid LinkML schema — it is the user's responsibility to supply an "
-            "overlay that produces a valid schema. Overlay keys that do not "
-            "correspond to a field of SchemaDefinition are skipped.",
+            "output. The result is validated against the LinkML meta schema.",
         ),
     ] = None,
     output_file: Annotated[Optional[Path], typer.Option("--output-file", "-o")] = None,
@@ -59,7 +54,7 @@
 
     schema = translate_defs(module_names)
     logger.info("Dumping schema")
-    yml = remove_schema_key_duplication(yaml_dumper.dumps(schema))
+    yml = yaml_dumper.dumps(schema)
     if merge_file is not None:
         logger.info("Applying deep merge from %s", merge_file)
         try:
@@ -79,6 +74,11 @@ def main(
                 f"The merge file does not contain a valid YAML mapping: {e}",
                 param_hint="'--merge-file'",
             ) from e
+        except InvalidLinkMLSchemaError as e:
+            raise typer.BadParameter(
+                f"The merge file produces an invalid schema: {e}",
+                param_hint="'--merge-file'",
+            ) from e
     if overlay_file is not None:
         logger.info("Applying overlay from %s", overlay_file)
         try:
@@ -93,6 +93,12 @@ def main(
                 f"The overlay file does not contain a valid YAML mapping: {e}",
                 param_hint="'--overlay-file'",
             ) from e
+        except InvalidLinkMLSchemaError as e:
+            raise typer.BadParameter(
+                f"The overlay file produces an invalid schema: {e}",
+                param_hint="'--overlay-file'",
+            ) from e
+    yml = remove_schema_key_duplication(yml)
     yml = add_section_breaks(yml)
     if not output_file:
         print(yml, end="")  # noqa: T201
7 changes: 7 additions & 0 deletions src/pydantic2linkml/exceptions.py
@@ -110,3 +110,10 @@ class YAMLContentError(ValueError):
     """
     Raise when the content of a YAML file is not what is expected
     """
+
+
+class InvalidLinkMLSchemaError(ValueError):
+    """
+    Raised when a YAML string does not conform to the LinkML meta schema
+    (e.g. unknown field names or wrong-type values)
+    """
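The canonicalize-by-round-trip idea behind `canonicalize_schema_yml` — including its translation of `TypeError` into `InvalidLinkMLSchemaError` for unknown fields — can be mimicked with a stdlib dataclass standing in for `SchemaDefinition`. A sketch of the pattern under that assumption, not the LinkML API:

```python
from dataclasses import asdict, dataclass


@dataclass
class MiniSchema:
    """Stand-in for SchemaDefinition: field order defines canonical key order."""
    id: str
    name: str
    description: str = ""


class InvalidMiniSchemaError(ValueError):
    """Plays the role of InvalidLinkMLSchemaError in this sketch."""


def canonicalize(data: dict) -> dict:
    try:
        obj = MiniSchema(**data)  # unknown keys raise TypeError here
    except TypeError as e:
        raise InvalidMiniSchemaError(f"Unknown field in schema: {e}") from e
    return asdict(obj)  # keys come back in declaration order


out = canonicalize({"description": "d", "id": "x", "name": "n"})
# out's keys now follow the dataclass declaration: id, name, description.
```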
140 changes: 104 additions & 36 deletions src/pydantic2linkml/tools.py
@@ -1,3 +1,4 @@
+import functools
 import importlib
 import inspect
 import logging
@@ -7,12 +8,15 @@
 from collections.abc import Callable, Iterable
 from dataclasses import fields
 from enum import Enum
+from importlib.resources import files as resource_files
 from operator import attrgetter, itemgetter
 from types import ModuleType
 from typing import Any, NamedTuple, Optional, TypeVar, cast
 
 import yaml
 from linkml_runtime.dumpers import yaml_dumper
 from linkml_runtime.linkml_model import SchemaDefinition, SlotDefinition
+from linkml_runtime.loaders import yaml_loader
+from linkml_runtime.utils.formatutils import is_empty
 from pydantic import BaseModel, FilePath, RootModel, validate_call
@@ -22,9 +26,10 @@
 from pydantic_core import core_schema
 
 from pydantic2linkml.exceptions import (
+    InvalidLinkMLSchemaError,
     NameCollisionError,
-    YAMLContentError,
     SlotExtensionError,
+    YAMLContentError,
 )
 
 logger = logging.getLogger(__name__)
@@ -534,17 +539,83 @@ def get_slot_usage_entry(
     )
 
 
+@functools.cache
+def _get_meta_schema_validator():
+    """Return a cached LinkML meta-schema validator.
+
+    The validator is initialized lazily on first call (importing
+    ``linkml.validator`` is slow) and then cached for reuse.
+    ``closed=True`` adds ``additionalProperties: false`` to every object
+    type in the generated JSON Schema, so unknown field names are caught
+    as validation errors.
+    """
+    from linkml.validator import Validator
+    from linkml.validator.plugins import JsonschemaValidationPlugin
+
+    meta_schema_path = str(
+        resource_files("linkml_runtime.linkml_model.model.schema").joinpath("meta.yaml")
+    )
+    return Validator(
+        meta_schema_path,
+        validation_plugins=[JsonschemaValidationPlugin(closed=True)],
+    )
+
+
+def canonicalize_schema_yml(yml: str) -> str:
+    """Canonicalize a YAML string as a LinkML schema via a round-trip.
+
+    Deserializes ``yml`` into a ``SchemaDefinition`` object,
+    re-serializes it to canonical YAML, then validates the canonical
+    output against the LinkML meta schema. This serves two purposes:
+
+    * **Canonical ordering** — the output keys follow the same order
+      produced by serializing a freshly constructed ``SchemaDefinition``.
+    * **Validation** — the canonical YAML is validated against the
+      LinkML meta schema. Unknown field names and wrong-type values for
+      known fields are caught and re-raised as ``InvalidLinkMLSchemaError``.
+
+    :param yml: A YAML string to canonicalize as a LinkML schema.
+    :return: Canonically ordered, validated YAML string representing the
+        schema.
+    :raises InvalidLinkMLSchemaError: If the resulting schema does not
+        conform to the LinkML meta schema (unknown field names,
+        wrong-type values, etc.).
+    """
+    try:
+        sd = yaml_loader.loads(yml, target_class=SchemaDefinition)
+    except TypeError as e:
+        raise InvalidLinkMLSchemaError(f"Unknown field in schema: {e}") from e
+
+    canonical = yaml_dumper.dumps(sd)
+
+    validator = _get_meta_schema_validator()
+    report = validator.validate(yaml.safe_load(canonical), "schema_definition")
+    if report.results:
+        raise InvalidLinkMLSchemaError(
+            "Schema validation failed: " + "; ".join(r.message for r in report.results)
+        )
+
+    return canonical
+
+
 @validate_call
 def apply_schema_overlay(schema_yml: str, overlay_file: FilePath) -> str:
     """Apply an overlay YAML file onto a serialized schema YAML string.
 
-    :param schema_yml: YAML string of a serialized SchemaDefinition
+    All keys from the overlay are applied without filtering. The result
+    is then passed through ``canonicalize_schema_yml``, which reorders
+    keys canonically and validates the output against the LinkML meta
+    schema.
+
+    :param schema_yml: YAML string of a valid LinkML schema
     :param overlay_file: Path to an existing overlay YAML file
-    :return: YAML string with the overlay applied, keys ordered to match
-        SchemaDefinition field order
+    :return: Canonical YAML string with the overlay applied, keys in
+        SchemaDefinition order
     :raises ValueError: If ``schema_yml`` does not deserialize to a dict
     :raises YAMLContentError: If the overlay file does not contain a YAML
         mapping
+    :raises InvalidLinkMLSchemaError: If the result does not conform to
+        the LinkML meta schema
     """
     schema_dict = yaml.safe_load(schema_yml)
     if not isinstance(schema_dict, dict):
@@ -560,40 +631,31 @@ def apply_schema_overlay(schema_yml: str, overlay_file: FilePath) -> str:
             f"Overlay file {overlay_file} must contain a YAML mapping"
         )
 
-    # Ordered list of valid SchemaDefinition field names
-    sd_field_names = [f.name for f in fields(SchemaDefinition)]
-    sd_field_set = set(sd_field_names)
-
-    # Apply overlay, skipping keys that are not SchemaDefinition fields
-    for k, v in overlay.items():
-        if k not in sd_field_set:
-            logger.warning(
-                "Overlay key '%s' is not a field of SchemaDefinition. Skipping.",
-                k,
-            )
-        else:
-            schema_dict[k] = v
-
-    # Rebuild dict in SchemaDefinition field order
-    ordered = {k: schema_dict[k] for k in sd_field_names if k in schema_dict}
-
-    return yaml.dump(ordered, allow_unicode=True, sort_keys=False)
+    schema_dict.update(overlay)
+    return canonicalize_schema_yml(
+        yaml.dump(schema_dict, allow_unicode=True, sort_keys=False)
+    )
 
 
 @validate_call
 def apply_yaml_deep_merge(schema_yml: str, merge_file: FilePath) -> str:
     """Deep-merge a YAML file into a serialized schema YAML string.
 
     Values from the merge file win on conflict. The merge is unrestricted —
-    no field filtering is applied.
+    no field filtering is applied. The result is then passed through
+    ``canonicalize_schema_yml``, which reorders keys canonically and
+    validates the output against the LinkML meta schema.
 
     :param schema_yml: YAML string of a valid LinkML schema
     :param merge_file: Path to an existing YAML file containing a mapping
-    :return: YAML string with the deep merge applied
-    :raises ValueError: If ``schema_yml`` does not contain valid YAML or does
-        not deserialize to a dict
+    :return: Canonical YAML string with the deep merge applied
+    :raises ValueError: If ``schema_yml`` does not contain valid YAML or
+        does not deserialize to a dict
     :raises yaml.YAMLError: If the merge file does not contain valid YAML
-    :raises YAMLContentError: If the merge file does not contain a YAML mapping
+    :raises YAMLContentError: If the merge file does not contain a YAML
+        mapping
+    :raises InvalidLinkMLSchemaError: If the result does not conform to
+        the LinkML meta schema
     """
     from deepmerge import always_merger
@@ -613,26 +675,32 @@ def apply_yaml_deep_merge(schema_yml: str, merge_file: FilePath) -> str:
     if not isinstance(merge_dict, dict):
         raise YAMLContentError(f"Merge file {merge_file} must contain a YAML mapping")
 
-    return yaml.dump(
-        always_merger.merge(schema_dict, merge_dict),
-        allow_unicode=True,
-        sort_keys=False,
+    return canonicalize_schema_yml(
+        yaml.dump(
+            always_merger.merge(schema_dict, merge_dict),
+            allow_unicode=True,
+            sort_keys=False,
+        )
     )
 
 
 def remove_schema_key_duplication(yml: str) -> str:
-    """Remove redundant name/text fields from a valid serialized LinkML schema.
+    """Remove redundant name/text/prefix_prefix fields from a serialized
+    LinkML schema.
 
     In LinkML's serialized YAML, dictionary keys already serve as
-    identifiers for classes, slots, enums, slot_usage entries, and
-    permissible values. This function strips the redundant ``name`` and
-    ``text`` fields that the linkml-runtime YAML dumper includes alongside
-    those keys.
+    identifiers for classes, slots, enums, slot_usage entries,
+    permissible values, and prefixes. This function strips the redundant
+    ``name``, ``text``, and ``prefix_prefix`` fields that the
+    linkml-runtime YAML dumper includes alongside those keys.
 
     :param yml: A YAML string representing a **valid** LinkML schema.
     """
     schema = yaml.safe_load(yml)
 
+    for prefix in schema.get("prefixes", {}).values():
+        prefix.pop("prefix_prefix", None)
+
     for cls in schema.get("classes", {}).values():
         cls.pop("name", None)
         for su in cls.get("slot_usage", {}).values():
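The lazy-init-and-cache pattern used by `_get_meta_schema_validator` in the tools.py diff above is plain `functools.cache` on a zero-argument factory: the expensive construction runs on the first call only, and every later call returns the cached object. A self-contained sketch, with a counter standing in for the slow `linkml.validator` import:

```python
import functools

construction_count = 0


@functools.cache
def get_validator() -> dict:
    """Build the expensive resource once; later calls return the cached object."""
    global construction_count
    construction_count += 1  # stands in for the slow import/build step
    return {"ready": True}


first = get_validator()
second = get_validator()
# The factory body ran exactly once; both calls return the same object.
```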
13 changes: 10 additions & 3 deletions tests/test_cli.py
@@ -56,12 +56,12 @@ def test_non_mapping(self, tmp_path: Path):
         assert result.exit_code == 2
         assert "does not contain a" in result.output.lower()
 
-    def test_unknown_key(self, tmp_path: Path):
+    def test_unknown_field_raises_bad_parameter(self, tmp_path: Path):
         overlay_file = tmp_path / "overlay.yaml"
         overlay_file.write_text("not_a_field: some_value\n")
         result = runner.invoke(app, ["dandischema.models", "-O", str(overlay_file)])
-        assert result.exit_code == 0
-        assert "not_a_field" not in result.output
+        assert result.exit_code == 2
+        assert "not_a_field" in result.output
 
 
 class TestCliDeepMerge:
@@ -106,3 +106,10 @@ def test_invalid_yaml(self, tmp_path: Path):
         result = runner.invoke(app, ["dandischema.models", "-M", str(merge_file)])
         assert result.exit_code == 2
         assert "does not contain valid YAML" in result.output
+
+    def test_unknown_field_raises_bad_parameter(self, tmp_path: Path):
+        merge_file = tmp_path / "merge.yaml"
+        merge_file.write_text("not_a_field: some_value\n")
+        result = runner.invoke(app, ["dandischema.models", "-M", str(merge_file)])
+        assert result.exit_code == 2
+        assert "not_a_field" in result.output