Detect invalid vocabulary terms (properties and classes) in dataset analysis #116

@ddeboer

Problem

Datasets sometimes contain typos or invalid terms from well-known vocabularies. For example, the Kleksi Musiom collection uses schema:ceator (31 records) instead of schema:creator. These invalid terms go undetected and propagate through the pipeline into the dataset browser.

Proposed solution

Add a vocabulary term validation step to @lde/pipeline-void that checks whether properties and classes found in the data are actually defined in their respective vocabularies.

Implementation: TermValidationExecutor (ExecutorDecorator)

A new executor decorator (like VocabularyExecutor) that wraps existing analysis executors:

  1. Passes through all quads from the inner executor.
  2. Collects void:property and void:class IRIs from the output.
  3. Checks each term against known vocabulary definitions for its namespace.
  4. Appends validation quads for unrecognized terms.
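The four steps above could be sketched roughly as follows. This is not the actual `ExecutorDecorator` interface from `@lde/pipeline-void`: the `Quad` shape, the function name `appendValidationQuads`, and the `http://schema.org/` namespace IRI are assumptions made for illustration, and the sketch works on plain arrays instead of a quad stream.

```typescript
// Hypothetical minimal quad shape; a real executor would use RDF/JS terms.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

const VOID_NS = 'http://rdfs.org/ns/void#';
// Assumption: the datasets use the http:// form of the schema.org namespace.
const SCHEMA_NS = 'http://schema.org/';

function appendValidationQuads(
  innerQuads: Quad[],
  knownProperties: Set<string>,
  knownClasses: Set<string>,
): Quad[] {
  const output: Quad[] = [];
  const errors: Quad[] = [];

  for (const quad of innerQuads) {
    // Step 1: pass through every quad from the inner executor.
    output.push(quad);

    // Step 2: only void:property and void:class objects are candidates.
    const isProperty = quad.predicate === VOID_NS + 'property';
    const isClass = quad.predicate === VOID_NS + 'class';
    if (!isProperty && !isClass) continue;

    // Step 3: check the term against the vocabulary for its namespace.
    // Only schema.org is handled here, matching the proposed initial scope.
    if (quad.object.indexOf(SCHEMA_NS) !== 0) continue;
    const known = isProperty ? knownProperties : knownClasses;
    if (known.has(quad.object)) continue;

    // Step 4: record a schema:error literal on the partition node.
    const localName = quad.object.slice(SCHEMA_NS.length);
    const kind = isProperty ? 'property' : 'class';
    errors.push({
      subject: quad.subject,
      predicate: SCHEMA_NS + 'error',
      object: 'schema:' + localName + ' is not a recognized schema.org ' + kind,
    });
  }

  // Validation quads are appended after the untouched inner output.
  return output.concat(errors);
}
```

Appending the errors after the original quads keeps the inner executor's output untouched, so consumers that ignore `schema:error` see no difference.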

Output format

The output is not pure VoID: it extends the VoID property and class partitions with data quality annotations using schema:error:

# Existing output (from entity-properties.rq):
<.../void#property-partition-abc123>
    void:property  schema:ceator ;
    void:entities  31 .

# Appended by TermValidationExecutor:
<.../void#property-partition-abc123>
    schema:error  "schema:ceator is not a recognized schema.org property" .

# Same for invalid classes (from class-partition.rq):
<.../void#class-abc456>
    void:class     schema:ceator ;
    void:entities  31 .

<.../void#class-abc456>
    schema:error  "schema:ceator is not a recognized schema.org class" .

This reuses schema:error as a literal, consistent with how distribution probe failures are reported in @lde/pipeline. Consumers can query for schema:error on partition nodes to find data quality issues.
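On the consumer side, finding flagged partitions could look something like the sketch below. The `Quad` shape and the `findTermErrors` helper are hypothetical, and the `http://schema.org/error` IRI assumes the http:// form of the schema.org namespace; an actual consumer would more likely run a SPARQL query against the store.

```typescript
// Hypothetical minimal quad shape, for illustration only.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

// Assumption: schema:error expands to this IRI in the pipeline output.
const SCHEMA_ERROR = 'http://schema.org/error';

interface TermError {
  partition: string;
  message: string;
}

// Collect every schema:error literal together with the partition node it is attached to.
function findTermErrors(quads: Quad[]): TermError[] {
  return quads
    .filter((quad) => quad.predicate === SCHEMA_ERROR)
    .map((quad) => ({ partition: quad.subject, message: quad.object }));
}
```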

Sourcing valid term lists

The best fit appears to be @zazuko/rdf-vocabularies (and its @vocabulary/* packages), which is in the same Zazuko ecosystem as @zazuko/prefixes (already a dependency). It bundles full RDF definitions for ~60 vocabularies including schema.org, Dublin Core, FOAF, and SKOS. Valid terms can be extracted by filtering on rdf:type rdf:Property / rdf:type rdfs:Class.
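As a rough sketch of that filtering step, assuming the vocabulary definitions have already been loaded into an array of simple quads: the `Quad` shape and `extractTerms` are hypothetical names, and no `@zazuko/rdf-vocabularies` API is shown here because the loading call would need to be checked against that package's documentation.

```typescript
// Hypothetical minimal quad shape, for illustration only.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

const RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
const RDF_PROPERTY = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property';
const RDFS_CLASS = 'http://www.w3.org/2000/01/rdf-schema#Class';

interface VocabularyTerms {
  properties: Set<string>;
  classes: Set<string>;
}

// Build lookup sets of valid property and class IRIs from vocabulary quads.
function extractTerms(vocabularyQuads: Quad[]): VocabularyTerms {
  const properties = new Set<string>();
  const classes = new Set<string>();
  for (const quad of vocabularyQuads) {
    if (quad.predicate !== RDF_TYPE) continue;
    if (quad.object === RDF_PROPERTY) properties.add(quad.subject);
    if (quad.object === RDFS_CLASS) classes.add(quad.subject);
  }
  return { properties, classes };
}
```

Vocabularies that type their terms only as owl:ObjectProperty, owl:DatatypeProperty, or owl:Class would need those types added to the filter; for the schema.org-first scope this is probably not a concern.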

Scope

  • Start with schema.org validation (most common vocabulary in the datasets we process).

Context

Discovered via the Kleksi Phase 1 quality report, which identified schema:ceator as a high-severity data quality issue affecting 31 records.
