Schema Enrichment Pipeline

Semi-automated pipeline to annotate Elasticsearch index fields for:

API documentation → schemas/<index>.json (clean JSON Schema draft-07)
MCP server → annotations/<index>.yaml (annotated fields)

Install

uv sync

Environment variables

ES_URL: URL of the Elasticsearch cluster
ES_API_KEY: API key for Elasticsearch
MISTRAL_COMPLETION_URL: URL of the Mistral API (chat completions endpoint)
MISTRAL_API_KEY: API key for Mistral
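
Before running any step, it can help to confirm that the four variables are set. The snippet below is a minimal sketch and not part of the repository: `REQUIRED_VARS` and `missing_env` are hypothetical names introduced here for illustration.

```python
import os

# The four variables listed above; the tuple name is an illustrative assumption.
REQUIRED_VARS = ("ES_URL", "ES_API_KEY", "MISTRAL_COMPLETION_URL", "MISTRAL_API_KEY")


def missing_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env()` with no argument checks the current process environment; an empty list means everything is set.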

Setup

The pipeline uses index-specific configuration files located in the configs/ directory.

1. Fill in a configuration file

Create configs/<index-name>.yaml:

<index-name>:
  schema: <index_name>.json
  annotation: <index_name>.yaml
  content: "Description of what this index contains for AI context."
  primary_fields:
    - id
    - title.default
    - year
  excludes:
    - secret_field.*
  includes:
    - secret_field.public_part
  cross_index:
    organization_id:
      index: scanr-organizations
      join_field: id

Workflow

ES index ──► merge.py ──► annotations/<index>.yaml ──► enrich.py ──► review.py ──► export.py
                                 ▲                                     │
                                 └───────────── iterate ───────────────┘

Step 1 — merge

Pull the ES mapping and merge it with an existing JSON schema, if one is available in schemas/backup/:

python -m src.merge --index scanr-publications

Options:

--index, -i: (Required) The ES index name.
--schema, -s: Override the default backup schema path.

This creates/updates annotations/<index>.yaml. Existing approved descriptions are preserved.
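
The merge behavior can be pictured as flattening the ES mapping into dotted paths, then carrying over any entry that is already approved. This is a simplified sketch under stated assumptions, not the code in src/merge.py: `flatten_mapping` and `merge_fields` are hypothetical names, and multi-fields and the backup JSON schema are ignored.

```python
def flatten_mapping(properties, prefix=""):
    """Flatten an ES mapping `properties` dict into {dotted_path: es_type}."""
    flat = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # Object or nested field: recurse into its sub-properties.
            flat.update(flatten_mapping(spec["properties"], prefix=f"{path}."))
        else:
            flat[path] = spec.get("type", "object")
    return flat


def merge_fields(mapping_properties, existing_fields):
    """Build the `fields` section, preserving approved entries as they are."""
    merged = {}
    for path, es_type in flatten_mapping(mapping_properties).items():
        previous = existing_fields.get(path, {})
        if previous.get("status") == "approved":
            # Keep the approved description; only refresh the type.
            merged[path] = {**previous, "type": es_type}
        else:
            merged[path] = {**previous, "status": "draft", "type": es_type}
    return merged
```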

Step 2 — enrich

Send all draft fields to Mistral in batches for description suggestions:

python -m src.enrich --index scanr-publications

Options:

--index, -i: (Required) The ES index name.
--field, -f: Restrict to a single dotted field path.
--force: Force re-enrichment of fields that already have an AI suggestion.

Suggestions are written into annotations/<index>.yaml under ai_suggestion:.
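
As an illustration of how the options interact, the sketch below selects the draft fields to send and splits them into batches. It is an assumption-laden outline rather than the logic of src/enrich.py: the function names and the batch size are hypothetical, and the call to Mistral is left out.

```python
def fields_to_enrich(fields, force=False, only=None):
    """Return the dotted paths of draft fields that still need a suggestion."""
    selected = []
    for path, info in fields.items():
        if only is not None and path != only:
            continue  # mirrors --field: restrict to one dotted path
        if info.get("status") != "draft":
            continue  # approved fields are never re-enriched
        if "ai_suggestion" in info and not force:
            continue  # mirrors --force: skip fields that already have a suggestion
        selected.append(path)
    return selected


def batched(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For example, `batched(fields_to_enrich(fields), 20)` would yield the groups of field paths to submit, one request per group.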

Step 3 — review

Interactive CLI to approve/reject AI suggestions:

python -m src.review --index scanr-publications

For each field:

[a] accept suggestion as-is
[e] edit then accept
[s] skip (stays draft)
[r] reject (stays draft, suggestion removed)
[q] quit and save

Options:

--index, -i: (Required) The ES index name.
--field, -f: Review a specific field.
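
The effect of each review action on a field entry can be summarized as a small state transition. The following is a sketch of that behavior as described above, not the interactive code in src/review.py; `apply_review_action` is a hypothetical name, and [q] is omitted because it ends the session rather than changing a field.

```python
def apply_review_action(field, action, edited_text=None):
    """Return a copy of `field` after applying one review action (a, e, s or r)."""
    field = dict(field)  # shallow copy so the caller's entry is left untouched
    suggestion = field.get("ai_suggestion", {}).get("description")
    if action == "a":
        # Accept the suggestion as-is.
        field.update(status="approved", description=suggestion)
        field.pop("ai_suggestion", None)
    elif action == "e":
        # Accept a manually edited description.
        field.update(status="approved", description=edited_text)
        field.pop("ai_suggestion", None)
    elif action == "r":
        # Reject: the field stays draft and the suggestion is removed.
        field.pop("ai_suggestion", None)
    elif action != "s":
        # Skip leaves the field unchanged; anything else is an error.
        raise ValueError(f"unknown action: {action!r}")
    return field
```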

Step 4 — export

Generate the final JSON Schema:

python -m src.export --index scanr-publications

Options:

--index, -i: (Required) The ES index name.
--include-draft: Include fields even if they are still in draft status.
--include-ai-suggestion: Use AI suggestions for descriptions if no approved description exists.
--output, -o: Override the output filename in schemas/.
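
To make the export step concrete, here is a rough sketch of turning annotated fields into a draft-07 JSON Schema. It is not the implementation in src/export.py: the ES-to-JSON type table and `build_schema` are assumptions made for illustration, and draft fields are assumed to be skipped unless `include_draft` is set.

```python
# Hypothetical, incomplete mapping from ES field types to JSON Schema types.
ES_TO_JSON_TYPE = {
    "keyword": "string", "text": "string", "date": "string",
    "integer": "integer", "long": "integer",
    "float": "number", "double": "number", "boolean": "boolean",
}


def build_schema(fields, include_draft=False, include_ai_suggestion=False):
    """Nest dotted field paths into a JSON Schema `properties` tree."""
    schema = {"$schema": "http://json-schema.org/draft-07/schema#", "type": "object"}
    for path, info in fields.items():
        if info.get("status") != "approved" and not include_draft:
            continue
        description = info.get("description")
        if description is None and include_ai_suggestion:
            description = info.get("ai_suggestion", {}).get("description")
        *parents, leaf = path.split(".")
        node = schema
        for part in parents:
            # Create intermediate object nodes on demand.
            node = node.setdefault("properties", {}).setdefault(part, {"type": "object"})
        leaf_schema = {"type": ES_TO_JSON_TYPE.get(info.get("type"), "string")}
        if description:
            leaf_schema["description"] = description
        node.setdefault("properties", {})[leaf] = leaf_schema
    return schema
```

Passing the result to `json.dumps(schema, indent=2)` would give the text to write under schemas/.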

annotations.yaml structure

Located in annotations/, this file is the source of truth for field documentation.

_meta:
  index: scanr-publications
  total_fields: 42
  approved: 38
  draft: 4

fields:
  id:
    status: approved          # approved | draft
    type: keyword
    description: "Main PID of the publication..."
    primary: true

  authors.fullName:
    status: draft
    type: text
    primary: true
    ai_suggestion:            # written by enrich.py, removed after review
      description: "Full name of the author as a single string."

  affiliations.id:
    status: approved
    type: keyword
    description: "Internal identifier of the affiliated organization."
    cross_index:                # from configs/<index>.yaml
      index: scanr-organizations
      join_field: id
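
The `_meta` counters can be derived from the `fields` section. The sketch below shows one way to recompute them, assuming a field without a status counts as draft; `compute_meta` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter


def compute_meta(index, fields):
    """Recompute the `_meta` block from the `fields` mapping."""
    counts = Counter(info.get("status", "draft") for info in fields.values())
    return {
        "index": index,
        "total_fields": len(fields),
        "approved": counts["approved"],
        "draft": counts["draft"],
    }
```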

Triggering Pipeline Externally

Since the pipeline is exposed via a GitHub Actions workflow_dispatch event, you can trigger it programmatically from other repositories or scripts.

Using GitHub CLI

If you have the gh CLI authenticated:

gh workflow run annotate.yaml \
  --repo dataesr/elastic-annotation \
  -f index="scanr-publications" \
  -f skip_enrich="false" \
  -f include_draft="false"

Using cURL (REST API)

You can call the GitHub Actions REST API with a Personal Access Token (PAT) that has the repo scope (or actions:write permission). This is useful for triggering the pipeline automatically at the end of another CI/CD pipeline:

curl -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: Bearer YOUR_GITHUB_PAT" \
  https://api.github.com/repos/dataesr/elastic-annotation/actions/workflows/annotate.yaml/dispatches \
  -d '{
    "ref": "main",
    "inputs": {
      "index": "scanr-publications, scanr-organizations",
      "skip_enrich": "false",
      "include_draft": "false"
    }
  }'
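
The same call can be made from a Python script. The sketch below only builds the request with the standard library, mirroring the cURL example above; `build_dispatch_request` is a hypothetical helper, and the token is assumed to be supplied by the caller.

```python
import json
import urllib.request

DISPATCH_URL = (
    "https://api.github.com/repos/dataesr/elastic-annotation"
    "/actions/workflows/annotate.yaml/dispatches"
)


def build_dispatch_request(token, index, skip_enrich=False, include_draft=False, ref="main"):
    """Build (but do not send) the workflow_dispatch POST request."""
    payload = {
        "ref": ref,
        "inputs": {
            "index": index,
            # The cURL example passes the boolean inputs as strings.
            "skip_enrich": str(skip_enrich).lower(),
            "include_draft": str(include_draft).lower(),
        },
    }
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github.v3+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen(...)`; an HTTP error raises `urllib.error.HTTPError`.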
