Semi-automated pipeline to annotate Elasticsearch index fields for:
- API documentation → schemas/<index>.json (clean JSON Schema draft-07)
- MCP server → annotations/<index>.yaml (annotated fields)
uv sync

- ES_URL: URL of the Elasticsearch cluster
- ES_API_KEY: API key for Elasticsearch
- MISTRAL_COMPLETION_URL: URL of the Mistral API (chat completions endpoint)
- MISTRAL_API_KEY: API key for Mistral

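As a minimal sketch of how these variables might be checked before a run: the variable names come from the list above, while the `load_settings` helper and its error message are hypothetical and not part of the repository.

```python
import os

# Names taken from the environment variable list; the helper is illustrative only.
REQUIRED_VARS = ("ES_URL", "ES_API_KEY", "MISTRAL_COMPLETION_URL", "MISTRAL_API_KEY")


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the full list of missing names avoids discovering a missing key halfway through a run.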
The pipeline uses index-specific configuration files located in the configs/ directory.
Create configs/<index-name>.yaml:
<index-name>:
  schema: <index_name>.json
  annotation: <index_name>.yaml
  content: "Description of what this index contains for AI context."
  primary_fields:
    - id
    - title.default
    - year
  excludes:
    - secret_field.*
  includes:
    - secret_field.public_part
  cross_index:
    organization_id:
      index: scanr-organizations
      join_field: id

ES index ──► merge.py ──► annotations/<index>.yaml ──► enrich.py ──► review.py ──► export.py
                                     ▲                                     │
                                     └───────────── iterate ───────────────┘

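A note on the excludes and includes keys of configs/<index-name>.yaml: the example values suggest that an include pattern re-admits a field that an exclude pattern would otherwise drop. The sketch below illustrates that reading with shell-style wildcards; the precedence rule and the `is_field_kept` helper are assumptions, not the pipeline's confirmed behaviour.

```python
from fnmatch import fnmatchcase


def is_field_kept(path, excludes=(), includes=()):
    """Assumed rule: an include pattern wins over an exclude pattern."""
    if any(fnmatchcase(path, pattern) for pattern in includes):
        return True
    return not any(fnmatchcase(path, pattern) for pattern in excludes)


excludes = ["secret_field.*"]
includes = ["secret_field.public_part"]
fields = ["id", "secret_field.token", "secret_field.public_part"]

# Keep only the fields that pass the filter.
kept = [path for path in fields if is_field_kept(path, excludes, includes)]
```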
Pull ES mapping and merge with an optional existing JSON schema (from schemas/backup/):
python -m src.merge --index scanr-publications

Options:
- --index, -i: (Required) The ES index name.
- --schema, -s: Override the default backup schema path.

This creates/updates annotations/<index>.yaml. Existing approved descriptions are preserved.
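The merge step can be pictured roughly as follows. This is a sketch under assumptions: both helper functions are hypothetical, only the `properties` nesting of an Elasticsearch mapping is handled (multi-fields are ignored), and the real src.merge may differ.

```python
def flatten_mapping(properties, prefix=""):
    """Flatten Elasticsearch mapping `properties` into {dotted_path: type}."""
    flat = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:  # object or nested field: recurse into children
            flat.update(flatten_mapping(spec["properties"], prefix=f"{path}."))
        else:
            flat[path] = spec.get("type", "object")
    return flat


def merge_fields(mapping_fields, existing_fields):
    """Add new fields as drafts; keep approved entries (and their descriptions)."""
    merged = {}
    for path, es_type in mapping_fields.items():
        previous = existing_fields.get(path, {})
        if previous.get("status") == "approved":
            merged[path] = {**previous, "type": es_type}
        else:
            merged[path] = {**previous, "status": "draft", "type": es_type}
    return merged
```

Keying everything by dotted path is what lets a rerun match a mapping field to its earlier annotation entry.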
Send all draft fields to Mistral in batches for description suggestions:
python -m src.enrich --index scanr-publications

Options:
- --index, -i: (Required) The ES index name.
- --field, -f: Restrict to a single dotted field path.
- --force: Force re-enrichment of fields that already have an AI suggestion.

Suggestions are written into annotations/<index>.yaml under ai_suggestion:.
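To illustrate the batching and the effect of --force, here is a hedged sketch of how draft fields might be selected and grouped before being sent to the model. The `draft_batches` helper and the batch size of 20 are assumptions; the actual batch size used by src.enrich is not stated here.

```python
def draft_batches(fields, batch_size=20, force=False):
    """Yield lists of dotted paths for draft fields that still need a suggestion."""
    pending = [
        path
        for path, entry in fields.items()
        # Without force, fields that already carry an ai_suggestion are skipped.
        if entry.get("status") == "draft" and (force or "ai_suggestion" not in entry)
    ]
    for start in range(0, len(pending), batch_size):
        yield pending[start:start + batch_size]
```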
Interactive CLI to approve/reject AI suggestions:
python -m src.review --index scanr-publications

For each field:
- [a] accept suggestion as-is
- [e] edit then accept
- [s] skip (stays draft)
- [r] reject (stays draft, suggestion removed)
- [q] quit and save

Options:
- --index, -i: (Required) The ES index name.
- --field, -f: Review a specific field.

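The effect of each review key on a field entry can be sketched as a small state transition. The `apply_review` function is hypothetical and only mirrors the behaviour described in the key list above; quitting with q is assumed to be handled by the surrounding loop.

```python
def apply_review(entry, action, edited_text=None):
    """Return a copy of a field entry after one review action (a, e, s or r)."""
    entry = dict(entry)
    suggestion = entry.get("ai_suggestion", {}).get("description")
    if action == "a":      # accept the suggestion as-is
        entry.update(status="approved", description=suggestion)
        entry.pop("ai_suggestion", None)
    elif action == "e":    # edit, then accept the edited text
        entry.update(status="approved", description=edited_text)
        entry.pop("ai_suggestion", None)
    elif action == "r":    # reject: stays draft, suggestion removed
        entry.pop("ai_suggestion", None)
    elif action != "s":    # skip leaves the entry untouched
        raise ValueError(f"unknown action: {action!r}")
    return entry
```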
Generate the final JSON Schema:
python -m src.export --index scanr-publications

Options:
- --index, -i: (Required) The ES index name.
- --include-draft: Include fields even if they are still in draft status.
- --include-ai-suggestion: Use AI suggestions for descriptions if no approved description exists.
- --output, -o: Override the output filename in schemas/.

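A rough sketch of what the export could look like, to show how the two include flags interact. The type mapping table and the `export_schema` function are assumptions made for illustration; arrays, multi-fields and the real output layout of src.export are not covered.

```python
# Assumed, partial mapping from Elasticsearch types to JSON Schema types.
ES_TO_JSON_TYPE = {
    "keyword": "string", "text": "string", "date": "string",
    "integer": "integer", "long": "integer",
    "float": "number", "double": "number", "boolean": "boolean",
}


def export_schema(fields, include_draft=False, include_ai_suggestion=False):
    """Build a nested draft-07 schema from flat, dotted annotation fields."""
    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {},
    }
    for path, entry in fields.items():
        if entry.get("status") != "approved" and not include_draft:
            continue  # draft fields are left out unless explicitly requested
        description = entry.get("description")
        if description is None and include_ai_suggestion:
            description = entry.get("ai_suggestion", {}).get("description")
        *parents, leaf = path.split(".")
        node = schema
        for part in parents:  # create intermediate objects for dotted paths
            node = node["properties"].setdefault(part, {"type": "object"})
            node.setdefault("properties", {})
        prop = {"type": ES_TO_JSON_TYPE.get(entry.get("type"), "string")}
        if description:
            prop["description"] = description
        node["properties"][leaf] = prop
    return schema
```

Note that in this sketch --include-ai-suggestion only has a visible effect on draft fields when --include-draft is also set, since approved fields already carry a description.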
Located in annotations/, this file is the source of truth for field documentation.
_meta:
  index: scanr-publications
  total_fields: 42
  approved: 38
  draft: 4
fields:
  id:
    status: approved  # approved | draft
    type: keyword
    description: "Main PID of the publication..."
    primary: true
  authors.fullName:
    status: draft
    type: text
    primary: true
    ai_suggestion:  # written by enrich.py, removed after review
      description: "Full name of the author as a single string."
  affiliations.id:
    status: approved
    type: keyword
    description: "Internal identifier of the affiliated organization."
    cross_index:  # from configs/<index>.yaml
      index: scanr-organizations
      join_field: id

Since the pipeline is exposed via a GitHub Actions workflow_dispatch event, you can trigger it programmatically from other repositories or scripts.
If you have the gh CLI authenticated:
gh workflow run annotate.yaml \
  --repo dataesr/elastic-annotation \
  -f index="scanr-publications" \
  -f skip_enrich="false" \
  -f include_draft="false"

You can invoke the GitHub Actions REST API using a Personal Access Token (PAT) with repo (or actions:write) permissions. This is useful for triggering the pipeline step automatically at the end of another CI/CD pipeline:
curl -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: Bearer YOUR_GITHUB_PAT" \
  https://api.github.com/repos/dataesr/elastic-annotation/actions/workflows/annotate.yaml/dispatches \
  -d '{
    "ref": "main",
    "inputs": {
      "index": "scanr-publications, scanr-organizations",
      "skip_enrich": "false",
      "include_draft": "false"
    }
  }'
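
For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a minimal sketch: the URL, headers and input names are copied from the curl example above, while the `build_dispatch_request` helper is hypothetical. It only builds the request; nothing is sent until it is passed to urllib.request.urlopen.

```python
import json
import urllib.request

API_URL = (
    "https://api.github.com/repos/dataesr/elastic-annotation"
    "/actions/workflows/annotate.yaml/dispatches"
)


def build_dispatch_request(token, index, skip_enrich=False, include_draft=False, ref="main"):
    """Build (but do not send) the workflow_dispatch request from the curl example."""
    payload = {
        "ref": ref,
        "inputs": {
            "index": index,
            # The curl example passes the flags as the strings "true"/"false".
            "skip_enrich": str(skip_enrich).lower(),
            "include_draft": str(include_draft).lower(),
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github.v3+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A caller would then run urllib.request.urlopen(build_dispatch_request(token, "scanr-publications")), reading the token from a CI secret rather than hard-coding it.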