Add recentchange SSE ingest service with Postgres persistence and replay tool #1

Draft
artvandelay wants to merge 1 commit into main from codex/create-service-for-eventstreams-ingest

Motivation

  • Implement a long-running ingest service to reliably consume the Wikimedia recentchange SSE stream with resume/reconnect semantics for gap-free ingestion.
  • Filter events at ingest time to focus on enwiki and configured namespaces while dropping bot/minor edits to reduce noise.
  • Persist raw events durably for backfills and analysis, and expose simple metrics for throughput, reconnects, and lag.
  • Provide a replay tool to reprocess stored events for backfills and testing.
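
For reviewers, the ingest-time filter described above can be pictured roughly as the predicate below. This is a minimal sketch, not the code in this PR: the function name `should_keep` and its default arguments are hypothetical, and it assumes the decoded event carries the `wiki`, `namespace`, `bot`, and `minor` fields of the recentchange payload.

```python
from typing import Any, Iterable, Mapping


def should_keep(
    event: Mapping[str, Any],
    namespaces: Iterable[int] = (0,),  # hypothetical default: main namespace only
    wiki: str = "enwiki",
) -> bool:
    """Return True if the event passes the ingest-time filter."""
    # Keep only events from the target wiki.
    if event.get("wiki") != wiki:
        return False
    # Keep only events in the configured namespaces.
    if event.get("namespace") not in set(namespaces):
        return False
    # Drop bot and minor edits; missing flags are treated as False.
    if event.get("bot") or event.get("minor"):
        return False
    return True
```

Treating a missing `bot` or `minor` flag as false is a design choice in this sketch; a stricter filter could reject events that lack those fields.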

Description

  • Add services/ingest/recentchange_ingest.py which parses the SSE stream, honors Last-Event-ID for resume, implements exponential backoff reconnects, filters events, logs structured JSON metrics, and writes events to the recentchange_events Postgres table.
  • Implement RecentChangeStore with automatic table/index creation and insert_event/fetch_last_event_id helpers to support idempotent inserts and resume.
  • Add scripts/replay_recentchange.py CLI to stream stored events from the DB (by id range/limit) and emit JSON lines for replay/backfill workflows, and add services/ingest/__init__.py.
  • Add psycopg2-binary to pyproject.toml and require POSTGRES_DSN (env/arg) for DB connectivity.
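
To make the resume/reconnect behaviour in the first bullet concrete, here is a rough standalone sketch of SSE line parsing that tracks the last seen event id, plus a capped exponential backoff delay. It is illustrative only: `parse_sse` and `backoff_delay` are hypothetical names, the base and cap values are assumptions, and the actual code in `recentchange_ingest.py` may be structured differently.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_event_id, data) for each event in a stream of SSE lines."""
    last_id: Optional[str] = None  # persists across events, as in the SSE spec
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data:
                yield last_id, "\n".join(data)
            data = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional leading space
        if field == "id":
            last_id = value
        elif field == "data":
            data.append(value)
        # Other fields (event, retry) are ignored in this sketch.


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before reconnect number `attempt` (0-based), capped."""
    return min(cap, base * (2 ** attempt))
```

On reconnect, the most recent id yielded by the parser (or the value returned by `fetch_last_event_id` after a restart) would be sent back in the `Last-Event-ID` request header so the stream resumes where it left off.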

Testing

  • No automated tests were executed as part of this change.

