Add OpenTelemetry metrics and traces instrumentation#2927

Open
kvirund wants to merge 89 commits into master from metrics-traces-instrumentation


@kvirund commented Feb 23, 2026

Summary

  • Integrates OpenTelemetry (OTel) SDK for distributed tracing and metrics collection
  • Replaces InfluxDB integration with vendor-neutral OTel observability pipeline
  • Adds instrumentation across all major game subsystems: combat, magic, AI, crafting, auction, DG scripts, zone updates, player save/load, heartbeat
  • Adds RAII helper classes for clean OTel span management
  • Adds Grafana dashboards for visualizing OTel data
  • Includes comprehensive observability documentation and deployment guide
  • Merges world-load-refactoring changes (PR #2891, "World load refactoring")

Instrumented systems

  • Combat system (fight.cpp)
  • Magic/Spell system
  • Mobile AI (mobact.cpp)
  • Zone Update system
  • Player save/load
  • Beat Points Update & Player Statistics
  • DG Script Trigger system
  • Auction system
  • Crafting system
  • Heartbeat with trace_id correlation in logs

Test plan

  • Build with -DWITH_OTEL=ON and verify compilation
  • Build without OTel (-DWITH_OTEL=OFF) and verify no regressions
  • Run unit tests: ./tests/tests
  • Boot server and verify OTel spans appear in configured backend (Jaeger/OTLP)
  • Check Grafana dashboards load correctly

🤖 Generated with Claude Code

kvirund and others added 30 commits January 19, 2026 09:35
Step 1 of world loading refactoring plan - baseline checksums.

New files:
- src/engine/db/world_checksum.h/cpp: CRC32-based checksum calculation
  for zones, rooms, mobs, objects, and triggers

Features:
- Calculates individual checksums per entity type using XOR aggregation
- Combined checksum for detecting any world data changes
- Detailed per-object checksums saved to file for diff analysis
- CLI flag -C to disable checksum calculation

Integration:
- Checksums calculated at end of GameLoader::BootWorld()
- Results logged to syslog and saved to checksums_detailed.txt

CMake additions:
- FULL_WORLD_PATH option for specifying full world data location
- Automatic setup of small/full data directories in build dir

Baseline checksums:
  Small World (lib):     Combined: 4E6499FF
  Full World:            Combined: BB58755C

Detailed checksums saved in checksums_small.txt and checksums_full.txt
for future comparison after refactoring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
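
The XOR aggregation described in this commit can be sketched roughly as follows. This is an illustration in Python, not the project's C++ code in world_checksum.cpp; the function names and the byte-string serialization of entities are hypothetical. It uses `zlib.crc32` for the per-entity checksum.

```python
import zlib
from functools import reduce

def entity_checksum(serialized: bytes) -> int:
    """CRC32 of one serialized entity (the serialization format is assumed)."""
    return zlib.crc32(serialized) & 0xFFFFFFFF

def aggregate_checksum(entities: list[bytes]) -> int:
    """XOR all per-entity CRC32 values into one checksum for the entity type."""
    return reduce(lambda acc, item: acc ^ entity_checksum(item), entities, 0)

# Hypothetical serialized rooms, for demonstration only.
rooms = [b"room|100|A dark cave", b"room|101|A sunny glade"]
print(f"{aggregate_checksum(rooms):08X}")
```

Because XOR is commutative, the aggregate does not depend on load order, which suits comparing two loaders. The trade-off is that two byte-identical entities cancel each other out, so the per-entity detail file is still needed for diff analysis.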
Introduces an interface-based abstraction layer for world data loading:
- IWorldDataSource interface with LoadZones/Triggers/Rooms/Mobs/Objects
- LegacyWorldDataSource wraps existing BootIndex() calls
- GameLoader::BootWorld() now accepts optional data source parameter
- Excludes zone_rn from room checksums (runtime-calculated value)
- Fixes compiler warnings (unused variable, strncpy truncation)

Checksums verified identical before/after refactoring:
- Small world: B6DA5931
- Full world: 82CF7A3E

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add optional SQLite support via HAVE_SQLITE CMake flag
- Create SqliteWorldDataSource skeleton class (load methods not yet implemented)
- Add Save methods to IWorldDataSource interface for OLC
- Implement Save methods in LegacyWorldDataSource (delegates to *_save_to_disk)
- Add trigedit_save_to_disk function for trigger saving
- Fix compiler warnings in utils.cpp (array bounds, strncpy truncation)
- Add Claude Code workflow rules to CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add complete implementation for loading world data from SQLite database:
- Zones with commands (M,O,G,E,P,D,R,T,V,Q,F) and typeA/typeB groups
- Triggers with script parsing into cmdlist
- Rooms with flags, exits, triggers, and extra descriptions
- Mobs with flags, skills, triggers, and all attributes
- Objects with extra/wear/no/anti flags, applies, triggers, extra descriptions

Schema matches mud-docs/world_schema.sql specification.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- GetText now returns std::string with UTF-8 to KOI8-R conversion
- Add SafeStoi/SafeStol helper functions for safe string-to-number conversion
- Fix all const char* usages to std::string
- Fix to_room to store vnum (not rnum) - RosolveWorldDoorToRoomVnumsToRnums will convert later
- Fix top_of_mobt to be last valid index (not count) for compatibility with CreateBlankMobsDungeon

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove -S command line option for SQLite database path
- Move chdir() before config loading so paths are relative to data dir
- Fix configuration.xml path to be relative (misc/ instead of lib/misc/)
- Auto-detect world.db in data directory: if exists use SQLite, else legacy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add direction_map to convert direction strings (north/east/etc) to numbers
- Fix DOOR command arg2 to use direction_map instead of SafeStoi
- Add load_prob (arg4) loading for GIVE_OBJ commands

Zones checksums now match between legacy and SQLite loaders.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
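
A direction lookup of the kind this commit describes might look like the sketch below. It is written in Python for illustration (the loader itself is C++), and the numbering north=0 through down=5 is an assumption based on the usual CircleMUD ordering, not taken from this PR.

```python
# Assumed numbering: CircleMUD-style order, north=0 ... down=5.
DIRECTION_MAP = {"north": 0, "east": 1, "south": 2, "west": 3, "up": 4, "down": 5}

def parse_direction(value: str) -> int:
    """Convert a direction name to its number; fail loudly on unknown names."""
    try:
        return DIRECTION_MAP[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown direction: {value!r}") from None
```

Raising on an unknown name, instead of falling back to a numeric parse, avoids silently producing a wrong door direction.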
- Set NPC flag before set_level() to avoid clamping mob levels to 34
  (kLvlImplementator limit for non-NPCs)
- Fix long_descr/description column swap (columns 8 and 9)
- Set max_hit to 0 (flag for dice-based HP calculation)
- Add trigger existence validation with warnings for missing triggers
- Use ORDER BY rowid for predictable trigger loading order
- Skip non-existent triggers instead of adding invalid references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tests/utils.encoding.cpp with unit tests for utf8_to_koi function
  covering ASCII, Cyrillic, NO-BREAK SPACE, and box drawing characters
- Fix NO-BREAK SPACE (U+00A0) conversion: UTF-8 0xC2 0xA0 -> KOI8-R 0x9A
- Add lib symlink creation in CMake for running server from build directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
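
The NO-BREAK SPACE mapping fixed here can be cross-checked against Python's built-in `koi8_r` codec. The helper below is only a reference check with a hypothetical name; it is not the project's `utf8_to_koi` function.

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as KOI8-R using Python's standard codecs."""
    return data.decode("utf-8").encode("koi8_r")

# U+00A0 NO-BREAK SPACE is the two bytes C2 A0 in UTF-8.
print(utf8_to_koi8r(b"\xc2\xa0").hex())
```

The same helper can generate expected values for other test cases, such as Cyrillic letters or box drawing characters.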
- Add sex field to SQL query and loading code
- Fix set_level vs set_minimum_remorts bug (was reading level column
  but calling wrong setter)
- Update column indices for max_in_world after sex addition

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SQLite loader now calculates zone_rn incrementally by vnum (matching Legacy)
- Add extra_flags, anti_flags, no_flags, affect_flags to object checksum
- Add extra_descriptions to object checksum

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…lags

- Add kTrap to obj_type_map for proper type loading
- Handle NULL max_in_world by returning -1 (matching Legacy behavior)
- Add affect flag category handling for object weapons

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… loader

- Add kElementWeapon, kMissile, kWorm, kCraftMaterial2 to obj_type_map
- Add missing extra flags (kSwimming, kFlying, kThrowing, plane 1 flags)
- Apply colorLOW to short_description and PNames (match Legacy loader)
- Apply colorCAP to description (match Legacy loader)
- Add utils_string.h include for colorLOW/colorCAP
- Update CLAUDE.md with SQLite world conversion documentation
- Add patch-based editing guidance to CLAUDE.md

Objects match: 99.7% (13 remaining differences)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Clear runtime flags (kTransformed, kTicktimer) after loading objects
- Set max_in_world to -1 for objects with kZonedacay or kRepopDecay flags

This ensures SQLite loader produces identical object prototypes to Legacy.
All 5192 objects now match (100% checksum match).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use normalized trigger_type_bindings table with JOIN query
- Compute trigger_type bitmask from type_chars (a-z = bits 0-25, A-Z = bits 26-51)
- Add TrimRight for script lines to remove trailing whitespace
- Add indent_trigger call to normalize script indentation
- Include dg_olc.h for indent_trigger function

All world checksums now match between Legacy and SQLite loaders.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
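
The bitmask rule stated above (a-z map to bits 0-25, A-Z to bits 26-51) can be sketched as follows. This is a Python illustration with a hypothetical function name; the actual loader is C++, where a mask reaching bit 51 needs a 64-bit integer type.

```python
def trigger_type_mask(type_chars: str) -> int:
    """Build a bitmask from type characters: a-z -> bits 0-25, A-Z -> bits 26-51."""
    mask = 0
    for ch in type_chars:
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
        elif "A" <= ch <= "Z":
            mask |= 1 << (26 + ord(ch) - ord("A"))
        else:
            raise ValueError(f"invalid trigger type character: {ch!r}")
    return mask
```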
- Read obj_type_id, sector_id, attach_type_id, direction_id directly
- Read location_id, skill_id, arg_wear_pos_id, arg_direction_id directly
- Remove unused text-to-enum conversion maps
- Use static_cast for direct integer-to-enum conversion
- Matches normalized schema in mud-docs

All checksums verified to match between Legacy and SQLite loaders.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added tools/:
- convert_to_yaml.py: Legacy world to SQLite/YAML converter
- world_schema.sql: SQLite database schema
- sqlite-world-schema.md: Schema documentation
- compare_world_checksums.sh: Test script for verifying checksums

Updated .gitignore to exclude build directories.
Removed generated checksum files (now in test builds).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 'enabled' column to SQLite schema for zones, rooms, mobs, objects,
  triggers to support index file filtering
- Update converter to read index files and mark non-indexed entities
  as disabled (enabled=0)
- Update SQLite loader to filter on enabled=1, matching Legacy behavior
- Add minimum_remorts column to objects table
- Add detailed checksum comparison infrastructure:
  - SaveDetailedBuffers() saves serialization buffers per entity
  - LoadBaselineChecksums() loads baseline for comparison
  - CompareWithBaseline() reports mismatches with field-level detail
- Update compare_world_checksums.sh with --rebuild and --reconvert flags
- Fix room exit serialization to use vnum instead of rnum

Checksum verification: Small world shows 100% match between Legacy and
SQLite loaders (zones, rooms, mobs, objects, triggers all identical).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
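
The `enabled` filtering can be illustrated with an in-memory SQLite database. The table layout below is a simplified, hypothetical stand-in for the real schema in world_schema.sql; only the `enabled` column and the `enabled=1` filter come from this commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal table; the real schema has many more columns.
conn.execute(
    "CREATE TABLE rooms (vnum INTEGER PRIMARY KEY, name TEXT, "
    "enabled INTEGER NOT NULL DEFAULT 1)"
)
conn.executemany(
    "INSERT INTO rooms (vnum, name, enabled) VALUES (?, ?, ?)",
    [(100, "Cave", 1), (101, "Old cave", 0), (102, "Glade", 1)],
)

def load_enabled_room_vnums(db: sqlite3.Connection) -> list[int]:
    """Return vnums of rooms that the index files marked as enabled."""
    rows = db.execute("SELECT vnum FROM rooms WHERE enabled = 1 ORDER BY vnum")
    return [row[0] for row in rows]

print(load_enabled_room_vnums(conn))
```

Keeping disabled rows in the database, instead of dropping them in the converter, preserves the data while still matching what the Legacy loader would read from the index files.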
When built with HAVE_SQLITE support but world.db file is not found,
exit with error instead of silently falling back to legacy loader.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
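
The resulting selection logic, as this commit and the earlier auto-detection commit describe it, might be summarized like this. It is a Python sketch with a hypothetical function name and return values; the real decision is made in C++ at boot.

```python
import tempfile
from pathlib import Path

def select_data_source(data_dir: str, have_sqlite: bool) -> str:
    """Pick the world data source for a data directory."""
    if not have_sqlite:
        return "legacy"  # built without HAVE_SQLITE: only the legacy loader exists
    if (Path(data_dir) / "world.db").is_file():
        return "sqlite"
    # Built with SQLite support but no database: fail instead of falling back.
    raise FileNotFoundError(f"world.db not found in {data_dir}")

with tempfile.TemporaryDirectory() as demo_dir:
    print(select_data_source(demo_dir, have_sqlite=False))
```

Failing fast makes a missing or misplaced world.db visible at startup, where a silent fallback could boot the server from stale legacy files.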
- Restored KOI8-R encoding (was corrupted in 0c9ca3c)
- Added includes for world_checksum, legacy/sqlite data sources
- Renamed world_loader to game_loader
- Refactored BootWorld to use IWorldDataSource abstraction
- Added checksum calculation and baseline comparison at boot
- Added no_world_checksum flag to disable checksums

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- setup_test_dirs.sh: Creates test directories for Legacy/SQLite comparison
- run_load_tests.sh: Runs performance tests and compares checksums
- Add test/ and magic.mgc to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Trigger constructor expects rnum (runtime array index) as the first
parameter, not vnum (persistent database ID). Passing vnum caused
out-of-bounds array access in GET_TRIG_VNUM macro when the vnum was
larger than the trig_index array size, resulting in segfaults during
zone reset on larger worlds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Filter files by pattern ^\d+\.<ext>$ to ignore backup files like 16.old.obj
- Fix armor parsing for negative values (use lstrip('-').isdigit())
- Use \r\n for joining multi-line aliases and case names (Legacy fread_string
  converts \n to \r\n)
- Remove .strip() calls that were removing control characters like \x1d
- Keep trailing spaces in aliases to match Legacy behavior

This significantly reduces checksum differences:
- MOB: 1354 → 1
- OBJ: still has differences (to be investigated)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
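
Two of the converter fixes above can be sketched in a few lines of Python. The function names are hypothetical and the code is not copied from convert_to_yaml.py; it only demonstrates the filename pattern and the `lstrip('-').isdigit()` check the commit mentions.

```python
import re

def is_world_file(filename: str, ext: str) -> bool:
    """Accept names like '16.obj' but reject backups like '16.old.obj'."""
    return re.fullmatch(rf"\d+\.{re.escape(ext)}", filename) is not None

def parse_armor(token: str, default: int = 0) -> int:
    """Parse an armor value that may be negative, e.g. '-10'."""
    # A bare token.isdigit() is False for '-10', so the sign is stripped first.
    if token.lstrip("-").isdigit():
        try:
            return int(token)
        except ValueError:  # e.g. '--5' passes the check but is not a valid int
            return default
    return default
```

The `try` block is an addition of this sketch: `lstrip('-')` removes any number of leading dashes, so the check alone does not guarantee that `int()` will succeed.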
Schema changes:
- Replace UNIQUE constraint on entity_triggers with trigger_order column
- Allows duplicate triggers (same trigger attached multiple times)

Converter changes:
- Add trigger_order field for proper trigger ordering
- Fix plane 2 offset in parse_ascii_flags (43 → 60)
- Each plane has 30 bits, not varying sizes

Loader changes:
- Add explicit flag maps for affect, anti, no flags
- Replace ITEM_BY_NAME with direct map lookups
- More reliable flag loading without silent failures

Progress (small world after reconvert):
  Zones:    100.0% (0 diff)
  Rooms:     99.9% (3 diff - missing kNoItem plane 2 flag)
  Mobs:     100.0% (0 diff)
  Objects:  100.0% (0 diff)
  Triggers: 100.0% (0 diff)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Add --skip-encoding option
- Check for UTF-8 BOM and Cyrillic in source files
- Increase default dump count to 10
- Add buffer comparison for all entity types (rooms, triggers, zones)
- Show field-by-field diff using | separator
- Use temp files to avoid binary file issues with diff

Progress (small world):
  Zones:    100.0% (0 diff)
  Rooms:     99.9% (3 diff - missing kNoItem plane 2 flag)
  Mobs:     100.0% (0 diff)
  Objects:  100.0% (0 diff)
  Triggers: 100.0% (0 diff)

Progress (full world):
  Zones:     44.6% (354 diff)
  Rooms:     99.0% (435 diff)
  Mobs:     100.0% (0 diff)
  Objects:   99.5% (95 diff)
  Triggers:  97.8% (367 diff)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converter:
- Add load_prob parsing for E (EQUIP_MOB) command

Loader:
- Add arg4 (load_prob) reading for EQUIP_MOB commands
- Add arg4 (load_prob) reading for PUT_OBJ commands

Progress (full world):
  Zones:     78.7% (136 diff - zone.group not loaded yet)
  Rooms:     99.0% (435 diff - kNoItem flag)
  Mobs:     100.0% (0 diff) ✓
  Objects:   99.5% (95 diff)
  Triggers:  97.8% (367 diff)

Progress (small world):
  Zones:    100.0% ✓
  Rooms:     99.9% (3 diff - kNoItem flag)
  Mobs:     100.0% ✓
  Objects:  100.0% ✓
  Triggers: 100.0% ✓

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converter:
- Add UNUSED entries (indices 43-59) to ROOM_FLAGS array
- Plane 2 flags (kNoItem, kDominationArena) now at correct indices 60-61
- Fixes room flags like kNoItem not being parsed from 'a2' format

Tests:
- Add test_convert_to_yaml.py with unit tests for parse_ascii_flags
- Test plane 0, 1, 2 flags parsing
- Test actual room 101 flags from small world
- Verify ROOM_FLAGS array structure

Progress (small world):
  Zones:    100.0% ✓
  Rooms:    100.0% ✓
  Mobs:     100.0% ✓
  Objects:  100.0% ✓
  Triggers: 100.0% ✓

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
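
The index arithmetic behind this fix (30 bits per plane, so plane 2 starts at index 60) can be sketched as below. This is not the converter's actual `parse_ascii_flags`; the function name, the single-token input, the default plane of 0 when no digit follows, and the mapping of uppercase letters to bits 26-29 are assumptions made for illustration.

```python
BITS_PER_PLANE = 30

def flag_index(token: str) -> int:
    """Map a token such as 'a2' (letter plus plane digit) to a flat flag index."""
    if not token:
        raise ValueError("empty flag token")
    letter, plane_text = token[0], token[1:]
    plane = int(plane_text) if plane_text else 0  # assumed default plane
    if "a" <= letter <= "z":
        bit = ord(letter) - ord("a")
    elif "A" <= letter <= "Z":
        bit = 26 + ord(letter) - ord("A")  # assumed: A-D cover bits 26-29
    else:
        raise ValueError(f"invalid flag letter: {letter!r}")
    if bit >= BITS_PER_PLANE:
        raise ValueError(f"bit {bit} does not fit in a {BITS_PER_PLANE}-bit plane")
    return plane * BITS_PER_PLANE + bit
```

With this layout `'a2'` and `'b2'` land on indices 60 and 61, which is where the commit places kNoItem and kDominationArena once the UNUSED padding entries are in the array.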
- Read zone_group from database instead of hardcoding to 1
- Apply same 0->1 conversion as Legacy loader
- Add under_construction column to schema and converter
- Parse 'test' flag in zone files for under_construction
- Small world: 100% match on all categories
- Full world: Zones 93.1%, Rooms 99.3%, Mobs 100%, Objects 99.5%, Triggers 97.8%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Apply same weight correction as Legacy for kLiquidContainer
and kFountain objects: if weight < val1, set weight = val1 + 5.

Progress on full world:
- Objects: 99.5% (90 diff, was 95)
- Small world: 100% match

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
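
The correction rule reads, in a minimal Python sketch (hypothetical function name; the loader applies it in C++ to kLiquidContainer and kFountain objects only):

```python
def corrected_weight(weight: int, val1: int) -> int:
    """If the container weighs less than val1, bump its weight to val1 + 5."""
    return val1 + 5 if weight < val1 else weight
```

For example, a container with weight 3 and val1 of 10 ends up with weight 15, while one that already weighs 10 or more is left unchanged.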
kvirund and others added 30 commits February 23, 2026 00:33
Stack: OTEL Collector -> Prometheus / Tempo / Loki -> Grafana

Files:
- docker-compose.observability.yml  - full stack definition
- otel-collector-config.yaml        - OTLP receiver, batch/memory processors
- prometheus.yml                    - scrape configs + remote write receiver
- tempo-config.yaml                 - trace storage (local filesystem)
- loki-config.yaml                  - log storage (tsdb v12)
- grafana/provisioning/datasources/ - Prometheus, Tempo, Loki with trace-log correlation
- grafana/provisioning/dashboards/  - auto-provision from ./dashboards/

Usage:
  cd tools/observability
  docker compose -f docker-compose.observability.yml up -d

Grafana: http://localhost:3000 (admin/admin123)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… conflicts

ERROR, DEBUG, INFO, WARN are defined as macros in Windows headers, conflicting
with the LogLevel enum values. Rename to kError, kDebug, kInfo, kWarn following
the project's k-prefix convention for enum values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- otel_metrics.h: add #include <cstdint> for int64_t (MinGW doesn't
  pull it in transitively unlike GCC/MSVC)
- docker-compose.observability.yml: bind ports to 127.0.0.1 only;
  remove unused tempo OTLP port mapping
- build.yml: add Linux / GCC / OTEL and OTEL + Admin API matrix entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- otel_traces.h: add #include <cstdint> for int64_t (MinGW)
- build.yml: rename Admin API builds to include + OTEL in name;
  replace non-existent libopentelemetry-cpp-dev with source build
  (cached at /opt/opentelemetry-cpp, key otel-cpp-1.24.0-ubuntu-x64)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docker-compose: replace named volumes with bind mounts using
  ${DATA_DIR:-./data}; each service stores in its own subdirectory
- loki-config.yaml: increase retention from 30d (720h) to 1 year (8760h);
  full index kept for entire period (dedup savings are negligible)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e (DATA_DIR)

Default run uses Docker named volumes. To store data in a specific
host directory, use the override file with DATA_DIR set:

  DATA_DIR=/var/lib/mud-observability \
  docker-compose -f docker-compose.observability.yml \
                 -f docker-compose.data-dir.yml up -d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sions

- DEPLOYMENT_GUIDE.md: document docker.io + docker-compose packages;
  update launch section with Variant A (named volumes) and Variant B
  (DATA_DIR bind mounts); add chmod 644 step for config files
- docker-compose.data-dir.yml: add user: UID:GID to all services so
  written files are owned by the host user, not container-internal users

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Selects compose mode based on DATA_DIR:
- DATA_DIR unset: named volumes (docker-compose.observability.yml only)
- DATA_DIR set:   bind mounts to host dir (+ docker-compose.data-dir.yml),
                  creates subdirs, sets UID/GID for correct file ownership

Passes all arguments through to docker-compose (e.g. up -d, down, logs -f).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
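
The mode selection can be sketched as a function that builds the docker-compose argument list. This is a Python illustration with a hypothetical function name, not the wrapper script itself, and it leaves out the subdirectory creation and UID/GID setup the commit mentions.

```python
import os

def compose_command(args, env=None):
    """Build the docker-compose command line depending on DATA_DIR."""
    env = os.environ if env is None else env
    cmd = ["docker-compose", "-f", "docker-compose.observability.yml"]
    if env.get("DATA_DIR"):
        # DATA_DIR set: layer the bind-mount override on top of the base file.
        cmd += ["-f", "docker-compose.data-dir.yml"]
    # All remaining arguments (up -d, down, logs -f, ...) are passed through.
    return cmd + list(args)

print(compose_command(["up", "-d"], env={}))
```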
…ents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Script supports two modes:
- vcpkg (default): installs vcpkg to ~/vcpkg and builds SDK via it
- --source: builds from source, installs to /usr/local (~15 min)

DEPLOYMENT_GUIDE.md: new section "Сборка Bylins MUD с поддержкой OTEL"
("Building Bylins MUD with OTEL support"), with a note that binary-deploy
users can skip it entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- All 3 dashboards translated: panel titles, legends, descriptions
- Add read-only description with "Save As" hint to each dashboard
- dashboards.yml: set allowUiUpdates=false to prevent accidental
  UI edits that would be overwritten on next file reload

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- editable: false in all 3 dashboard JSONs (hides edit button in UI)
- Add read-only notice text panel at top of each dashboard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… miss)

opentelemetry-cpp cmake config requires protobuf to be findable at
configure time regardless of whether the SDK was just built or restored
from cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rsion})

Move OtelProvider::Initialize from load_telemetry_configuration() into
a new setup_telemetry(port) method called from comm.cpp after the port
is known. This enables variable substitution in the service name:
${port}, ${host}, ${version}.

Update default configuration.xml with an uncommented <telemetry> section
(enabled=false by default). Without WITH_OTEL build flag the section is
parsed but ignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
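
The `${port}`, `${host}`, `${version}` substitution can be sketched with Python's `string.Template`, which happens to use the same placeholder syntax. The function name and the example pattern are hypothetical; the server performs this expansion in C++ inside `setup_telemetry(port)`.

```python
from string import Template

def expand_service_name(pattern: str, port: int, host: str, version: str) -> str:
    """Replace ${port}, ${host} and ${version}; unknown placeholders are kept."""
    return Template(pattern).safe_substitute(port=port, host=host, version=version)

print(expand_service_name("bylins-${host}-${port}", 4000, "mud01", "1.0"))
```

This also shows why initialization had to move: the port is not known while the configuration file is being parsed, so the name can only be expanded afterwards.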
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Handle three cases:
- vcpkg binary present -> skip clone+bootstrap
- .git dir present but no binary -> bootstrap only
- empty/missing dir -> clone and bootstrap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
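
The three cases can be written as a small decision function. This is a Python sketch of the logic only; the function name is hypothetical, and it assumes the binary is a file named `vcpkg` directly inside the vcpkg directory.

```python
import tempfile
from pathlib import Path

def vcpkg_setup_steps(vcpkg_dir: Path) -> list[str]:
    """Decide which setup steps are still needed for a vcpkg directory."""
    if (vcpkg_dir / "vcpkg").is_file():
        return []                      # binary present: nothing to do
    if (vcpkg_dir / ".git").is_dir():
        return ["bootstrap"]           # cloned earlier but never bootstrapped
    return ["clone", "bootstrap"]      # empty or missing directory

with tempfile.TemporaryDirectory() as tmp:
    print(vcpkg_setup_steps(Path(tmp) / "missing"))
```

Checking for the binary first makes reruns of the script cheap, and the middle case recovers from an interrupted earlier run without cloning again.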
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cpkg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- heartbeat.cpp: fix printf format specifier %lld -> %d for int argument
- global_objects.cpp: fix -Wreorder by matching initializer order to member declaration order

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cygwin-install-action already caches Cygwin packages. Cache C:\cygwin\usr\local
(googletest + yaml-cpp install prefix) to skip ~10 min source builds on cache hit.

Keys:
  cygwin-gtest-1.14.0            (Base job)
  cygwin-gtest-1.14.0-yaml-0.7.0 (YAML job)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both jobs install to the same C:\cygwin\usr\local prefix, so they
can share one cache entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Base job caches only googletest (cygwin-gtest-1.14.0).
YAML job caches googletest + yaml-cpp (cygwin-gtest-1.14.0-yaml-0.7.0).
Shared key risks Base saving an incomplete cache before YAML runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>