This library will be a core component of Netdata. Once merged and integrated, it will be downloaded 1.5M+ times per day and installed on physical servers, VMs, IoT devices, and exotic environments. It will be stressed in ways we cannot predict.
The only valid approach: quality, performance, simplicity, maintainability, smooth + reliable + robust operations.
The obviously invalid approaches:
- Overlooking "secondary" issues — there are no secondary issues in a library running on 1.5M+ installations
- Optimizing to minimize work — cutting corners creates risk at scale
- Deferring known issues to later phases — if the spec says it, it must be implemented now, not "when it becomes critical"
- Dismissing review findings as "low priority" — every finding from any reviewer must be evaluated honestly and fixed if valid
- Taking shortcuts on validation, error handling, or edge cases — every unvalidated input is a potential crash on production systems
Concrete rules:
- Every input from the wire must be validated before use. No exceptions.
- Every spec requirement must be implemented in the phase where it belongs. Do not defer spec requirements to later phases.
- Every review finding must be fixed unless it is demonstrably wrong (not "low priority" — wrong).
- Every code path must be tested, including error paths and edge cases.
- No hardcoded magic numbers — use named constants.
- No panics, no crashes, no undefined behavior on any input.
- Security is not optional. Path traversal, buffer overflow, integer overflow, unvalidated lengths — all must be prevented by design.
STOP. Before doing anything, read every file in docs/ in full.
Not skimming. Not pattern-scanning. Every file, every line. The specs
are the authority. Implementation must match them exactly.
Files to read in order:
docs/README.md— architecture overviewdocs/level1-transport.md— L1 principles and featuresdocs/level1-wire-envelope.md— wire format byte layoutsdocs/level1-posix-uds.md— UDS transport contractdocs/level1-posix-shm.md— POSIX SHM contractdocs/level1-windows-np.md— Named Pipe contractdocs/level1-windows-shm.md— Windows SHM contractdocs/codec.md— Codec principlesdocs/codec-cgroups-snapshot.md— cgroups snapshot wire layoutdocs/level2-typed-api.md— L2 orchestrationdocs/level3-snapshot-api.md— L3 cachingdocs/code-organization.md— module boundaries and file layout
Do not write code until all 12 files are read and understood.
- The specs in
docs/are closed, reviewed, and validated through 4 rounds of independent AI review. - The old implementation exists in git history (committed and pushed
on branch
main) but will be deleted in Phase 0. - The old implementation served its purpose for requirements discovery but does not match the specs. A clean rewrite is faster and cleaner than refactoring.
- The old code remains accessible via
git log/git showas reference for platform-specific edge cases.
┌─────────────────────────────────────────────┐
│ Level 3: Caching (client-side only) │
│ (built on Level 2) │
├─────────────────────────────────────────────┤
│ Level 2: Orchestration │
│ (built on Level 1 + Codec) │
├──────────────────────┬──────────────────────┤
│ Level 1: │ Codec: │
│ Transport │ Wire ↔ Typed │
│ (opaque bytes) │ (no I/O) │
└──────────────────────┴──────────────────────┘
- L1 and Codec are parallel — neither depends on the other.
- L2 depends on L1 + Codec.
- L3 is client-side caching on top of L2. Server side is pure L2.
- Complete one phase fully before moving to the next.
- Each phase includes: implementation, unit tests, integration tests, fuzz tests (where applicable), and validation.
- Commit after each completed phase. Use descriptive commit messages.
- Push after each commit.
- Do not start the next phase until the current phase passes all its validation criteria.
- If a phase goes wrong,
git resetto the last committed checkpoint and retry. Do not accumulate broken state. - Between phases, Costa reviews. Do not proceed to the next phase without review approval.
- 100% test coverage (line + branch) for all library code.
- Fuzz testing for all decode/parse paths.
- Cross-language interop tests for all wire contracts.
- Abnormal path coverage for all failure modes.
- No exceptions. Nothing integrates into Netdata without this.
After completing each phase, run ALL of these reviewers in parallel and incorporate their feedback before committing:
# Codex (gpt-5.4)
timeout 1800 codex exec "[prompt]" --skip-git-repo-check
# GLM-5
timeout 1800 opencode run -m "llm-netdata-cloud/glm-5" --agent code-reviewer "[prompt]"
# Kimi-K2.5
timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.5-alibaba" --agent code-reviewer "[prompt]"
# Qwen3.5
timeout 1800 ~/.npm-global/bin/qwen --yolo -p "[prompt]"Prompt template for reviewers:
Read all files in docs/ in this repository (the specs).
Then review the implementation in [files changed in this phase].
Check for:
1. Spec compliance — does the implementation match the specs exactly?
2. Wire compatibility — will C, Rust, Go produce identical bytes?
3. Test coverage — are all paths tested, including failure paths?
4. Fuzz coverage — are all decode paths fuzz-tested?
5. Code quality — is the code minimal, clean, and maintainable?
DO NOT MAKE CHANGES. PROVIDE YOUR REVIEW.
All reviewers must agree before committing. If any reviewer finds a real issue, fix it and re-review.
Windows phases are developed locally on Linux, committed, pushed,
then validated on win11:
ssh win11
cd ~/src/plugin-ipc.git
git pull
# compile, test, benchmarkIf Windows validation fails, fix locally, push, re-validate remotely.
- Go must be pure Go. No cgo. No exceptions. This is a hard
requirement for compatibility with Netdata's
go.d.plugin. - Go SHM on Linux uses raw
SYS_FUTEXsyscall, not cgo semaphores. - All languages share the same wire format. If Go can't do something without cgo, find a pure-Go alternative or fall back to baseline.
- CMake is the top-level build system. All targets (C, Rust, Go, tests, benchmarks, fixtures) must be buildable via CMake.
Cargo.tomlandgo.modremain as language-native metadata.- CMake drives
cargo buildandgo buildas custom targets. cmake -S . -B build && cmake --build buildmust build everything.
- SHM max throughput below 1M req/s is a bug — on both POSIX and Windows.
- Reference baselines from the old implementation (on this machine):
- SHM ping-pong max: C→C ~3.2M, Rust→Rust ~2.9M req/s
- SHM snapshot refresh max: C→C ~2.5M, Go→Go ~1.2M req/s
- UDS ping-pong max: C→C ~220k, Rust→Rust ~240k req/s
- UDS snapshot refresh max: C→C ~194k, Go→Go ~127k req/s
- Negotiated SHM profile: ~2.8M req/s
- Local cache lookup: C ~25M, Rust ~23M, Go ~13M lookups/s
- The new implementation must match or exceed these numbers.
- If performance regresses, investigate and fix before committing.
- Benchmarks must cover the full directed matrix: all C/Rust/Go client-server pairs in both directions (9 pairs for 3 languages).
- Scenarios: max throughput, 100k/s rate-limited, 10k/s rate-limited.
- Each benchmark validates correctness (counter chain verification).
- Benchmark documents (
benchmarks-posix.md,benchmarks-windows.md) are generated from complete runs, never hand-edited.
- Mirror existing Netdata patterns in each language.
- Small files, small functions, single purpose.
- Comments explain "why", not "what".
- Type and function naming follows the Codec spec:
View,Builder,copy/materialize/to_owned. - No over-engineering. No feature flags. No backwards-compatibility shims. Just implement the spec.
src/
libnetdata/netipc/ # C library
include/netipc/ # C public headers
src/
protocol/ # Codec
transport/
posix/ # L1: UDS, SHM
windows/ # L1: Named Pipe, SHM
service/ # L2/L3
crates/netipc/ # Rust crate
src/
protocol.rs # Codec
transport/
posix.rs # L1: UDS, SHM
windows.rs # L1: Named Pipe, SHM
service/ # L2/L3
go/pkg/netipc/ # Go package
protocol/ # Codec
transport/
posix/ # L1: UDS, SHM
windows/ # L1: Named Pipe, SHM
service/ # L2/L3
tests/ # Test scripts and fixtures
bench/ # Benchmark drivers and scripts
docs/ # Specs (already written, do not modify)
- Goal: Remove old implementation, set up build skeleton.
- Actions:
- Delete all files under
src/(old implementation) - Delete all files under
tests/andbench/(old test/bench infra) - Keep
docs/intact - Keep root build files:
CMakeLists.txt,Makefile,go.mod,go.sum - Create empty directory structure per the layout above
- Create minimal
CMakeLists.txtthat configures but builds nothing - Create minimal
Cargo.tomlfor the Rust crate - Verify
go.modis correct - Ensure
cmake -S . -B buildsucceeds
- Delete all files under
- Validation: clean configure, empty build succeeds.
- Commit: "Clean slate: remove old implementation, keep specs"
- Goal: Protocol encode/decode in C, matching the wire envelope spec.
- Scope:
- Outer message header encode/decode (32 bytes)
- Chunk continuation header encode/decode (32 bytes)
- Batch item directory encode/decode
- Hello payload encode/decode (44 bytes)
- Hello-ack payload encode/decode (36 bytes)
- Cgroups snapshot request encode/decode (4 bytes)
- Cgroups snapshot response encode/decode (24-byte header + item directory + items with 32-byte item headers)
- Cgroups snapshot response builder
- All validation rules from the specs
- Files:
src/libnetdata/netipc/include/netipc/netipc_protocol.hsrc/libnetdata/netipc/src/protocol/netipc_protocol.c
- Validation:
- Unit tests for every encode/decode path
- Round-trip tests (encode then decode, verify match)
- Validation rejection tests (truncated, out-of-bounds, missing NUL)
- Compile-time assertions for struct sizes and field offsets
- Commit: "Phase 1: C Codec and L1 wire envelope"
- Goal: Same protocol encode/decode in Rust.
- Scope: Same as Phase 1, in Rust.
- Files:
src/crates/netipc/src/protocol.rs
- Validation:
- Rust unit tests matching Phase 1 coverage
- Round-trip tests
- Validation rejection tests
- File-based C↔Rust interop test: C encodes to file, Rust decodes (and vice versa)
- Commit: "Phase 2: Rust Codec and L1 wire envelope"
- Goal: Same protocol encode/decode in Go.
- Scope: Same as Phase 1, in Go. Pure Go, no cgo.
- Files:
src/go/pkg/netipc/protocol/frame.go
- Validation:
- Go unit tests matching Phase 1 coverage
- Round-trip tests
- Validation rejection tests
- File-based interop test: full C↔Rust↔Go matrix
- Commit: "Phase 3: Go Codec and L1 wire envelope"
- Goal: Fuzz all decode paths in all languages.
- Scope:
- Go:
go test -fuzztargets for all decode functions - Rust:
cargo-fuzztargets or proptest for all decode functions - C: libfuzzer or AFL harness for all decode functions
- Go:
- Validation:
- Each fuzz target runs for at least 30 seconds without crash/panic
- Corpus includes: zero bytes, truncated headers, maximum-size payloads, corrupt offsets, missing NUL terminators
- Commit: "Phase 4: Fuzz testing for all Codec decode paths"
- Goal: UDS SEQPACKET client/server transport in C.
- Scope:
- Socket path derivation
- Listener: bind, listen, accept multiple clients
- Session: handshake (CONTROL messages with hello/hello-ack)
- Send/receive with transparent chunking
- Batch assembly/extraction
- Message_id tracking
- Stale endpoint recovery
- Native fd exposure
- Simple test client and server binaries
- Files:
src/libnetdata/netipc/include/netipc/netipc_uds.hsrc/libnetdata/netipc/src/transport/posix/netipc_uds.ctests/fixtures/c/— test client/server
- Validation:
- Single client ping-pong (send request, receive response)
- Multi-client concurrent sessions
- Pipelining (send N requests, receive N responses out of order)
- Batch send/receive
- Chunking (message larger than packet size)
- Handshake failure tests (bad auth, profile mismatch)
- Stale socket recovery test
- Disconnect with in-flight requests test
- Commit: "Phase 5: C POSIX UDS transport"
- Goal: Same UDS transport in Rust.
- Scope: Same as Phase 5, in Rust.
- Validation:
- Same test coverage as Phase 5
- Live C↔Rust interop: C client → Rust server, Rust client → C server
- Commit: "Phase 6: Rust POSIX UDS transport"
- Goal: Same UDS transport in Go. Pure Go, no cgo.
- Scope: Same as Phase 5, in Go.
- Validation:
- Same test coverage as Phase 5
- Full C↔Rust↔Go live interop matrix (all 9 directed pairs)
- Commit: "Phase 7: Go POSIX UDS transport"
- Goal: Linux SHM transport in C using futex synchronization.
- Scope:
- SHM region creation/destruction (64-byte header)
- Variable-length request/response publication areas
- Spin + futex wait/wake protocol
- SHM file path derivation
- Profile negotiation upgrade from UDS to SHM
- Stale region recovery
- Validation:
- SHM client/server round-trip
- Profile negotiation: both support SHM → upgrades
- Profile negotiation: only one supports SHM → stays UDS
- SHM disconnect detection
- Stale region recovery
- Commit: "Phase 8: C POSIX SHM transport"
- Goal: Same SHM transport in Rust and Go.
- Scope: Same as Phase 8. Go uses pure-Go futex via raw syscall.
- Validation:
- Same test coverage as Phase 8 in each language
- Full C↔Rust↔Go SHM interop matrix
- Negotiated profile interop (mixed SHM/UDS-only peers)
- Commit: "Phase 9: Rust and Go POSIX SHM transport"
- Goal: Level 2 client context and managed server in C.
- Scope:
- Client context: initialize, refresh, ready, status, close
- Typed cgroups snapshot call (using Codec encode/decode)
- At-least-once retry (must reconnect if previously READY)
- Check transport_status before decode
- Managed server: acceptor, fixed worker pool, typed callback dispatch, per-connection write serialization
- Batch dispatch: split across workers, reassemble in order
- Handler failure → INTERNAL_ERROR with empty payload
- Validation:
- Client lifecycle tests (all state transitions)
- Typed call success/failure/retry paths
- Managed server with multiple concurrent clients
- Batch dispatch correctness
- Handler failure handling
- Commit: "Phase 10: C Level 2 orchestration"
- Goal: Same L2 in Rust and Go.
- Validation:
- Same test coverage as Phase 10
- Cross-language L2 matrix: C client → Rust server, Go client → C server, etc.
- Commit: "Phase 11: Rust and Go Level 2 orchestration"
- Goal: Client-side caching for snapshot services in all languages.
- Scope:
- Refresh orchestration (uses L2 typed calls)
- Local cache construction from snapshot response
- Lookup by hash + name
- Cache preservation on refresh failure
- Status reporting (empty / populated)
- Validation:
- Full snapshot round-trip: server (L2) → client (L3) → lookup
- Refresh failure preserves cache
- Reconnect after server restart rebuilds cache
- Empty cache lookup returns not-found
- Large dataset (1000+ items)
- Cross-language: all producer/consumer pairs
- Commit: "Phase 12: Level 3 cache helpers"
- Goal: Performance validation on Linux. Full directed matrix.
- Scope:
- UDS ping-pong: full 9-pair C/Rust/Go matrix at max, 100k/s, 10k/s
- SHM ping-pong: full 9-pair C/Rust/Go matrix at max, 100k/s, 10k/s
- Negotiated profile: UDS vs SHM comparison (Rust→Rust)
- Snapshot refresh baseline: full 9-pair matrix at max, 1000/s
- Snapshot refresh SHM: full 9-pair matrix at max, 1000/s
- Local cache lookup: C, Rust, Go at max, 1000/s
- Benchmark document generation
- Performance floor (below these = bug, investigate and fix):
- SHM ping-pong max: ≥ 1M req/s for all pairs
- SHM snapshot refresh max: ≥ 1M req/s for C/Rust pairs, ≥ 800k for Go pairs
- UDS ping-pong max: ≥ 150k req/s for all pairs
- UDS snapshot refresh max: ≥ 100k req/s for all pairs
- Local cache lookup max: ≥ 10M lookups/s for all languages
- Reference targets (match or exceed the old implementation):
- SHM ping-pong: C→C ~3.2M, Rust→Rust ~2.9M, Go→Go ~1.2M
- SHM snapshot: C→C ~2.5M, Rust→Rust ~2.0M, Go→Go ~1.2M
- UDS ping-pong: C→C ~220k, Rust→Rust ~240k, Go→Go ~164k
- Negotiated SHM: ~2.8M
- Local lookup: C ~25M, Rust ~23M, Go ~13M
- Validation:
- All benchmarks pass correctness checks (counter chain verification)
- All pairs meet the performance floor
benchmarks-posix.mdgenerated from complete run
- Commit: "Phase 13: POSIX benchmarks"
- Goal: Windows Named Pipe transport in C, Rust, Go.
- Scope:
- Pipe name derivation (FNV-1a hash)
- Named Pipe client/server transport
- Handshake over Named Pipe
- Chunking
- MSYS C build compatibility
- Development: local (Linux cross-check where possible)
- Validation:
ssh win11, pull, compile, run tests- Same functional coverage as POSIX UDS phases
- c-native, c-msys, rust-native, go-native interop matrix
- Commit: "Phase 14: Windows Named Pipe transport"
- Goal: Windows SHM transport in C, Rust, Go.
- Scope:
- File mapping with 128-byte header
- Auto-reset kernel events for SHM_HYBRID
- Spin + WaitForSingleObject protocol
- SHM_BUSYWAIT (pure spin) variant
- Profile negotiation upgrade
- Validation:
ssh win11- SHM interop matrix across all 4 implementations
- Profile negotiation (SHM/baseline mixed peers)
- Commit: "Phase 15: Windows SHM transport"
- Goal: Level 2, Level 3, and benchmarks on Windows.
- Validation:
ssh win11- Full L2 cross-language matrix
- L3 snapshot round-trip
- Full directed benchmark matrix: c-native, c-msys, rust-native, go-native (all pairs, all scenarios)
benchmarks-windows.mdgenerated from complete run
- Performance floor (below these = bug):
- Windows SHM max: ≥ 1M req/s for C/Rust pairs
- Windows Named Pipe max: baseline sanity, no regression from old
- Commit: "Phase 16: Windows L2, L3, and benchmarks"
The rewrite is complete when:
- All 16 phases are committed and pushed
- All specs in
docs/are satisfied by the implementation - All cross-language interop tests pass on Linux and Windows
- All fuzz tests pass without crashes
- Benchmark documents are generated from complete runs
- All 4 AI reviewers agree on each phase
- Costa approves the final state