- Reviewed `docs/superpowers/plans/2026-03-31-m7-3-m7-5-benchmark-eval-foundation.md` and corrected the plan steps for:
  - deterministic evaluation accuracy calculation
  - `handleRunEvaluation` reply wiring so `evaluationResults` is returned together with `evaluationJob`
  - evaluation artifact persistence on a fresh `jobs_root`
  - touched-scope coverage commands so benchmark persistence paths are included
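A deterministic accuracy calculation of the kind the first bullet refers to can be sketched as below. This is illustrative only: the function name and record shape are assumptions, not the project's actual evaluation API. The key point is sorting records by a stable key before scoring, so repeated runs produce identical results regardless of input iteration order.

```python
from typing import Iterable, Mapping


def evaluation_accuracy(records: Iterable[Mapping[str, str]]) -> float:
    """Compute exact-match accuracy deterministically.

    Records are sorted by id before scoring so repeated runs over the
    same data always visit items in the same order and yield the same
    rounded result, regardless of input iteration order.
    """
    ordered = sorted(records, key=lambda r: r["id"])
    if not ordered:
        return 0.0
    correct = sum(1 for r in ordered if r["prediction"] == r["expected"])
    # Fixed rounding keeps the reported figure stable across runs.
    return round(correct / len(ordered), 6)
```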
- Verification summary for the M7.3-M7.5 plan update:
  - `make proto`: pass
  - `pytest` touched-scope Python suite: 50 passed
  - scratch-path Swift test for `ControlPlaneServiceTests/executeHandlesOpsRunEvaluationThroughTheModelOperationsWorker`: pass
- Metrics report:
  - changed-line coverage for the touched Python scope: N/A
    - reason: the current uncommitted change set for this review transaction is documentation-only, so `scripts/python_changed_line_coverage.py` reported `TOTAL 100.00% 0/0` and exited non-zero because there were no measurable changed Python lines
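A wrapper that wants to treat that 0/0 outcome as N/A rather than a failure could classify results roughly as follows. This is a sketch under assumptions: the threshold handling and return labels are illustrative, not taken from `scripts/python_changed_line_coverage.py` itself.

```python
def classify_changed_line_coverage(
    covered: int, changed: int, threshold: float = 90.0
) -> str:
    """Classify a changed-line coverage result.

    With no measurable changed lines (0/0), coverage is mathematically
    undefined, so the run is reported as "n/a" instead of being scored
    against the pass/fail threshold.
    """
    if changed == 0:
        return "n/a"
    pct = 100.0 * covered / changed
    return "pass" if pct >= threshold else "fail"
```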
- Audited M6 implementation against child plans.
- Confirmed Python quantization benchmark, gate, and focused test suite pass with explicit `PYTHONPATH`.
- Identified remaining work for M6 closure:
  - benchmark evidence gap for active KV and sparse prefill
  - runbook gap for sparse-prefill verification
  - lock-scope semantics gap for family or protected-scope conflicts
- Added `docs/plans/2026-03-31-m6-completion-closure.md`.
- Added `docs/runbooks/m6-acceleration-benchmarks.md`.
- Added Python tests for:
  - linked quantized-artifact upload conflict locking
  - sparse-prefill metrics exposure in `phase2_metrics_report.py`
  - sparse-prefill probe collection in the Phase 2 direct worker report
- Updated quantization manifests to carry `protected_scope` metadata.
- Updated upload conflict locking to use linked quantization identity before falling back to raw artifact paths.
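The fallback order just described can be sketched as a lock-key selection helper. The manifest field names here are assumptions for illustration, not the project's actual schema; the point is that artifacts sharing a linked quantization identity contend on one lock, and the raw artifact path is used only when no linked identity exists.

```python
from typing import Mapping, Optional


def upload_lock_key(manifest: Mapping[str, Optional[str]]) -> str:
    """Choose the lock key for an artifact upload.

    Prefer the linked quantization identity so every artifact derived
    from the same quantization serializes on one lock; fall back to the
    raw artifact path only when no linked identity is present.
    """
    linked = manifest.get("linked_quantization_id")
    if linked:
        return f"quantization:{linked}"
    return f"artifact:{manifest['artifact_path']}"
```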
- Extended `scripts/phase2_metrics_report.py` with a `prefill_sparse` probe and sparse-prefill counters in the output.
- Verification summary:
  - `pytest` focused M6 Python suite: 39 passed
  - `scripts/quantization_benchmarks.py --json`: `profile_count = 7`, `smoke_pass_rate = 100.0`
  - `scripts/quantization_release_gate.py --json`: `passed = true`
  - `scripts/phase5_model_ops_metrics.py`: quantize `job_ms = 0.965`, `artifact_bytes = 670`, `manifest_bytes = 1923`
  - live `make phase2-metrics --json` with `MELIX_RUNTIME_DIR=.runtime/m6-phase2`:
    - `decode_active_kv_quantized.active_kv_quantization_ratio = 25`
    - `decode_active_kv_quantized.tokens_per_second = 41.22`
    - `prefill_sparse.sparse_prefill_accepted_skip_count = 1`
    - `prefill_sparse.accelerated_prefill_gain_pct = 83`
- Committed M6 closure as `2f270b9` (feat: close m6 acceleration completion gaps).
- Began M7 with `docs/plans/2026-03-31-m7-1-m7-2-benchmark-schema-foundation.md`.
- Landed initial M7 foundation changes in the working tree:
  - typed benchmark and evaluation schema messages in the control-plane proto
  - Python benchmark schema helpers under `worker/productization/benchmark_schemas.py`
  - release-gate benchmark evidence now carries structured `job` and `results`
  - control-plane `ops.run_bench` now assembles typed benchmark job and result payloads
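Typed job/result payload assembly of the kind listed above could look roughly like the dataclasses below. The field names are illustrative assumptions, not necessarily those in `worker/productization/benchmark_schemas.py`; the sketch just shows structured `job` and `results` traveling together as one evidence payload.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BenchmarkJob:
    """Typed description of one benchmark run request."""
    job_id: str
    model_id: str
    suite: str


@dataclass
class BenchmarkResult:
    """One typed metric row produced by a benchmark run."""
    metric: str
    value: float


@dataclass
class BenchmarkEvidence:
    """Structured evidence: the job plus its results, as a gate consumes it."""
    job: BenchmarkJob
    results: List[BenchmarkResult] = field(default_factory=list)

    def to_payload(self) -> dict:
        # Plain-dict form suitable for JSON serialization into gate evidence.
        return asdict(self)
```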
- Verification so far for M7 foundation:
  - `services/mlx-worker-python/tests/test_benchmark_schemas.py`: pass
  - `services/mlx-worker-python/tests/test_release_gates.py`: pass
  - scratch-path Swift test for `ControlPlaneServiceTests/executeHandlesOpsRunBenchThroughTheModelOperationsWorker`: still compiling or pending final result at handoff time