Add resourceVersion conflict checking for optimistic concurrency#76
Open
tgoodwin wants to merge 24 commits intostaleness-fixingfrom
Open
Add resourceVersion conflict checking for optimistic concurrency#76tgoodwin wants to merge 24 commits intostaleness-fixingfrom
tgoodwin wants to merge 24 commits intostaleness-fixingfrom
Conversation
Implement interval-based staleness model
…stively Replace old StalenessConfig/Monte Carlo approach with StalenessInterval-based plans. For each observed (controller, kind) pair, exhaustively enumerate all [staleAt, catchUpAt) windows over [0, maxSeq] (naive strategy that can be refined later)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move applyInputTuning logic into a shared pkg/explore.ApplyInputTuning function that covers all perturbation fields: MaxDepth, PermuteControllers, StaleReads, UserActionReadyDepths, StalenessIntervals, and Search/MonteCarlo. remove the duplicated local implementations from all five example harnesses.
…phase introduces a Phase 6 that guides agents to form targeted perturbation experiments from unverified hypotheses. covers the hypothesis->perturbation decision tree, how to read KindSequences from the trace to configure staleness intervals, variant file conventions, re-run commands, and per-hypothesis outcome tracking.
The rollup functions panicked with "unknown op type" when replaying event sequences containing APPLY operations (Kubernetes server-side apply). This blocked stalenessIntervals from activating in any scenario using SSA (e.g. Crossplane's CompositeReconciler). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eted applications of the order permutation perturbation strategy
…lenessIntervals Adds --no-perturbations flag for a clean reference run from an inputs file that already has permuteControllers or stalenessIntervals configured, without needing to edit the JSON. Also fixes disablePerturbations() to clear StalenessIntervals (the new field); previously only the legacy Staleness map was cleared, meaning closed-loop reference phases were silently applying staleness intervals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rity - Add Run Modes section documenting the three flag-controlled execution modes (reference, exploration, closed-loop) and the agent workflow guidance - Update hypothesis classification table to label permuteControllers as required base and permuteAfterEvent/permuteDepthRange as optional scoping modifiers; add callout explaining AND-logic composition - Expand JSON section with structured explanation of required vs optional permutation fields; document permuteForSteps as planned future mechanism - Update Step 4 run commands to show --perturb=false and --no-perturbations flags Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old name was misleading: --perturb=false didn't mean "no perturbations", it meant "single exploration pass with config as-is". The new --closed-loop name accurately reflects that the flag controls whether the automated reference+rerun pipeline runs. - flags.go: rename perturbFlag -> closedLoopFlag, PerturbEnabled() -> ClosedLoopEnabled() - Update all call sites in parallel_runner.go, flags_test.go, parallel_runner_test.go - Update examples/knative-serving main.go, README.md, AGENTS.md - Update scenario-exploration skill Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Reframe the skill intro as a full 5-step agent loop rather than "analyze a run that was handed to you" - Prerequisites now instructs agents to start with a clean reference run (--closed-loop=false --no-perturbations) before analyzing, and explains why this is the right entry point - Phase 6 context replaces the dry planFn description with a clear framing: the agent IS the perturbation planner, choosing targeted experiments based on what the trace revealed rather than the exhaustive-but-generic automated pipeline Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four targeted fixes: - Cycling guidance: "cycling does not mean no findings" — add partial-trace analysis steps and jq command for finding error steps in cycling runs - Phase 2: note that reference runs produce a single state so diff is a no-op; direct agent to Phase 6 for ordering divergence setup - Phase 3: add upfront note that ordering analysis requires an exploration-mode run with permuteControllers; reference-only traces will find nothing here - Phase 3 "reasons an ordering wasn't explored": remove stale closed-loop item, replace with agent-relevant reasons (not in permuteControllers, permuteAfterEvent suppressed trigger, etc.) - Quick Reference: add cycling/partial-trace jq commands (error steps, effect type counts, KindSequence writes, cross-state error comparison) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…currency Implements resourceVersion (RV) semantics in kamera's simulated API server to model Kubernetes optimistic concurrency control. This enables detection of bugs that only manifest when write conflicts occur under informer cache staleness. Infrastructure: - Track integer RVs per resource key per reconcile frame (effectRVs on manager) - Wire rvLookup from explorer's resourceVersions map into manager and containers - Populate effectRVs from ground truth state in PrepareEffectContext Stamping: - Stamp RV onto cache frame objects in doReconcile (via rvStamper on strategy) - Controllers now carry correct RVs through read-modify-write cycles Conflict checking in validateEffect: - Update (PUT): check obj.GetResourceVersion() against current state RV - Patch with OptimisticLock: check preconditions.ResourceVersion against state RV - Patch without OptimisticLock: no check (matches real K8s merge-patch behavior) - APPLY: no check (SSA uses field-manager conflicts, not RV) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When controller-runtime's MergeFromWithOptimisticLock is used, the patch payload embeds the base object's resourceVersion. The API server checks this RV and returns 409 Conflict if it's stale. Uses unsafe.Pointer to read the unexported opts.OptimisticLock and from fields on controller-runtime's mergeFromPatch struct. This avoids calling patch.Data() which has prohibitive performance cost (DeepCopy + double JSON marshal per PATCH call). The extracted RV is passed as a precondition to validateEffect, which checks it against the ground truth RV. This enables conflict detection for FluxCD's patchStatusConditions (which uses OptimisticLock) under informer staleness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes to explore.go: 1. Seed resourceVersions for all initial state objects in Explore() so that RV conflict checking can detect stale writes from the first reconcile step onward. Without this, initial objects have no RV entry and the conflict check is skipped. 2. Remove the no-op skip for status subresource patches. For CRDs (which all kamera harnesses target), Kubernetes bumps resourceVersion even on no-op status patches (kubernetes/kubernetes#67541). Always recording the effect ensures RV advances on every status write, which is required for staleness to produce detectable RV mismatches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tection manager_rv_test.go: - Conflict on stale UPDATE (claimed RV != current RV -> 409) - Success on current UPDATE (RVs match -> OK) - Multi-write within same reconcile (all checked against same baseline) - No-RV write succeeds (empty RV skips check) - APPLY skips RV check (SSA semantics) - RV stamping on resolved objects extract_rv_test.go: - Reflection field inspection on mergeFromPatch struct - OptimisticLock RV extraction via unsafe.Pointer - Non-OptimisticLock patches skip extraction Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ness FluxCD source-controller has a bug where transient informer cache staleness (~3 seconds) causes patchStatusConditions to fail all 5 retries with 409 Conflict (due to MergeFromWithOptimisticLock). The error propagates into ComputeReconcileResult which skips setting observedGeneration. The object ends up with a valid artifact but observedGeneration: 0, which is a fixed point that does not self-heal. - BUG-FINDINGS.md: Add FluxCD (1 bug, P1) to the overview table - fluxcd-f1-bug-report.md: Detailed bug report with code links - fluxcd-traces/: Reproduction traces and scenario config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
59564ad to
4131a03
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements resourceVersion (RV) semantics in kamera's simulated API server to model Kubernetes optimistic concurrency control. This enables detection of bugs that only manifest when write conflicts occur under informer cache staleness.
Motivation: FluxCD's source-controller has a hypothesized bug (F1) where transient informer staleness causes
patchStatusConditions(which usesMergeFromWithOptimisticLock) to fail, leavingobservedGenerationstuck at 0 despite a valid artifact. Without RV conflict checking, kamera couldn't reproduce this.Changes
validateEffectrejects writes with stale RVs (Update and OptimisticLock merge-patches)mergeFromPatchviaunsafe.PointerreflectionresourceVersionsfor initial state objects so staleness produces detectable mismatchesFluxCD F1 Bug (reproduced)
With these changes, kamera reproduces the FluxCD F1 bug:
observedGenerationstuck at 0 under ~3s of informer cache staleness. Seeexamples/fluxcd-f1-bug-report.mdfor the full writeup with code links.Test plan
manager_rv_test.go)extract_rv_test.go)examples/fluxcd-traces/)🤖 Generated with Claude Code