
Add resourceVersion conflict checking for optimistic concurrency #76

Open

tgoodwin wants to merge 24 commits into staleness-fixing from rv-conflict-checking

Conversation

@tgoodwin
Owner

Summary

Implements resourceVersion (RV) semantics in kamera's simulated API server to model Kubernetes optimistic concurrency control. This enables detection of bugs that only manifest when write conflicts occur under informer cache staleness.

Motivation: FluxCD's source-controller has a hypothesized bug (F1) where transient informer staleness causes patchStatusConditions (which uses MergeFromWithOptimisticLock) to fail, leaving observedGeneration stuck at 0 despite a valid artifact. Without RV conflict checking, kamera couldn't reproduce this.

Changes

  • RV tracking infrastructure — Track integer RVs per resource key per reconcile frame, populated from ground truth state
  • RV stamping on cache frames — Controllers now carry correct RVs through read-modify-write cycles
  • RV conflict checking — validateEffect rejects writes with stale RVs (Update and OptimisticLock merge-patches)
  • OptimisticLock detection — Extract base RV from controller-runtime's mergeFromPatch via unsafe.Pointer reflection
  • CRD no-op removal — Always bump RV on status patches (matches real CRD behavior: since Kubernetes 1.11, custom object resourceVersion changes even on no-op status patches; see kubernetes/kubernetes#67541)
  • Initial RV seeding — Seed resourceVersions for initial state objects so staleness produces detectable mismatches
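
As a sketch of the conflict rules above (the type and field names here are illustrative, not kamera's actual API), the check reduces to: UPDATEs and OptimisticLock merge-patches must present the current RV, while plain merge-patches and SSA APPLYs are exempt:

```go
package main

import "fmt"

// Effect is a hypothetical, simplified write effect: the operation type,
// the RV the writer claims to have observed, and whether the patch carried
// an optimistic-lock precondition. Field names are illustrative.
type Effect struct {
	Op             string // "UPDATE", "PATCH", or "APPLY"
	ClaimedRV      int64  // RV stamped on the object or patch precondition; 0 = none
	OptimisticLock bool   // true for MergeFromWithOptimisticLock patches
}

// validateEffect mirrors the rules described in the PR: UPDATEs and
// OptimisticLock patches must match the current RV; plain merge-patches
// and SSA APPLYs skip the check, as does a write carrying no RV at all.
func validateEffect(e Effect, currentRV int64) error {
	checked := e.Op == "UPDATE" || (e.Op == "PATCH" && e.OptimisticLock)
	if !checked || e.ClaimedRV == 0 {
		return nil // merge-patch without lock, APPLY, or RV-less write
	}
	if e.ClaimedRV != currentRV {
		return fmt.Errorf("409 Conflict: claimed RV %d != current RV %d",
			e.ClaimedRV, currentRV)
	}
	return nil
}

func main() {
	// A controller whose cache is stale claims RV 3, but ground truth is at 5.
	fmt.Println(validateEffect(Effect{Op: "UPDATE", ClaimedRV: 3}, 5)) // conflict
	fmt.Println(validateEffect(Effect{Op: "APPLY", ClaimedRV: 3}, 5))  // SSA: no RV check
}
```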

FluxCD F1 Bug (reproduced)

With these changes, kamera reproduces the FluxCD F1 bug: observedGeneration stuck at 0 under ~3s of informer cache staleness. See examples/fluxcd-f1-bug-report.md for the full writeup with code links.

Test plan

  • 6 unit tests for RV conflict checking (manager_rv_test.go)
  • 3 unit tests for OptimisticLock detection (extract_rv_test.go)
  • All existing tests pass
  • FluxCD F1 reproduced (traces in examples/fluxcd-traces/)

🤖 Generated with Claude Code

tgoodwin and others added 24 commits March 11, 2026 20:01
…stively

Replace old StalenessConfig/Monte Carlo approach with StalenessInterval-based
plans. For each observed (controller, kind) pair, exhaustively enumerate all
[staleAt, catchUpAt) windows over [0, maxSeq] (a naive strategy that can be refined later).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move applyInputTuning logic into a shared pkg/explore.ApplyInputTuning function
that covers all perturbation fields: MaxDepth, PermuteControllers, StaleReads,
UserActionReadyDepths, StalenessIntervals, and Search/MonteCarlo. remove the
duplicated local implementations from all five example harnesses.
…phase

introduces a Phase 6 that guides agents to form targeted perturbation experiments
from unverified hypotheses. covers the hypothesis->perturbation decision tree,
how to read KindSequences from the trace to configure staleness intervals, variant
file conventions, re-run commands, and per-hypothesis outcome tracking.
The rollup functions panicked with "unknown op type" when replaying
event sequences containing APPLY operations (Kubernetes server-side
apply). This blocked stalenessIntervals from activating in any scenario
using SSA (e.g. Crossplane's CompositeReconciler).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eted applications of the order permutation perturbation strategy
…lenessIntervals

Adds --no-perturbations flag for a clean reference run from an inputs file
that already has permuteControllers or stalenessIntervals configured,
without needing to edit the JSON.

Also fixes disablePerturbations() to clear StalenessIntervals (the new
field); previously only the legacy Staleness map was cleared, meaning
closed-loop reference phases were silently applying staleness intervals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rity

- Add Run Modes section documenting the three flag-controlled execution modes
  (reference, exploration, closed-loop) and the agent workflow guidance
- Update hypothesis classification table to label permuteControllers as
  required base and permuteAfterEvent/permuteDepthRange as optional scoping
  modifiers; add callout explaining AND-logic composition
- Expand JSON section with structured explanation of required vs optional
  permutation fields; document permuteForSteps as planned future mechanism
- Update Step 4 run commands to show --perturb=false and --no-perturbations flags

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old name was misleading: --perturb=false didn't mean "no perturbations",
it meant "single exploration pass with config as-is". The new --closed-loop
name accurately reflects that the flag controls whether the automated
reference+rerun pipeline runs.

- flags.go: rename perturbFlag -> closedLoopFlag, PerturbEnabled() -> ClosedLoopEnabled()
- Update all call sites in parallel_runner.go, flags_test.go, parallel_runner_test.go
- Update examples/knative-serving main.go, README.md, AGENTS.md
- Update scenario-exploration skill

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Reframe the skill intro as a full 5-step agent loop rather than
  "analyze a run that was handed to you"
- Prerequisites now instructs agents to start with a clean reference run
  (--closed-loop=false --no-perturbations) before analyzing, and explains
  why this is the right entry point
- Phase 6 context replaces the dry planFn description with a clear framing:
  the agent IS the perturbation planner, choosing targeted experiments based
  on what the trace revealed rather than the exhaustive-but-generic automated pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four targeted fixes:
- Cycling guidance: "cycling does not mean no findings" — add partial-trace
  analysis steps and jq command for finding error steps in cycling runs
- Phase 2: note that reference runs produce a single state so diff is a no-op;
  direct agent to Phase 6 for ordering divergence setup
- Phase 3: add upfront note that ordering analysis requires an exploration-mode
  run with permuteControllers; reference-only traces will find nothing here
- Phase 3 "reasons an ordering wasn't explored": remove stale closed-loop item,
  replace with agent-relevant reasons (not in permuteControllers, permuteAfterEvent
  suppressed trigger, etc.)
- Quick Reference: add cycling/partial-trace jq commands (error steps, effect
  type counts, KindSequence writes, cross-state error comparison)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…currency

Implements resourceVersion (RV) semantics in kamera's simulated API server to
model Kubernetes optimistic concurrency control. This enables detection of bugs
that only manifest when write conflicts occur under informer cache staleness.

Infrastructure:
- Track integer RVs per resource key per reconcile frame (effectRVs on manager)
- Wire rvLookup from explorer's resourceVersions map into manager and containers
- Populate effectRVs from ground truth state in PrepareEffectContext

Stamping:
- Stamp RV onto cache frame objects in doReconcile (via rvStamper on strategy)
- Controllers now carry correct RVs through read-modify-write cycles

Conflict checking in validateEffect:
- Update (PUT): check obj.GetResourceVersion() against current state RV
- Patch with OptimisticLock: check preconditions.ResourceVersion against state RV
- Patch without OptimisticLock: no check (matches real K8s merge-patch behavior)
- APPLY: no check (SSA uses field-manager conflicts, not RV)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When controller-runtime's MergeFromWithOptimisticLock is used, the patch
payload embeds the base object's resourceVersion. The API server checks this
RV and returns 409 Conflict if it's stale.
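
For illustration (a hand-rolled sketch, not controller-runtime's actual implementation), the payload such a patch carries has the base object's resourceVersion embedded under metadata:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildOptimisticLockPatch builds a minimal merge-patch payload of the shape
// MergeFromWithOptimisticLock produces: the base object's resourceVersion is
// embedded so the API server can return 409 Conflict if it is stale.
// This is a hypothetical helper for illustration only.
func buildOptimisticLockPatch(baseRV string, observedGeneration int64) []byte {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"resourceVersion": baseRV, // precondition: reject if stale
		},
		"status": map[string]interface{}{
			"observedGeneration": observedGeneration,
		},
	}
	b, _ := json.Marshal(patch)
	return b
}

func main() {
	fmt.Println(string(buildOptimisticLockPatch("42", 3)))
}
```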

Uses unsafe.Pointer to read the unexported opts.OptimisticLock and from
fields on controller-runtime's mergeFromPatch struct. This avoids calling
patch.Data() which has prohibitive performance cost (DeepCopy + double JSON
marshal per PATCH call).

The extracted RV is passed as a precondition to validateEffect, which checks
it against the ground truth RV. This enables conflict detection for FluxCD's
patchStatusConditions (which uses OptimisticLock) under informer staleness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
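The unexported-field read can be sketched on a stand-in struct (controller-runtime's real mergeFromPatch has different fields; this only demonstrates the reflect + unsafe.Pointer technique):

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// mergeFromPatch is a stand-in with unexported fields, mimicking the shape of
// controller-runtime's private patch struct for demonstration purposes.
type mergeFromPatch struct {
	optimisticLock bool
	from           metaObj
}

type metaObj struct{ resourceVersion string }

// extractBaseRV reads the unexported fields through reflect + unsafe.Pointer,
// avoiding any method call that would deep-copy and marshal the object.
// Returns the base RV and whether an OptimisticLock precondition exists.
func extractBaseRV(p interface{}) (string, bool) {
	v := reflect.ValueOf(p).Elem()

	lockField := v.FieldByName("optimisticLock")
	lock := *(*bool)(unsafe.Pointer(lockField.UnsafeAddr()))
	if !lock {
		return "", false // no OptimisticLock: no RV precondition to check
	}

	fromField := v.FieldByName("from")
	from := (*metaObj)(unsafe.Pointer(fromField.UnsafeAddr()))
	return from.resourceVersion, true
}

func main() {
	p := &mergeFromPatch{optimisticLock: true, from: metaObj{resourceVersion: "42"}}
	rv, ok := extractBaseRV(p)
	fmt.Println(rv, ok)
}
```

The same caveat as any unsafe reflection applies: this silently breaks if the upstream struct layout changes, which is why the PR pins it down with dedicated tests.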
Two changes to explore.go:

1. Seed resourceVersions for all initial state objects in Explore() so that
   RV conflict checking can detect stale writes from the first reconcile
   step onward. Without this, initial objects have no RV entry and the
   conflict check is skipped.

2. Remove the no-op skip for status subresource patches. For CRDs (which all
   kamera harnesses target), Kubernetes bumps resourceVersion even on no-op
   status patches (kubernetes/kubernetes#67541). Always recording the effect
   ensures RV advances on every status write, which is required for staleness
   to produce detectable RV mismatches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
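A minimal sketch of both changes, assuming a hypothetical resourceVersions map keyed by object (names are illustrative, not explore.go's actual code):

```go
package main

import "fmt"

// rvState stands in for the explorer's resourceVersions map: every initial
// object gets a seed entry, and every status patch bumps the RV, even no-ops,
// matching CRD behavior per kubernetes/kubernetes#67541.
type rvState map[string]int64

// seed gives each initial-state object an RV entry; without one, the
// conflict check would be skipped for writes against that object.
func seed(keys []string) rvState {
	s := rvState{}
	for _, k := range keys {
		s[k] = 1
	}
	return s
}

// recordStatusPatch always advances the RV, even for a no-op patch, so a
// staleness window produces a detectable RV mismatch on later writes.
func (s rvState) recordStatusPatch(key string) int64 {
	s[key]++
	return s[key]
}

func main() {
	s := seed([]string{"gitrepository/default/app"})
	s.recordStatusPatch("gitrepository/default/app") // no-op patch still bumps
	fmt.Println(s["gitrepository/default/app"])
}
```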
…tection

manager_rv_test.go:
- Conflict on stale UPDATE (claimed RV != current RV -> 409)
- Success on current UPDATE (RVs match -> OK)
- Multi-write within same reconcile (all checked against same baseline)
- No-RV write succeeds (empty RV skips check)
- APPLY skips RV check (SSA semantics)
- RV stamping on resolved objects

extract_rv_test.go:
- Reflection field inspection on mergeFromPatch struct
- OptimisticLock RV extraction via unsafe.Pointer
- Non-OptimisticLock patches skip extraction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ness

FluxCD source-controller has a bug where transient informer cache staleness
(~3 seconds) causes patchStatusConditions to fail all 5 retries with 409
Conflict (due to MergeFromWithOptimisticLock). The error propagates into
ComputeReconcileResult which skips setting observedGeneration. The object
ends up with a valid artifact but observedGeneration: 0, which is a fixed
point that does not self-heal.

- BUG-FINDINGS.md: Add FluxCD (1 bug, P1) to the overview table
- fluxcd-f1-bug-report.md: Detailed bug report with code links
- fluxcd-traces/: Reproduction traces and scenario config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
