
Add resourceVersion conflict checking for optimistic concurrency #76

Open

tgoodwin wants to merge 24 commits into staleness-fixing from rv-conflict-checking

Conversation

@tgoodwin
Owner

Summary

Implements resourceVersion (RV) semantics in kamera's simulated API server to model Kubernetes optimistic concurrency control. This enables detection of bugs that only manifest when write conflicts occur under informer cache staleness.

Motivation: FluxCD's source-controller has a hypothesized bug (F1) where transient informer staleness causes patchStatusConditions (which uses MergeFromWithOptimisticLock) to fail, leaving observedGeneration stuck at 0 despite a valid artifact. Without RV conflict checking, kamera couldn't reproduce this.

Changes

  • RV tracking infrastructure — Track integer RVs per resource key per reconcile frame, populated from ground truth state
  • RV stamping on cache frames — Controllers now carry correct RVs through read-modify-write cycles
  • RV conflict checking — validateEffect rejects writes with stale RVs (Update and OptimisticLock merge-patches)
  • OptimisticLock detection — Extract base RV from controller-runtime's mergeFromPatch via unsafe.Pointer reflection
  • CRD no-op removal — Always bump RV on status patches (matches real CRD behavior: since Kubernetes 1.11, custom object resourceVersion changes even on no-op status patches; see kubernetes/kubernetes#67541)
  • Initial RV seeding — Seed resourceVersions for initial state objects so staleness produces detectable mismatches
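
As a sketch of the conflict rules above (the type and field names here are illustrative, not kamera's actual API), the check reduces to: UPDATEs and OptimisticLock merge-patches must present the current RV, while plain merge-patches and SSA APPLYs are exempt:

```go
package main

import "fmt"

// Effect is a hypothetical, simplified write effect: the operation type,
// the RV the writer claims to have observed, and whether the patch carried
// an optimistic-lock precondition. Field names are illustrative.
type Effect struct {
	Op             string // "UPDATE", "PATCH", or "APPLY"
	ClaimedRV      int64  // RV stamped on the object or patch precondition; 0 = none
	OptimisticLock bool   // true for MergeFromWithOptimisticLock patches
}

// validateEffect mirrors the rules described in the PR: UPDATEs and
// OptimisticLock patches must match the current RV; plain merge-patches
// and SSA APPLYs skip the check, as does a write carrying no RV at all.
func validateEffect(e Effect, currentRV int64) error {
	checked := e.Op == "UPDATE" || (e.Op == "PATCH" && e.OptimisticLock)
	if !checked || e.ClaimedRV == 0 {
		return nil // merge-patch without lock, APPLY, or RV-less write
	}
	if e.ClaimedRV != currentRV {
		return fmt.Errorf("409 Conflict: claimed RV %d != current RV %d",
			e.ClaimedRV, currentRV)
	}
	return nil
}

func main() {
	// A controller whose cache is stale claims RV 3, but ground truth is at 5.
	fmt.Println(validateEffect(Effect{Op: "UPDATE", ClaimedRV: 3}, 5)) // conflict
	fmt.Println(validateEffect(Effect{Op: "APPLY", ClaimedRV: 3}, 5))  // SSA: no RV check
}
```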

FluxCD F1 Bug (reproduced)

With these changes, kamera reproduces the FluxCD F1 bug: observedGeneration stuck at 0 under ~3s of informer cache staleness. See examples/fluxcd-f1-bug-report.md for the full writeup with code links.

Test plan

  • 6 unit tests for RV conflict checking (manager_rv_test.go)
  • 3 unit tests for OptimisticLock detection (extract_rv_test.go)
  • All existing tests pass
  • FluxCD F1 reproduced (traces in examples/fluxcd-traces/)

🤖 Generated with Claude Code

tgoodwin and others added 24 commits March 11, 2026 20:01
…stively

Replace old StalenessConfig/Monte Carlo approach with StalenessInterval-based
plans. For each observed (controller, kind) pair, exhaustively enumerate all
[staleAt, catchUpAt) windows over [0, maxSeq] (a naive strategy that can be refined later).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move applyInputTuning logic into a shared pkg/explore.ApplyInputTuning function
that covers all perturbation fields: MaxDepth, PermuteControllers, StaleReads,
UserActionReadyDepths, StalenessIntervals, and Search/MonteCarlo. remove the
duplicated local implementations from all five example harnesses.
…phase

introduces a Phase 6 that guides agents to form targeted perturbation experiments
from unverified hypotheses. covers the hypothesis->perturbation decision tree,
how to read KindSequences from the trace to configure staleness intervals, variant
file conventions, re-run commands, and per-hypothesis outcome tracking.
The rollup functions panicked with "unknown op type" when replaying
event sequences containing APPLY operations (Kubernetes server-side
apply). This blocked stalenessIntervals from activating in any scenario
using SSA (e.g. Crossplane's CompositeReconciler).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eted applications of the order permutation perturbation strategy
…lenessIntervals

Adds --no-perturbations flag for a clean reference run from an inputs file
that already has permuteControllers or stalenessIntervals configured,
without needing to edit the JSON.

Also fixes disablePerturbations() to clear StalenessIntervals (the new
field); previously only the legacy Staleness map was cleared, meaning
closed-loop reference phases were silently applying staleness intervals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rity

- Add Run Modes section documenting the three flag-controlled execution modes
  (reference, exploration, closed-loop) and the agent workflow guidance
- Update hypothesis classification table to label permuteControllers as
  required base and permuteAfterEvent/permuteDepthRange as optional scoping
  modifiers; add callout explaining AND-logic composition
- Expand JSON section with structured explanation of required vs optional
  permutation fields; document permuteForSteps as planned future mechanism
- Update Step 4 run commands to show --perturb=false and --no-perturbations flags

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old name was misleading: --perturb=false didn't mean "no perturbations",
it meant "single exploration pass with config as-is". The new --closed-loop
name accurately reflects that the flag controls whether the automated
reference+rerun pipeline runs.

- flags.go: rename perturbFlag -> closedLoopFlag, PerturbEnabled() -> ClosedLoopEnabled()
- Update all call sites in parallel_runner.go, flags_test.go, parallel_runner_test.go
- Update examples/knative-serving main.go, README.md, AGENTS.md
- Update scenario-exploration skill

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Reframe the skill intro as a full 5-step agent loop rather than
  "analyze a run that was handed to you"
- Prerequisites now instructs agents to start with a clean reference run
  (--closed-loop=false --no-perturbations) before analyzing, and explains
  why this is the right entry point
- Phase 6 context replaces the dry planFn description with a clear framing:
  the agent IS the perturbation planner, choosing targeted experiments based
  on what the trace revealed rather than the exhaustive-but-generic automated pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four targeted fixes:
- Cycling guidance: "cycling does not mean no findings" — add partial-trace
  analysis steps and jq command for finding error steps in cycling runs
- Phase 2: note that reference runs produce a single state so diff is a no-op;
  direct agent to Phase 6 for ordering divergence setup
- Phase 3: add upfront note that ordering analysis requires an exploration-mode
  run with permuteControllers; reference-only traces will find nothing here
- Phase 3 "reasons an ordering wasn't explored": remove stale closed-loop item,
  replace with agent-relevant reasons (not in permuteControllers, permuteAfterEvent
  suppressed trigger, etc.)
- Quick Reference: add cycling/partial-trace jq commands (error steps, effect
  type counts, KindSequence writes, cross-state error comparison)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…currency

Implements resourceVersion (RV) semantics in kamera's simulated API server to
model Kubernetes optimistic concurrency control. This enables detection of bugs
that only manifest when write conflicts occur under informer cache staleness.

Infrastructure:
- Track integer RVs per resource key per reconcile frame (effectRVs on manager)
- Wire rvLookup from explorer's resourceVersions map into manager and containers
- Populate effectRVs from ground truth state in PrepareEffectContext

Stamping:
- Stamp RV onto cache frame objects in doReconcile (via rvStamper on strategy)
- Controllers now carry correct RVs through read-modify-write cycles

Conflict checking in validateEffect:
- Update (PUT): check obj.GetResourceVersion() against current state RV
- Patch with OptimisticLock: check preconditions.ResourceVersion against state RV
- Patch without OptimisticLock: no check (matches real K8s merge-patch behavior)
- APPLY: no check (SSA uses field-manager conflicts, not RV)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When controller-runtime's MergeFromWithOptimisticLock is used, the patch
payload embeds the base object's resourceVersion. The API server checks this
RV and returns 409 Conflict if it's stale.
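
For illustration (a hand-rolled sketch, not controller-runtime's actual implementation), the payload such a patch carries has the base object's resourceVersion embedded under metadata:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildOptimisticLockPatch builds a minimal merge-patch payload of the shape
// MergeFromWithOptimisticLock produces: the base object's resourceVersion is
// embedded so the API server can return 409 Conflict if it is stale.
// This is a hypothetical helper for illustration only.
func buildOptimisticLockPatch(baseRV string, observedGeneration int64) []byte {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"resourceVersion": baseRV, // precondition: reject if stale
		},
		"status": map[string]interface{}{
			"observedGeneration": observedGeneration,
		},
	}
	b, _ := json.Marshal(patch)
	return b
}

func main() {
	fmt.Println(string(buildOptimisticLockPatch("42", 3)))
}
```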

Uses unsafe.Pointer to read the unexported opts.OptimisticLock and from
fields on controller-runtime's mergeFromPatch struct. This avoids calling
patch.Data() which has prohibitive performance cost (DeepCopy + double JSON
marshal per PATCH call).

The extracted RV is passed as a precondition to validateEffect, which checks
it against the ground truth RV. This enables conflict detection for FluxCD's
patchStatusConditions (which uses OptimisticLock) under informer staleness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
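The unexported-field read can be sketched on a stand-in struct (controller-runtime's real mergeFromPatch has different fields; this only demonstrates the reflect + unsafe.Pointer technique):

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// mergeFromPatch is a stand-in with unexported fields, mimicking the shape of
// controller-runtime's private patch struct for demonstration purposes.
type mergeFromPatch struct {
	optimisticLock bool
	from           metaObj
}

type metaObj struct{ resourceVersion string }

// extractBaseRV reads the unexported fields through reflect + unsafe.Pointer,
// avoiding any method call that would deep-copy and marshal the object.
// Returns the base RV and whether an OptimisticLock precondition exists.
func extractBaseRV(p interface{}) (string, bool) {
	v := reflect.ValueOf(p).Elem()

	lockField := v.FieldByName("optimisticLock")
	lock := *(*bool)(unsafe.Pointer(lockField.UnsafeAddr()))
	if !lock {
		return "", false // no OptimisticLock: no RV precondition to check
	}

	fromField := v.FieldByName("from")
	from := (*metaObj)(unsafe.Pointer(fromField.UnsafeAddr()))
	return from.resourceVersion, true
}

func main() {
	p := &mergeFromPatch{optimisticLock: true, from: metaObj{resourceVersion: "42"}}
	rv, ok := extractBaseRV(p)
	fmt.Println(rv, ok)
}
```

The same caveat as any unsafe reflection applies: this silently breaks if the upstream struct layout changes, which is why the PR pins it down with dedicated tests.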
Two changes to explore.go:

1. Seed resourceVersions for all initial state objects in Explore() so that
   RV conflict checking can detect stale writes from the first reconcile
   step onward. Without this, initial objects have no RV entry and the
   conflict check is skipped.

2. Remove the no-op skip for status subresource patches. For CRDs (which all
   kamera harnesses target), Kubernetes bumps resourceVersion even on no-op
   status patches (kubernetes/kubernetes#67541). Always recording the effect
   ensures RV advances on every status write, which is required for staleness
   to produce detectable RV mismatches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
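A minimal sketch of both changes, assuming a hypothetical resourceVersions map keyed by object (names are illustrative, not explore.go's actual code):

```go
package main

import "fmt"

// rvState stands in for the explorer's resourceVersions map: every initial
// object gets a seed entry, and every status patch bumps the RV, even no-ops,
// matching CRD behavior per kubernetes/kubernetes#67541.
type rvState map[string]int64

// seed gives each initial-state object an RV entry; without one, the
// conflict check would be skipped for writes against that object.
func seed(keys []string) rvState {
	s := rvState{}
	for _, k := range keys {
		s[k] = 1
	}
	return s
}

// recordStatusPatch always advances the RV, even for a no-op patch, so a
// staleness window produces a detectable RV mismatch on later writes.
func (s rvState) recordStatusPatch(key string) int64 {
	s[key]++
	return s[key]
}

func main() {
	s := seed([]string{"gitrepository/default/app"})
	s.recordStatusPatch("gitrepository/default/app") // no-op patch still bumps
	fmt.Println(s["gitrepository/default/app"])
}
```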
…tection

manager_rv_test.go:
- Conflict on stale UPDATE (claimed RV != current RV -> 409)
- Success on current UPDATE (RVs match -> OK)
- Multi-write within same reconcile (all checked against same baseline)
- No-RV write succeeds (empty RV skips check)
- APPLY skips RV check (SSA semantics)
- RV stamping on resolved objects

extract_rv_test.go:
- Reflection field inspection on mergeFromPatch struct
- OptimisticLock RV extraction via unsafe.Pointer
- Non-OptimisticLock patches skip extraction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ness

FluxCD source-controller has a bug where transient informer cache staleness
(~3 seconds) causes patchStatusConditions to fail all 5 retries with 409
Conflict (due to MergeFromWithOptimisticLock). The error propagates into
ComputeReconcileResult which skips setting observedGeneration. The object
ends up with a valid artifact but observedGeneration: 0, which is a fixed
point that does not self-heal.

- BUG-FINDINGS.md: Add FluxCD (1 bug, P1) to the overview table
- fluxcd-f1-bug-report.md: Detailed bug report with code links
- fluxcd-traces/: Reproduction traces and scenario config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
