
staleness perturbation strategy improvements#74

Open
tgoodwin wants to merge 108 commits into main from staleness-fixing

Conversation

@tgoodwin
Owner

  • check in crossplane bug report accumulator doc
  • add StaleReadInfo type and wire staleness info through dump serialization
  • add ResourceVersion tracking and fix stateEvent hash consistency
  • add staleness perturbation planner and wire into auto-closed-loop
  • add regression tests for stateEvent hash consistency invariant
  • move resourceVersions seeding into buildStartStateFromObjects
  • add deterministic cloud provider for reproducible karpenter node names
  • replace karpenter lifecycle/consistency controllers with lightweight launcher
  • update karpenter example README for lifecycle controller removal

tgoodwin and others added 30 commits March 10, 2026 10:55
…tion

Add StaleReadInfo to ReconcileResult for tracking per-object ResourceVersion
gaps when a controller observes stale state. Wire the field through
DumpReconcileResult and inspector_dump.go serialization/deserialization so
staleness data flows end-to-end into JSONL output.
Add resourceVersions map to Explorer, tracking the global RV for each
VersionHash. Seed from initial state events and populate incrementally in
applyEffects on each write.

Fix root cause of staleness info producing observedRV=0: stateEvents were
recording the original pre-modification effect hash, while Contents.contents
stored the post-Generation-bump hash. Now create a recordedEffect copy with
the post-modification hash so ObserveAt replay produces hashes that match
Contents and can be looked up in resourceVersions.

Also add computeStalenessInfo which compares stale vs current objects per
reconcile step, and refine stale view generation to only count restarts when
actual stale alternatives exist.
Add observedReadKindsPerController to extract per-controller read kinds from
reference trace observations. Add buildStalenessPerturbationPlans which
generates Monte Carlo staleness phases targeting all observed read kinds with
lookback=2 and maxRestarts=1, producing 5 trials per phase.

Wire staleness plans into the default planFn in runScenario so they run as
a separate phase after ordering perturbation, keeping the two orthogonal.
Test that after applyEffects processes write effects with generation bumps
(which re-publish objects with new hashes), the stateEvent records the
post-modification hash matching Contents.contents — not the original
pre-bump hash. Covers CREATE, UPDATE, APPLY, and multi-effect paths.

This invariant is critical for ObserveAt replay correctness: if stateEvent
hashes diverge from Contents hashes, staleness simulation produces lookups
that miss in resourceVersions, silently yielding observedRV=0.
Register RVs at the same site where stateEvents are created, rather than
in a separate seeding loop in Explore(). This makes it structurally
impossible to construct a start state without populating resourceVersions.

ExplorerBuilder now owns the map, initialized in NewExplorerBuilder,
transferred to Explorer in Build(), and freshly allocated in Fork().
TraceChecker passes nil since it doesn't use staleness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fake CloudProvider uses test.RandomProviderID() seeded with
time.Now().UnixNano(), producing nondeterministic ProviderIDs across
runs. Wrap it with deterministicCloudProvider that derives ProviderIDs
from the NodeClaim name (which is deterministic via nameGeneratingClient).
Add Reset() to nameGeneratingClient so the counter can be zeroed between
Monte Carlo trials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…launcher

The full lifecycle controller caused convergence failures because:
- Registration uses MatchingFields queries unsupported by the replay client
- Liveness generates RequeueAfter creating unbounded exploration depth
- Shared mutable state (nodepoolhealth, go-cache) leaked across trials

Replace with nodeClaimLauncher that only handles the launch stage
(cp.Create + populate NodeClaim details). Update nodeRegistrar to create
Nodes that are immediately ready with labels, allocatable/capacity, and
NodeReady condition. Add OnFork callback to reset shared in-memory state
(cluster, cloud provider, name counter) between Monte Carlo trials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reflect the switch from nodeclaim.lifecycle/consistency to the
lightweight nodeClaimLauncher, the deterministic cloud provider, Monte
Carlo trial reset via OnFork, and document known limitations around
cluster sync and convergence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workflow JSONs specify ownerReferences with empty UIDs since the real
UID isn't known at authoring time. After EnsureDeterministicIdentity
assigns stable UIDs to all objects, a second pass now patches
ownerReference UIDs to match their owners in the same object set.
In real Kubernetes, a reconciler returning an error causes
controller-runtime to requeue with exponential backoff (5ms to ~16min).
The previous behavior of abandoning the exploration branch on error was
overly pessimistic and prevented distinguishing transient errors
(recoverable once other controllers advance state) from permanent ones.

Now reconciler errors are treated as no-ops and the reconciler is
re-enqueued, matching real controller-runtime semantics.
When the workflow JSON specifies targeted staleReads/staleLookback
tuning, use it directly in the rerun phase instead of auto-deriving
from observed reference trace reads. Fall back to auto-derived
staleness only when no explicit config is provided.

Also fix disablePerturbations to preserve UserActionReadyDepths
(scheduling metadata, not a perturbation) and ensure the map is
initialized to prevent nil map panics in downstream callers.
Replace per-step combinatorial branching with declarative staleness
intervals. A StalenessInterval defines a window [StaleAt, CatchUpAt)
during which a reconciler's view of a specific resource kind lags
behind actual cluster state, either frozen at StaleAt (Lag=-1) or
trailing by a fixed offset.

Key changes:
- Add StalenessInterval type to PerturbationConfig
- evaluateStalenessIntervals on StateNode computes stale KindSequences
- ObserveAs prioritizes interval evaluation over stuckReconcilerPositions
- getPossibleViewsForReconcile skips branching when intervals configured
- materializeNextState computes StalenessInfo from intervals
- determineNewPendingReconciles filters triggers for stale kinds
- Parse stalenessIntervals from workflow JSON via coverage.InputTuning
Two crossplane workflow inputs exercising the new stalenessIntervals config:

- composition-update-races-xr-fetch: CompositionReconciler sees frozen
  CompositionRevision during [6,10) window while CompositionRevisionReconciler
  updates it. Verified end-to-end with stalenessInfo in dump output.

- function-capability-removed: CompositionRevisionReconciler sees frozen
  FunctionRevision during [1,4) window after capability removal.
Five tests using FooController + BarController (real reconcilers with
mode-driven state machines) to exercise staleness intervals end-to-end:

- Baseline: 2 converged states (A-Final, B-Final) without intervals
- FrozenReconcilerPreventsModeFlip: freezing BarController on Foo
  eliminates B-Final path (BarController never sees A-1 to flip mode)
- StalenessInfoPopulated: verifies ReconcileResult carries per-object
  staleness details with correct lag/observedRV/currentRV
- TriggerFilteringReducesBarActivity: stale reconciler runs fewer
  steps because trigger filtering suppresses re-triggering
- WindowOutsideRange: inactive window [50,100) matches baseline,
  confirming intervals have no effect outside their sequence range
…stively

Replace the old StalenessConfig/Monte Carlo approach with StalenessInterval-based
plans. For each observed (controller, kind) pair, exhaustively enumerate all
[staleAt, catchUpAt) windows over [0, maxSeq] (a naive strategy that can be refined later).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move applyInputTuning logic into a shared pkg/explore.ApplyInputTuning function
that covers all perturbation fields: MaxDepth, PermuteControllers, StaleReads,
UserActionReadyDepths, StalenessIntervals, and Search/MonteCarlo. Remove the
duplicated local implementations from all five example harnesses.
…phase

Introduce a Phase 6 that guides agents to form targeted perturbation experiments
from unverified hypotheses. Covers the hypothesis→perturbation decision tree,
how to read KindSequences from the trace to configure staleness intervals, variant
file conventions, re-run commands, and per-hypothesis outcome tracking.
tgoodwin and others added 27 commits March 16, 2026 22:14
- replayDynamicClient: bridges client.Client → dynamic.Interface
- replayClientSet: implements kroclient.SetInterface for replay mode
- replayCRDClient: CRD Ensure/Delete/Get via replay client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace all hand-rolled controller simulations with real KRO Instance
Controller and ResourceGraphDefinitionReconciler. Both controllers run
real reconciliation logic through replay client adapters.

- Instance controller: real SSA Apply on child resources, requeue handling
- RGD controller: real graph building, CRD generation, status updates
- Graph built from RGD spec using core schema resolver (no API server)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ution

Both real KRO controllers execute through the kamera harness:
- RGD controller reconciles (CRD creation, status update)
- Instance controller runs (hits reconcile error, needs debugging)
- Graph built using schemaless parsing (nil schema resolver)

Uses typed RGD via generator package for correct API structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Instance Controller uses SSA Apply with partial patches (e.g., only
metadata fields). After Apply, the controller expects the full merged
object back. The dynamic adapter now re-reads via Get after Patch to
return the complete state.

Remaining issue: Instance Controller's SSA Apply for child resources
(Deployment, Service, Ingress) creates them, but the replay client's
applyEffects may not be properly upserting new objects. The controller
reports "waiting for unresolved resource deployment".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing 'image' field to Application instance fixture (CEL
  expressions reference schema.spec.image which triggers "no such key"
  → ErrDataPending without the default value)
- Apply re-reads full object after Patch to return merged state

Both real KRO controllers now execute end-to-end: RGD creates CRD,
Instance Controller creates Deployment/Service/Ingress via SSA Apply,
Deployment controller creates ReplicaSet/Pods, status updates propagate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an APPLY effect doesn't change the object's spec, skip the effect
entirely. This prevents controllers that unconditionally re-apply the
same desired state from creating infinite reconcile loops.

Also increased max depth to 60 for KRO harness (real controllers
produce more cascading state changes than simulations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- K1 scenario tests ordering perturbation between RGD and Instance
  controllers with ingress enabled
- When JSON inputs have no environmentState, fall back to default
  RGD state (avoids needing full typed RGD in JSON)
- Results: 4 converged states, all same hash — no ordering-dependent
  divergence detected

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents:
- Controller surface map (both real KRO controllers)
- Surgical KRO changes (5 commits, all non-behavioral)
- Harness architecture diagram
- K1 ordering scenario results (no divergence)
- Known limitations and Approach B fallback path
- SSA Apply idempotency fix
- Area coverage assessment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K2: Fault injection — Instance Controller crash after 2nd/3rd write
    (partial child resource apply, tests SSA idempotency on retry)
K3: Staleness — Instance Controller reads stale Application during
    replicas scale-up (1→3), tests intermediate state correctness
K4: Ingress toggle — includeWhen condition changes with staleness,
    tests conditional resource creation/deletion under cache lag
K5: Fault injection — RGD Controller crash after CRD create / after
    status update (tests CRD exists but status stale)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fault injection

Instance Controller crash mid-apply produces 12 distinct final states
across 628 runs. The applyset's parallel child Apply + parent metadata
patch creates a vulnerability window where partial applies combined
with ApplySet label state leads to inconsistent prune/conflict behavior.

K5 (RGD fault injection): no divergence — fully idempotent.
K3, K4: need staleness config tuning for permuted runs.

Evidence persisted in .agents/evidence/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…final state

State analysis reveals crash recovery never recreates full child set:
- 425 runs: only Ingress survives (applied before crash)
- 186 runs: no children at all
- 17 runs: varied Application content (rare states)

No final state includes Deployment or Service. The ApplySet metadata
mismatch after partial apply prevents correct re-creation on retry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even with triggerOnce (single crash, then clean recovery), the Instance
Controller fails to recreate all children. Across 2201 runs:
- 1439: only Ingress (no Deployment, no Service)
- 406: Deployment + Ingress (no Service!)
- 196: no children at all

Service is NEVER present in ANY final state. This is a confirmed crash
recovery bug in the Instance Controller's applyset implementation.

Updated ANALYSIS.md with exact source code references to the write
sequence and vulnerability windows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K6: Tests Application deletion with ordering perturbation:
  - Delete with ingress enabled
  - Delete without ingress
  - Fault injection during deletion (crash after 1st delete write)

K7: Tests rapid back-to-back spec changes:
  - Triple scale: replicas 1→3→5
  - Combined: toggle ingress + scale + re-toggle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… CRD deletion

K6 fault injection during deletion: 6 distinct final states across 304 runs:
- 26 runs: Application survives deletion (finalizer not removed)
- 2 runs: complete deletion failure (all 9 objects survive)
- 7 runs: CRD deleted despite allowCRDDeletion=false

K7 rapid spec changes: no divergence (reference only).
K3/K4 staleness: updated to staleReads format, no divergence found.

Updated ANALYSIS.md with full area coverage assessment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
High-level rollup of all confirmed bugs across projects:
- Karpenter: 12 bugs (7×P1, 4×P2, 1×P3)
- Crossplane: 9 findings (cycling blocks full convergence analysis)
- Kratix: 6 bugs (5×P2, 1×P3)
- KRO: 0 bugs (analyzed, clean)
- Total: 27 confirmed findings across 4 projects

Each project links to its detailed ANALYSIS.md for reproduction steps
and trace evidence.
…ionale

Each of the 5 kro changes now documented with exact commit hash, file
path, what changed, and why it's necessary. Includes note that changes
are required regardless of harness location (Go type system constraints).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dle unstructured SSA Apply

Three simulation fidelity fixes discovered while building the CAPI harness:

1. Errored reconcile effects discarded: when a controller returned an error
   mid-reconcile, all API writes that had already succeeded were thrown away.
   In real K8s these writes are durable. Now doReconcile always retrieves
   effects and reconcileAtState passes them through alongside the error.

2. applyConfigToUnstructured only extracted name/ns/kind/apiVersion from
   typed apply configs. CAPI uses unstructuredApplyConfiguration (wrapping
   *Unstructured) for SSA — added fallback via DeepCopyObject to preserve
   the full object including spec/status.

3. Multi-write version resolution used changes[effect.Key] (last-writer-wins
   map) instead of effect.Version for each individual effect. Also refined
   mergeStatusSubresourceObject to distinguish PATCH (preserve status on
   missing) from UPDATE (clear status on missing) semantics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K2b: Instance Controller crash mid-apply — Service never created (P1)
K6: Instance Controller crash during deletion — orphaned children (P1)

Total confirmed bugs across all projects: 29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	pkg/tracecheck/explore.go
…ntics

Real K8s API server (v1.21+) detects SSA Apply/Patch operations that produce
no field changes and skips the etcd write — no resourceVersion bump, no watch
event. Kamera was recording every Apply/Patch as an effect regardless of content
change, causing infinite reconcile cycles with controllers that unconditionally
re-apply desired state (e.g., CAPI MachineSet syncMachines).

Added no-op detection at three points in applyEffects:
- APPLY: compare spec + metadata of new vs existing object
- PATCH status subresource: compare merged hash vs existing hash
- PATCH non-status: compare effect hash vs existing state hash

Also adds metadataEqual helper that compares labels, annotations, ownerRefs,
and finalizers while ignoring server-managed fields.

Note: hash-based comparison for non-status PATCH is not yet catching all
no-op cases — the replay client may produce different hashes for logically
identical objects due to field ordering or empty-vs-nil differences. Further
investigation needed.
…enarios (3 bugs found)

Wire ClaimReconciler into harness with deterministic name generator and
no-op ConnectionPropagator. Upgrade Crossplane v2.1.0 → v2.2.0 for
controller-runtime v0.23.0 compatibility.

Add configurable stubFunctionRunner supporting 4 behaviors (default,
fatal, different-resources, partial-readiness) keyed by function name
in the Composition pipeline.

New findings:
- F6 (P2): Function switch to fatal orphans composed resources + stale
  Ready=True — 7 distinct terminal states across 49 trials
- C2 (P2): Claim deletion orphans XR + composed resources — 98% orphan
  rate across 98 trials
- C4 (P2): Two XRs silently steal ownership of same composed resource —
  ordering-dependent ownerReference on shared ConfigMap

Clean scenarios: F7 (resource switch GC works), F8 (flap recovery),
C1 (claim-XR binding), C3 (claim crash recovery), C5 (manual policy).

Also corrects Crossplane bug count in BUG-FINDINGS.md from 9 to 7
(re-evaluated against ANALYSIS.md evidence) and adds KCP + Cluster API
sections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the TODO block with a proper methodology section explaining
how we found 7 bugs despite all scenarios cycling (F2). Three
workarounds: Monte Carlo aborted-state hash comparison, fixed-depth
event injection, and leveraging divergence on non-cycling objects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…projects

KCP additions:
- KCP17: Pure ordering bug from apibinding reconciler (P2)
- KCP18b: Crash recovery divergence from fault injection (P2)

Three distinct root causes identified:
1. Endpoint condition chain race (no sync beyond watches)
2. Concurrent condition write conflict (multiple LC writers)
3. Crash recovery divergence (multi-write reconciler crash)

9 controllers wired, 20 scenarios, 3 regions explored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove examples/kro/kro from tracking and add it to .gitignore.
These compiled binaries should be built locally, not checked in.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tgoodwin tgoodwin changed the title staleness perturbation strategy staleness perturbation strategy improvements Mar 26, 2026