
staleness perturbation strategy improvements#74

Open
tgoodwin wants to merge 108 commits into main from staleness-fixing

Conversation

@tgoodwin
Owner

  • check in crossplane bug report accumulator doc
  • add StaleReadInfo type and wire staleness info through dump serialization
  • add ResourceVersion tracking and fix stateEvent hash consistency
  • add staleness perturbation planner and wire into auto-closed-loop
  • add regression tests for stateEvent hash consistency invariant
  • move resourceVersions seeding into buildStartStateFromObjects
  • add deterministic cloud provider for reproducible karpenter node names
  • replace karpenter lifecycle/consistency controllers with lightweight launcher
  • update karpenter example README for lifecycle controller removal

tgoodwin and others added 30 commits March 10, 2026 10:55
…tion

Add StaleReadInfo to ReconcileResult for tracking per-object ResourceVersion
gaps when a controller observes stale state. Wire the field through
DumpReconcileResult and inspector_dump.go serialization/deserialization so
staleness data flows end-to-end into JSONL output.
Add resourceVersions map to Explorer, tracking the global RV for each
VersionHash. Seed from initial state events and populate incrementally in
applyEffects on each write.

Fix root cause of staleness info producing observedRV=0: stateEvents were
recording the original pre-modification effect hash, while Contents.contents
stored the post-Generation-bump hash. Now create a recordedEffect copy with
the post-modification hash so ObserveAt replay produces hashes that match
Contents and can be looked up in resourceVersions.

Also add computeStalenessInfo which compares stale vs current objects per
reconcile step, and refine stale view generation to only count restarts when
actual stale alternatives exist.
Add observedReadKindsPerController to extract per-controller read kinds from
reference trace observations. Add buildStalenessPerturbationPlans which
generates Monte Carlo staleness phases targeting all observed read kinds with
lookback=2 and maxRestarts=1, producing 5 trials per phase.

Wire staleness plans into the default planFn in runScenario so they run as
a separate phase after ordering perturbation, keeping the two orthogonal.
Test that after applyEffects processes write effects with generation bumps
(which re-publish objects with new hashes), the stateEvent records the
post-modification hash matching Contents.contents — not the original
pre-bump hash. Covers CREATE, UPDATE, APPLY, and multi-effect paths.

This invariant is critical for ObserveAt replay correctness: if stateEvent
hashes diverge from Contents hashes, staleness simulation produces lookups
that miss in resourceVersions, silently yielding observedRV=0.
Register RVs at the same site where stateEvents are created, rather than
in a separate seeding loop in Explore(). This makes it structurally
impossible to construct a start state without populating resourceVersions.

ExplorerBuilder now owns the map, initialized in NewExplorerBuilder,
transferred to Explorer in Build(), and freshly allocated in Fork().
TraceChecker passes nil since it doesn't use staleness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fake CloudProvider uses test.RandomProviderID() seeded with
time.Now().UnixNano(), producing nondeterministic ProviderIDs across
runs. Wrap it with deterministicCloudProvider that derives ProviderIDs
from the NodeClaim name (which is deterministic via nameGeneratingClient).
Add Reset() to nameGeneratingClient so the counter can be zeroed between
Monte Carlo trials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…launcher

The full lifecycle controller caused convergence failures because:
- Registration uses MatchingFields queries unsupported by the replay client
- Liveness generates RequeueAfter creating unbounded exploration depth
- Shared mutable state (nodepoolhealth, go-cache) leaked across trials

Replace with nodeClaimLauncher that only handles the launch stage
(cp.Create + populate NodeClaim details). Update nodeRegistrar to create
Nodes that are immediately ready with labels, allocatable/capacity, and
NodeReady condition. Add OnFork callback to reset shared in-memory state
(cluster, cloud provider, name counter) between Monte Carlo trials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reflect the switch from nodeclaim.lifecycle/consistency to the
lightweight nodeClaimLauncher, the deterministic cloud provider, Monte
Carlo trial reset via OnFork, and document known limitations around
cluster sync and convergence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workflow JSONs specify ownerReferences with empty UIDs since the real
UID isn't known at authoring time. After EnsureDeterministicIdentity
assigns stable UIDs to all objects, a second pass now patches
ownerReference UIDs to match their owners in the same object set.
In real Kubernetes, a reconciler returning an error causes
controller-runtime to requeue with exponential backoff (5ms to ~16min).
The previous behavior of abandoning the exploration branch on error was
overly pessimistic and prevented distinguishing transient errors
(recoverable once other controllers advance state) from permanent ones.

Now reconciler errors are treated as no-ops and the reconciler is
re-enqueued, matching real controller-runtime semantics.
When the workflow JSON specifies targeted staleReads/staleLookback
tuning, use it directly in the rerun phase instead of auto-deriving
from observed reference trace reads. Fall back to auto-derived
staleness only when no explicit config is provided.

Also fix disablePerturbations to preserve UserActionReadyDepths
(scheduling metadata, not a perturbation) and ensure the map is
initialized to prevent nil map panics in downstream callers.
Replace per-step combinatorial branching with declarative staleness
intervals. A StalenessInterval defines a window [StaleAt, CatchUpAt)
during which a reconciler's view of a specific resource kind lags
behind actual cluster state, either frozen at StaleAt (Lag=-1) or
trailing by a fixed offset.

Key changes:
- Add StalenessInterval type to PerturbationConfig
- evaluateStalenessIntervals on StateNode computes stale KindSequences
- ObserveAs prioritizes interval evaluation over stuckReconcilerPositions
- getPossibleViewsForReconcile skips branching when intervals configured
- materializeNextState computes StalenessInfo from intervals
- determineNewPendingReconciles filters triggers for stale kinds
- Parse stalenessIntervals from workflow JSON via coverage.InputTuning
Two crossplane workflow inputs exercising the new stalenessIntervals config:

- composition-update-races-xr-fetch: CompositionReconciler sees frozen
  CompositionRevision during [6,10) window while CompositionRevisionReconciler
  updates it. Verified end-to-end with stalenessInfo in dump output.

- function-capability-removed: CompositionRevisionReconciler sees frozen
  FunctionRevision during [1,4) window after capability removal.
Five tests using FooController + BarController (real reconcilers with
mode-driven state machines) to exercise staleness intervals end-to-end:

- Baseline: 2 converged states (A-Final, B-Final) without intervals
- FrozenReconcilerPreventsModeFlip: freezing BarController on Foo
  eliminates B-Final path (BarController never sees A-1 to flip mode)
- StalenessInfoPopulated: verifies ReconcileResult carries per-object
  staleness details with correct lag/observedRV/currentRV
- TriggerFilteringReducesBarActivity: stale reconciler runs fewer
  steps because trigger filtering suppresses re-triggering
- WindowOutsideRange: inactive window [50,100) matches baseline,
  confirming intervals have no effect outside their sequence range
…stively

Replace the old StalenessConfig/Monte Carlo approach with StalenessInterval-based
plans. For each observed (controller, kind) pair, exhaustively enumerate all
[staleAt, catchUpAt) windows over [0, maxSeq] (a naive strategy that can be refined later).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move applyInputTuning logic into a shared pkg/explore.ApplyInputTuning function
that covers all perturbation fields: MaxDepth, PermuteControllers, StaleReads,
UserActionReadyDepths, StalenessIntervals, and Search/MonteCarlo. Remove the
duplicated local implementations from all five example harnesses.
…phase

Introduce a Phase 6 that guides agents to form targeted perturbation experiments
from unverified hypotheses. Covers the hypothesis→perturbation decision tree,
how to read KindSequences from the trace to configure staleness intervals, variant
file conventions, re-run commands, and per-hypothesis outcome tracking.
tgoodwin and others added 27 commits March 16, 2026 22:14
- replayDynamicClient: bridges client.Client → dynamic.Interface
- replayClientSet: implements kroclient.SetInterface for replay mode
- replayCRDClient: CRD Ensure/Delete/Get via replay client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace all hand-rolled controller simulations with real KRO Instance
Controller and ResourceGraphDefinitionReconciler. Both controllers run
real reconciliation logic through replay client adapters.

- Instance controller: real SSA Apply on child resources, requeue handling
- RGD controller: real graph building, CRD generation, status updates
- Graph built from RGD spec using core schema resolver (no API server)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ution

Both real KRO controllers execute through the kamera harness:
- RGD controller reconciles (CRD creation, status update)
- Instance controller runs (hits reconcile error, needs debugging)
- Graph built using schemaless parsing (nil schema resolver)

Uses typed RGD via generator package for correct API structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Instance Controller uses SSA Apply with partial patches (e.g., only
metadata fields). After Apply, the controller expects the full merged
object back. The dynamic adapter now re-reads via Get after Patch to
return the complete state.

Remaining issue: Instance Controller's SSA Apply for child resources
(Deployment, Service, Ingress) creates them, but the replay client's
applyEffects may not be properly upserting new objects. The controller
reports "waiting for unresolved resource deployment".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing 'image' field to Application instance fixture (CEL
  expressions reference schema.spec.image which triggers "no such key"
  → ErrDataPending without the default value)
- Apply re-reads full object after Patch to return merged state

Both real KRO controllers now execute end-to-end: RGD creates CRD,
Instance Controller creates Deployment/Service/Ingress via SSA Apply,
Deployment controller creates ReplicaSet/Pods, status updates propagate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an APPLY effect doesn't change the object's spec, skip the effect
entirely. This prevents controllers that unconditionally re-apply the
same desired state from creating infinite reconcile loops.

Also increased max depth to 60 for KRO harness (real controllers
produce more cascading state changes than simulations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- K1 scenario tests ordering perturbation between RGD and Instance
  controllers with ingress enabled
- When JSON inputs have no environmentState, fall back to default
  RGD state (avoids needing full typed RGD in JSON)
- Results: 4 converged states, all same hash — no ordering-dependent
  divergence detected

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents:
- Controller surface map (both real KRO controllers)
- Surgical KRO changes (5 commits, all non-behavioral)
- Harness architecture diagram
- K1 ordering scenario results (no divergence)
- Known limitations and Approach B fallback path
- SSA Apply idempotency fix
- Area coverage assessment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K2: Fault injection — Instance Controller crash after 2nd/3rd write
    (partial child resource apply, tests SSA idempotency on retry)
K3: Staleness — Instance Controller reads stale Application during
    replicas scale-up (1→3), tests intermediate state correctness
K4: Ingress toggle — includeWhen condition changes with staleness,
    tests conditional resource creation/deletion under cache lag
K5: Fault injection — RGD Controller crash after CRD create / after
    status update (tests CRD exists but status stale)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fault injection

Instance Controller crash mid-apply produces 12 distinct final states
across 628 runs. The applyset's parallel child Apply + parent metadata
patch creates a vulnerability window where partial applies combined
with ApplySet label state leads to inconsistent prune/conflict behavior.

K5 (RGD fault injection): no divergence — fully idempotent.
K3, K4: need staleness config tuning for permuted runs.

Evidence persisted in .agents/evidence/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…final state

State analysis reveals crash recovery never recreates full child set:
- 425 runs: only Ingress survives (applied before crash)
- 186 runs: no children at all
- 17 runs: varied Application content (rare states)

No final state includes Deployment or Service. The ApplySet metadata
mismatch after partial apply prevents correct re-creation on retry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even with triggerOnce (single crash, then clean recovery), the Instance
Controller fails to recreate all children. Across 2201 runs:
- 1439: only Ingress (no Deployment, no Service)
- 406: Deployment + Ingress (no Service!)
- 196: no children at all

Service is NEVER present in ANY final state. This is a confirmed crash
recovery bug in the Instance Controller's applyset implementation.

Updated ANALYSIS.md with exact source code references to the write
sequence and vulnerability windows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K6: Tests Application deletion with ordering perturbation:
  - Delete with ingress enabled
  - Delete without ingress
  - Fault injection during deletion (crash after 1st delete write)

K7: Tests rapid back-to-back spec changes:
  - Triple scale: replicas 1→3→5
  - Combined: toggle ingress + scale + re-toggle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… CRD deletion

K6 fault injection during deletion: 6 distinct final states across 304 runs:
- 26 runs: Application survives deletion (finalizer not removed)
- 2 runs: complete deletion failure (all 9 objects survive)
- 7 runs: CRD deleted despite allowCRDDeletion=false

K7 rapid spec changes: no divergence (reference only).
K3/K4 staleness: updated to staleReads format, no divergence found.

Updated ANALYSIS.md with full area coverage assessment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
High-level rollup of all confirmed bugs across projects:
- Karpenter: 12 bugs (7×P1, 4×P2, 1×P3)
- Crossplane: 9 findings (cycling blocks full convergence analysis)
- Kratix: 6 bugs (5×P2, 1×P3)
- KRO: 0 bugs (analyzed, clean)
- Total: 27 confirmed findings across 4 projects

Each project links to its detailed ANALYSIS.md for reproduction steps
and trace evidence.
…ionale

Each of the 5 kro changes now documented with exact commit hash, file
path, what changed, and why it's necessary. Includes note that changes
are required regardless of harness location (Go type system constraints).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dle unstructured SSA Apply

Three simulation fidelity fixes discovered while building the CAPI harness:

1. Errored reconcile effects discarded: when a controller returned an error
   mid-reconcile, all API writes that had already succeeded were thrown away.
   In real K8s these writes are durable. Now doReconcile always retrieves
   effects and reconcileAtState passes them through alongside the error.

2. applyConfigToUnstructured only extracted name/ns/kind/apiVersion from
   typed apply configs. CAPI uses unstructuredApplyConfiguration (wrapping
   *Unstructured) for SSA — added fallback via DeepCopyObject to preserve
   the full object including spec/status.

3. Multi-write version resolution used changes[effect.Key] (last-writer-wins
   map) instead of effect.Version for each individual effect. Also refined
   mergeStatusSubresourceObject to distinguish PATCH (preserve status on
   missing) from UPDATE (clear status on missing) semantics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K2b: Instance Controller crash mid-apply — Service never created (P1)
K6: Instance Controller crash during deletion — orphaned children (P1)

Total confirmed bugs across all projects: 29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	pkg/tracecheck/explore.go
…ntics

Real K8s API server (v1.21+) detects SSA Apply/Patch operations that produce
no field changes and skips the etcd write — no resourceVersion bump, no watch
event. Kamera was recording every Apply/Patch as an effect regardless of content
change, causing infinite reconcile cycles with controllers that unconditionally
re-apply desired state (e.g., CAPI MachineSet syncMachines).

Added no-op detection at three points in applyEffects:
- APPLY: compare spec + metadata of new vs existing object
- PATCH status subresource: compare merged hash vs existing hash
- PATCH non-status: compare effect hash vs existing state hash

Also adds metadataEqual helper that compares labels, annotations, ownerRefs,
and finalizers while ignoring server-managed fields.

Note: hash-based comparison for non-status PATCH is not yet catching all
no-op cases — the replay client may produce different hashes for logically
identical objects due to field ordering or empty-vs-nil differences. Further
investigation needed.
…enarios (3 bugs found)

Wire ClaimReconciler into harness with deterministic name generator and
no-op ConnectionPropagator. Upgrade Crossplane v2.1.0 → v2.2.0 for
controller-runtime v0.23.0 compatibility.

Add configurable stubFunctionRunner supporting 4 behaviors (default,
fatal, different-resources, partial-readiness) keyed by function name
in the Composition pipeline.

New findings:
- F6 (P2): Function switch to fatal orphans composed resources + stale
  Ready=True — 7 distinct terminal states across 49 trials
- C2 (P2): Claim deletion orphans XR + composed resources — 98% orphan
  rate across 98 trials
- C4 (P2): Two XRs silently steal ownership of same composed resource —
  ordering-dependent ownerReference on shared ConfigMap

Clean scenarios: F7 (resource switch GC works), F8 (flap recovery),
C1 (claim-XR binding), C3 (claim crash recovery), C5 (manual policy).

Also corrects Crossplane bug count in BUG-FINDINGS.md from 9 to 7
(re-evaluated against ANALYSIS.md evidence) and adds KCP + Cluster API
sections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the TODO block with a proper methodology section explaining
how we found 7 bugs despite all scenarios cycling (F2). Three
workarounds: Monte Carlo aborted-state hash comparison, fixed-depth
event injection, and leveraging divergence on non-cycling objects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…projects

KCP additions:
- KCP17: Pure ordering bug from apibinding reconciler (P2)
- KCP18b: Crash recovery divergence from fault injection (P2)

Three distinct root causes identified:
1. Endpoint condition chain race (no sync beyond watches)
2. Concurrent condition write conflict (multiple LC writers)
3. Crash recovery divergence (multi-write reconciler crash)

9 controllers wired, 20 scenarios, 3 regions explored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove examples/kro/kro from tracking and add it to .gitignore.
These compiled binaries should be built locally, not checked in.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tgoodwin tgoodwin changed the title staleness perturbation strategy staleness perturbation strategy improvements Mar 26, 2026