
update v2-rc2 from develop#1537

Merged
TheLastCicada merged 50 commits into v2-rc2 from develop
Mar 17, 2026
Conversation

@TheLastCicada (Contributor) commented Mar 17, 2026

Note

Medium Risk
Touches organization lifecycle (create/upgrade/reclaim) and wallet/coin-splitting behaviors, which can impact availability and long-running blockchain operations; changes are guarded but could introduce new edge-case deadlocks/timeouts or CI flakiness.

Overview
Organization operations are now serialized across V1/V2. Adds a process-wide org-operation-lock and wires it into V1/V2 org create, upgrade, reclaim, and startup recovery tasks, returning 409 Conflict with live operation status when another org operation is running and surfacing lock status via /v2/organizations/creation-status.

Improves reliability of long-running blockchain workflows. Organization creation/resume now waits for sufficient spendable coins, updates live progress messages during phases, and adds retry logic for transient wallet errors during parallel store creation; datalayer store subscription waiting gains a hard timeout.

Adjusts mirroring and coin-management behavior. Mirror creation is skipped when AUTO_MIRROR_EXTERNAL_STORES is false or when DATALAYER_FILE_SERVER_URL/mirror URL is unset, adds governance-store mirroring for subscriber nodes with configured GOVERNANCE_BODY_ID, and coin management now uses a dust-safe coin size plus exposes a splitInProgress flag to avoid false “insufficient funds” errors during splits.

CI/build updates. Updates pkg/esbuild deps and packaging to explicitly include the sqlite3 native addon, fixes Windows sqlite path handling, adds addon verification + binary smoke test in build.yaml, modifies test config defaults (disables auto-mirroring), and adds a new test-v1-remote-sync workflow job with faucet funding and API-level data validation.

Written by Cursor Bugbot for commit 9abd65c.

TheLastCicada and others added 30 commits March 1, 2026 21:33
The reclaim-home endpoint documentation was present in both V1 and V2
API docs but missing from their tables of contents, making it
undiscoverable when browsing.
V1: add Get Organization Creation Status, Commit all projects in
STAGING, and Commit specific STAGING records by UUID to the TOC.

V2: add List/Filter project and unit GET examples (by orgUid,
program data, marketplace units, tokenized) and Create tokenized
unit on Chia POST example to the TOC.
yao-pkg v6.13.1 has a compatibility issue with Node 24 where the
runtime prelude throws ENOENT instead of allowing the bindings
package to retry alternate paths for node_sqlite3.node. Upgrade to
v6.14.1 which adds proper Node 24 support, add the native addon as
a pkg asset as a fallback, and add a CI step that fails fast with
a helpful message if the sqlite3 path changes in the future.
Update esbuild override from 0.25.12 to 0.27.3 to match the
requirement of @yao-pkg/pkg@6.14.1, which requires esbuild@^0.27.3.
The previous override caused a forced downgrade that risked runtime
failures during the binary packaging step.

Also fix the Windows sqlite-path matrix value to use forward slashes
so the bash-based verification step correctly resolves the path, and
add shell: bash to the Copy sqlite3 step for consistency.
Start the built binary after copy sqlite3 step, poll /health for up to
60 seconds, and fail the build if the binary crashes or never responds.
Catches missing or mispathed node_sqlite3.node and other native addon
issues before artifacts are signed and uploaded.
pkg 6.14.1's prelude resolves native addons at
node_modules/sqlite3/build/node_sqlite3.node (without Release/) but
the file only exists at build/Release/. Add a prepare-pkg-assets step
to all build scripts that copies the .node file to the expected path
and include both paths in pkg.assets so the snapshot contains the
addon where the prelude looks for it.

Also remove deprecated Vercel pkg from global install since the build
scripts use the local @yao-pkg/pkg from node_modules/.bin.
wait PID || true always sets $? to 0 because true succeeds. Use
wait PID || EXIT_CODE=$? instead so the actual process exit code
is reported when the binary crashes during startup.
…eation

_createStoresInParallel had no retry logic for transient wallet errors,
causing V2 org creation to fail permanently when the Chia wallet's
DataLayer wallet was in a transitional state. This was especially
likely with parallel store creation since multiple simultaneous
create_new_dl RPC calls compound the race condition.

Add per-store retry logic (10 attempts, 30s delay) matching the
existing pattern in addV2ToExistingGovernanceBody. Transient errors
including "DataLayer Wallet already exists" (downstream symptom of
"DataLayerWallet not available" race) are now retried instead of
causing immediate failure.
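The per-store retry pattern described above (10 attempts, 30 s delay, transient errors retried, everything else failing fast) can be sketched as follows; `createStore` and `isTransient` are hypothetical stand-ins for the real RPC call and error classifier:

```javascript
// Sketch of per-store retry for transient wallet errors (names assumed).
// Transient errors are retried up to maxRetries times; any other error
// fails immediately, and the last error is rethrown when retries run out.
async function createStoreWithRetry(createStore, isTransient, {
  maxRetries = 10,
  delayMs = 30_000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt += 1) {
    try {
      return await createStore();
    } catch (error) {
      lastError = error;
      if (!isTransient(error)) throw error; // permanent errors fail fast
      if (attempt < maxRetries) await sleep(delayMs);
    }
  }
  throw lastError; // all retries exhausted
}
```

Injecting `sleep` keeps the helper testable without real 30-second waits.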
fix(CI): upgrade yao-pkg and fix Node 24 native addon resolution
The waitForSync loop in getSubscribedStoreData() could block
indefinitely if a store never finishes syncing. Add a 10-minute
deadline (matching the timeout used in the v2
getRegistryStoreIdFromSingleton) so a stuck store throws instead
of causing an infinite blocking loop.
fix: add 10-minute timeout to getSubscribedStoreData sync wait loop
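A bounded version of the sync-wait loop might look like the sketch below; `isSynced` stands in for the real datalayer sync check, and the injectable `sleep`/`now` hooks are illustration-only:

```javascript
// Sketch of a sync-wait loop with a hard deadline (implementation assumed).
// A store that never finishes syncing now throws instead of blocking forever.
async function waitForSync(isSynced, {
  timeoutMs = 10 * 60 * 1000, // 10-minute hard deadline
  pollMs = 5000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = Date.now,
} = {}) {
  const deadline = now() + timeoutMs;
  while (!(await isSynced())) {
    if (now() >= deadline) {
      throw new Error('Timed out waiting for store to sync');
    }
    await sleep(pollMs);
  }
}
```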
…r checks

The case-sensitive includes('wallet') check didn't match any of the
specific Chia error messages (which all use uppercase 'Wallet') and
instead acted as a catch-all for unrelated errors containing the
substring, causing unnecessary retries of up to 5 minutes.
docs: add reclaim-home endpoint to table of contents
COIN_SIZE was hardcoded to 1,000,000 mojos while CADT operations
require DEFAULT_COIN_AMOUNT + DEFAULT_FEE (typically 600,000,000
mojos). This caused a perpetual splitting loop where coins were
created 600x too small to be usable, wasting fees and temporarily
draining spendable balance.

Set COIN_SIZE = DEFAULT_COIN_AMOUNT + DEFAULT_FEE so each split coin
can independently fund one full DataLayer operation. Add a
splitInProgress flag so mirror-check tasks log a warning instead of
an error when balance is temporarily reduced during a split.
fix: size split coins to match operational requirements
…etry check

The isTransient check in upgradeFromV1 was missing this error string
after the overly broad includes('wallet') was removed, causing the
v1-to-v2 upgrade path to fail permanently on this transient error
instead of retrying.
…nParallel

If the for loop somehow exhausted without returning (e.g. maxRetries
changed to 0), the async callback would return undefined, causing a
TypeError when downstream code accesses result.success.
Chia's default xch_spam_amount is 1,000,000 mojos. Coins smaller than
this may be filtered out by the wallet's spam filter. Since this setting
isn't available via RPC and CADT may run on a different host, use the
default as a floor so split coins are never below the dust threshold.
Test used COIN_SIZE = MIN_USABLE_COIN_SIZE (3,300) but production
computes COIN_SIZE = Math.max(MIN_USABLE_COIN_SIZE, DUST_FILTER_FLOOR)
which equals 1,000,000. Updated constants and assertions to match
the actual coin-splitting arithmetic.
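The coin-size arithmetic above can be sketched in one expression; the constant names follow the commit messages, but the helper and its values are illustrative:

```javascript
// Coin-size computation (names from the commit messages, values illustrative).
// Each split coin must fund one full DataLayer operation and must also clear
// Chia's default xch_spam_amount dust filter of 1,000,000 mojos.
const DUST_FILTER_FLOOR = 1_000_000; // Chia default xch_spam_amount, in mojos

function computeCoinSize(defaultCoinAmount, defaultFee) {
  const minUsable = defaultCoinAmount + defaultFee; // one full operation
  return Math.max(minUsable, DUST_FILTER_FLOOR);    // never below dust
}
```

With test-sized constants (a few thousand mojos) the dust floor dominates and `computeCoinSize` returns 1,000,000; with production-sized constants (~600,000,000 mojos) the operation cost dominates.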
Both _createStoresInParallel and upgradeFromV1 maintained independent
copies of the transient wallet error list. This duplication already
caused a drift bug caught during this PR. Extract to a single
module-level helper.
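The extracted helper might look like the sketch below; the two error strings come from the commit messages and are not the complete production list:

```javascript
// Sketch of the shared transient-error classifier (implementation assumed).
// Keeping the list in one module-level place prevents V1/V2 drift.
const TRANSIENT_WALLET_ERRORS = [
  'DataLayer Wallet already exists', // downstream symptom of the wallet race
  'DataLayerWallet not available',
];

function isTransientWalletError(error) {
  const message = error?.message ?? String(error);
  return TRANSIENT_WALLET_ERRORS.some((known) => message.includes(known));
}
```

Note the exact-substring matching: unlike the removed case-sensitive `includes('wallet')` catch-all, unrelated errors no longer trigger retries.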
…n-retry

fix: add retry logic for transient wallet errors in parallel store creation
When a data layer store is stuck syncing (e.g., a delta file missing
from all mirrors), the sync-registries task would detect the root
history mismatch every 5 seconds, log a warn, and return -- repeating
identically forever. This caused massive log noise, unnecessary CPU
from repeated root history + sync status queries, and no useful signal
after the first log entry.

Add per-org mismatch tracking with exponential backoff (30s initial,
2x multiplier, 10min cap) and log throttling (warn on first hit, debug
on retries, info summary every 5 minutes). When the mismatch resolves,
the tracker clears and normal 5s polling resumes immediately.
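The per-org backoff schedule (30 s initial, 2x multiplier, 10 min cap, cleared on resolution) can be sketched as a small class; `SyncMismatchBackoff` is the class name a later commit extracts, but this implementation is assumed:

```javascript
// Sketch of per-org exponential backoff for root-history mismatches.
class SyncMismatchBackoff {
  constructor({ initialMs = 30_000, multiplier = 2, capMs = 600_000 } = {}) {
    this.initialMs = initialMs;
    this.multiplier = multiplier;
    this.capMs = capMs;
    this.delays = new Map(); // orgUid -> next backoff delay
  }

  // Returns the delay before re-checking this org, doubling it
  // (up to the cap) for the next mismatch.
  nextDelay(orgUid) {
    const current = this.delays.get(orgUid) ?? this.initialMs;
    this.delays.set(orgUid, Math.min(current * this.multiplier, this.capMs));
    return current;
  }

  // Clears tracking when the mismatch resolves so 5 s polling resumes.
  clear(orgUid) {
    this.delays.delete(orgUid);
  }
}
```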
Prevent simultaneous execution of organization create, upgrade, and
reclaim operations with a process-level in-memory lock. Conflicting
requests now receive 409 Conflict with live operation status including
the operation name, current step, start time, and elapsed seconds.

- Add org-operation-lock module with tryAcquire/release/status tracking
- Guard V1 and V2 create, upgrade, and reclaim controller endpoints
- Add updateOrgLockStatus milestone calls in model creation/upgrade flows
- Integrate lock acquire/release into V1 and V2 startup recovery tasks
- Enhance GET /v2/organizations/creation-status with liveStatus field
- Add unit tests for lock module and integration tests for endpoint guards
- Document 409 behavior and liveStatus in cadt_rpc_api_v2.md
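The PR guards each controller directly rather than using middleware, but the 409 behavior can be illustrated with a middleware-style sketch; `tryAcquireOrgLock` and `getOrgLockStatus` stand in for the org-operation-lock module's API:

```javascript
// Illustrative sketch of the 409 Conflict guard (the actual change guards
// each create/upgrade/reclaim handler inline; names assumed).
function guardOrgOperation(tryAcquireOrgLock, getOrgLockStatus) {
  return (req, res, next) => {
    if (!tryAcquireOrgLock(req.operationName)) {
      const status = getOrgLockStatus();
      return res.status(409).json({
        message: 'Another organization operation is already running',
        ...status, // operation name, current step, start time, elapsed seconds
      });
    }
    next();
  };
}
```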
TheLastCicada and others added 20 commits March 13, 2026 13:18
Move duplicated per-org exponential backoff logic from both
sync-registries.js and sync-registries-v2.js into a shared
SyncMismatchBackoff class. Algorithm and log output are identical;
future tuning only needs to change one place.
Add a 1-hour staleness TTL to the org operation lock so that a hung
background operation (promise that never resolves/rejects) cannot
permanently block all future organization operations. When a lock
exceeds MAX_LOCK_AGE_MS, tryAcquireOrgLock force-releases it with a
console warning and allows the new acquisition.

Also add the missing releaseOrgLock() call in the V2 create endpoint's
outer catch block, which could permanently hold the lock if a database
query threw between lock acquisition and the async creation branches.
The outer catch blocks in all 6 guarded handlers unconditionally called
releaseOrgLock(), but the lock is acquired partway through the try
block (after assertions like assertWalletIsSynced). If a pre-lock
assertion throws while another operation holds the lock, the catch
would release that other operation's lock, defeating the concurrency
guard.

Add a lockAcquired flag to each handler, set after successful
tryAcquireOrgLock(), and gate the outer catch release on it.
Replace all bare releaseOrgLock() calls in handler try blocks with a
local releaseLock() closure that atomically checks and clears the
lockAcquired flag before releasing. This eliminates two classes of bugs:

1. Early-release paths (e.g. V1 org detected) left lockAcquired=true.
   Any subsequent await could yield, letting another request acquire
   the lock, and if an error then reached the outer catch it would
   release that other operation's lock.

2. Reclaim handlers had an inner try/finally that released the lock,
   then the outer catch also released — a double-release that could
   free another operation's lock if one was acquired in between.

The only remaining bare releaseOrgLock() calls are inside async
.finally() callbacks on background operations, where lockAcquired is
explicitly set to false at handoff time.
releaseOrgLock() had no concept of who held the lock. When a background
.finally() ran after its operation's lock was force-released due to TTL
expiry, it silently released a different operation's lock, defeating the
concurrency guard.

tryAcquireOrgLock() now returns an opaque ownership token (or null on
failure). releaseOrgLock(token) only releases if the token matches the
current holder, making stale releases a safe no-op. All controllers and
recovery tasks pass captured tokens through their async .finally() and
try/finally blocks.
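The ownership-token scheme reduces to a few lines; the function names match the text above, while the implementation itself is a sketch:

```javascript
// Sketch of token-based lock ownership: release is a safe no-op unless the
// caller presents the token returned by its own acquire.
let currentHolder = null;

function tryAcquireOrgLock(operation) {
  if (currentHolder) return null; // another operation holds the lock
  currentHolder = { operation, token: Symbol(operation) };
  return currentHolder.token;
}

function releaseOrgLock(token) {
  if (!currentHolder || currentHolder.token !== token) {
    return false; // stale or mismatched release: no-op
  }
  currentHolder = null;
  return true;
}
```

Because every acquisition mints a fresh `Symbol`, a background `.finally()` holding a token from a force-released lock can never free a newer holder's lock.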
…k release

updateOrgLockStatus() unconditionally wrote to currentStatus without
verifying the caller owns the lock. After a stale lock force-release
(1-hour TTL), background code from the old operation could overwrite
the new holder's status on the creation-status endpoint.

Change updateOrgLockStatus(status) to updateOrgLockStatus(token, status)
so only the current lock holder can update progress. Thread the lock
token through createHomeOrganization, _resumeOrganizationCreation,
_executeOrganizationCreation, and upgradeFromV1 in both V1 and V2
models, controllers, and recovery tasks.

Also move releaseLock() in the V2 create handler's V1-org check block
from before the async singleton data fetch to just before each return,
closing a window where a concurrent request could acquire the lock
while async I/O was still in flight.

Remove getOrgLockOperation and isOrgLocked exports (only consumed by
tests, not production code). Refactor tests to use getOrgLockStatus()
and add coverage for stale-token rejection.
When a lock-held operation (e.g. upgrade) was running but no Meta-based
creation existed, the creation-status response had inProgress: true
alongside message: "No organization creation in progress" because the
lock-status spread overrode inProgress but not message.

Include a coherent message derived from the lock operation name and
current step. Update the API docs example and add a test assertion
for the message field.
…ency-guard

feat: add in-memory concurrency guard for organization operations
V1 _createStoresInParallel called createDataLayerStore without checking
wallet sync status, causing permanent failure when the wallet
temporarily desyncs. The V2 equivalent already has this protection.

Add waitForSpendableCoins() before V1 store creation and resume paths,
and per-store retry with exponential backoff for transient wallet errors
(matching the V2 implementation).

Extract TRANSIENT_WALLET_ERRORS and isTransientWalletError into
wallet.js so both V1 and V2 models share a single definition. Replace
hardcoded [v2]: log prefix in wallet.js with [wallet]: since the
function is now called from both V1 and V2 paths. Use actual needed
coin count from getStoresToCreate(state) in resume paths instead of
hardcoded 4.
The v1 importOrganization had a chicken-and-egg bug where it checked
datalayer sync status before subscribing to the store. For stores that
were never subscribed, get_sync_status returns an error (store unknown
to datalayer), which was interpreted as "not synced" causing an early
return. The subscription call was never reached, so the store was never
subscribed, and every retry hit the same wall.

Port the subscribe-first pattern already used in v2 importOrganization:
subscribe to the store first (no-op if already subscribed), then check
sync status. This ensures new stores begin syncing on the first pass
and subsequent retries find them progressively more synced.
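The subscribe-first pattern can be sketched as below; `subscribe` and `getSyncStatus` are hypothetical stand-ins for the datalayer RPC wrappers:

```javascript
// Sketch of subscribe-first import: subscribing is a no-op when already
// subscribed, so the sync check only ever runs against a known store.
async function importOrganizationStore(storeId, { subscribe, getSyncStatus }) {
  await subscribe(storeId); // no-op if already subscribed
  const status = await getSyncStatus(storeId); // store is now known to datalayer
  if (!status.synced) {
    return { imported: false, reason: 'still syncing' }; // retry later
  }
  return { imported: true };
}
```

Ordering is the whole fix: checking sync status first on a never-subscribed store returns an error that reads as "not synced", and the subscribe call is never reached.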
…et-retry

fix: add wallet readiness check and retry logic to V1 store creation
…e-first

fix: subscribe to org store before sync check in v1 importOrganization
The mirror-check task only mirrored governance stores for governance
body owners (nodes with governanceBodyId/mainGoveranceBodyId in meta).
Subscriber nodes subscribe to governance stores via GOVERNANCE_BODY_ID
in config but the mirror task had no code path to mirror them.

Read GOVERNANCE_BODY_ID from config when meta entries are absent,
mirror the main governance store, then resolve and mirror its version
store via the data model version key.
…ware

The global middleware asserted wallet sync status on every non-health
request, including read-only GETs. When the wallet transiently desyncs
while processing new datalayer block confirmations, all GET endpoints
return 400 even though they only read from the local database.

Skip assertWalletIsAvailable() for GET requests, matching the existing
pattern where the home org sync check already allows GETs through.
Write operations still require the wallet to be synced.
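The GET bypass amounts to one early return in the middleware; this is an Express-style sketch with `assertWalletIsAvailable` injected, not the actual middleware:

```javascript
// Sketch of the middleware change: read-only GETs skip the wallet
// availability assertion; writes still require a synced wallet.
function walletSyncGuard(assertWalletIsAvailable) {
  return async (req, res, next) => {
    if (req.method === 'GET') return next(); // reads only hit the local DB
    try {
      await assertWalletIsAvailable();
      next();
    } catch (error) {
      res.status(400).json({ message: error.message });
    }
  };
}
```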
…-on-get

fix: skip wallet availability check for GET requests in global middleware
Replace manual subscribe/import flow in test-v2-remote-sync with
automatic governance-based discovery. Add faucet funding, mirror
creation validation, and mirror cleanup. Create new parallel
test-v1-remote-sync job with identical structure using V1 config,
governance, and data validation against the V1 participant instance.
Mirrors created on shared governance stores during CI runs leave
orphaned coins that can never be deleted (the CI wallet key is
destroyed after each run). These accumulate over time and contribute
to wallet sync delays in subsequent runs.

Decouple inline store mirroring from the background mirror-check task:
createDataLayerStore now mirrors based on DATALAYER_FILE_SERVER_URL
being configured rather than AUTO_MIRROR_EXTERNAL_STORES. This lets
CI create mirrors on ephemeral org stores (safe) while keeping the
background task disabled so governance stores are never mirrored.

- Set AUTO_MIRROR_EXTERNAL_STORES=false in all CI jobs
- Set DATALAYER_FILE_SERVER_URL=null for sync, upgrade, and governance
  jobs (no mirrors at all)
- Keep DATALAYER_FILE_SERVER_URL set for v1/v2 live-api jobs (mirrors
  only on org-created stores, validated by existing test helpers)
- Remove mirror verify and delete steps from sync jobs
- Remove mirror delete step from v2 live-api job
Replace nested try/finally inside try/catch with a flat
try/catch/finally in both V1 and V2 reclaimHome handlers. The inner
finally nulled the lock token, making the outer catch release a no-op.
A single finally block handles lock release on all exit paths.
subscribeToStoreOnDataLayer unconditionally created a mirror on every
subscribe, ignoring AUTO_MIRROR_EXTERNAL_STORES. Gate the addMirror
call on the config setting (defaulting to true) so users who set it
to false do not get mirrors for external stores.

Also update the README to clarify that both AUTO_MIRROR_EXTERNAL_STORES
and DATALAYER_FILE_SERVER_URL must be configured for external store
mirroring to function.
…-backoff

fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp
@TheLastCicada TheLastCicada merged commit 769d850 into v2-rc2 Mar 17, 2026
31 of 32 checks passed

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


// Default AUTO_MIRROR_EXTERNAL_STORES to true if it is null or undefined
// This makes sure mirroring runs by default even if the config param is missing
const shouldMirror = AUTO_MIRROR_EXTERNAL_STORES ?? true;


AUTO_MIRROR_EXTERNAL_STORES config ignored in store creation

Medium Severity

The AUTO_MIRROR_EXTERNAL_STORES config setting is no longer consulted when creating new DataLayer stores. The old code explicitly checked AUTO_MIRROR_EXTERNAL_STORES ?? true before calling addMirror; the new code checks only if (mirrorUrl), which is true whenever DATALAYER_FILE_SERVER_URL is configured. A user who sets AUTO_MIRROR_EXTERNAL_STORES = false while also having a DATALAYER_FILE_SERVER_URL configured will now get mirrors created for every new store against their stated preference. The subscription path in persistance.js correctly retained the shouldMirror check, creating an inconsistency between the two code paths.


