Conversation
The reclaim-home endpoint documentation was present in both V1 and V2 API docs but missing from their tables of contents, making it undiscoverable when browsing.
V1: add Get Organization Creation Status, Commit all projects in STAGING, and Commit specific STAGING records by UUID to the TOC. V2: add List/Filter project and unit GET examples (by orgUid, program data, marketplace units, tokenized) and Create tokenized unit on Chia POST example to the TOC.
…overnance heading
yao-pkg v6.13.1 has a compatibility issue with Node 24: the runtime prelude throws ENOENT instead of allowing the bindings package to retry alternate paths for node_sqlite3.node. Upgrade to v6.14.1, which adds proper Node 24 support; add the native addon as a pkg asset as a fallback; and add a CI step that fails fast with a helpful message if the sqlite3 path changes in the future.
Update esbuild override from 0.25.12 to 0.27.3 to match the requirement of @yao-pkg/pkg@6.14.1, which requires esbuild@^0.27.3. The previous override caused a forced downgrade that risked runtime failures during the binary packaging step. Also fix the Windows sqlite-path matrix value to use forward slashes so the bash-based verification step correctly resolves the path, and add shell: bash to the Copy sqlite3 step for consistency.
Start the built binary after the Copy sqlite3 step, poll /health for up to 60 seconds, and fail the build if the binary crashes or never responds. This catches missing or mispathed node_sqlite3.node files and other native addon issues before artifacts are signed and uploaded.
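The poll loop described above can be sketched as a small bash helper. The function name, the probe command, and the URL/timeout values are placeholders, not the actual build.yaml step:

```shell
# Hedged sketch of the smoke-test loop; poll_health and the example URL are
# illustrative, not the real CI step.
poll_health() {
  local probe=$1 timeout=$2
  local deadline=$((SECONDS + timeout))
  while [ "$SECONDS" -lt "$deadline" ]; do
    if $probe; then
      return 0   # binary answered the health check
    fi
    sleep 1
  done
  return 1       # never became healthy: fail the build
}

# Real usage would look something like:
#   poll_health "curl -sf http://127.0.0.1:31310/health" 60 || exit 1
```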
pkg 6.14.1's prelude resolves native addons at node_modules/sqlite3/build/node_sqlite3.node (without Release/) but the file only exists at build/Release/. Add a prepare-pkg-assets step to all build scripts that copies the .node file to the expected path and include both paths in pkg.assets so the snapshot contains the addon where the prelude looks for it. Also remove deprecated Vercel pkg from global install since the build scripts use the local @yao-pkg/pkg from node_modules/.bin.
wait PID || true always sets $? to 0 because true succeeds. Use wait PID || EXIT_CODE=$? instead so the actual process exit code is reported when the binary crashes during startup.
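A minimal repro of the pitfall, using a child process that exits with a known nonzero code:

```shell
# '|| true' would mask the child's real exit code; '|| EXIT_CODE=$?' captures it.
sh -c 'exit 3' &
PID=$!

EXIT_CODE=0
wait "$PID" || EXIT_CODE=$?   # with 'wait "$PID" || true', EXIT_CODE would stay 0
echo "binary exited with code $EXIT_CODE"
```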
…eation _createStoresInParallel had no retry logic for transient wallet errors, causing V2 org creation to fail permanently when the Chia wallet's DataLayer wallet was in a transitional state. This was especially likely with parallel store creation since multiple simultaneous create_new_dl RPC calls compound the race condition. Add per-store retry logic (10 attempts, 30s delay) matching the existing pattern in addV2ToExistingGovernanceBody. Transient errors including "DataLayer Wallet already exists" (downstream symptom of "DataLayerWallet not available" race) are now retried instead of causing immediate failure.
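The per-store retry pattern can be sketched as below. The helper names, the error list, and the injectable sleep are illustrative, not the actual CADT implementation; the two error strings are the ones quoted in this commit:

```javascript
// Hedged sketch of per-store retry for transient wallet errors.
const TRANSIENT_WALLET_ERRORS = [
  'DataLayer Wallet already exists',
  'DataLayerWallet not available',
];

const isTransientWalletError = (error) =>
  TRANSIENT_WALLET_ERRORS.some((msg) => String(error.message).includes(msg));

async function createStoreWithRetry(
  createStore,
  { maxRetries = 10, delayMs = 30_000, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt += 1) {
    try {
      return await createStore();
    } catch (error) {
      lastError = error;
      if (!isTransientWalletError(error)) throw error; // permanent errors fail fast
      if (attempt < maxRetries) await sleep(delayMs);
    }
  }
  // Explicit throw so the callback never resolves to undefined
  throw lastError;
}
```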
fix(CI): upgrade yao-pkg and fix Node 24 native addon resolution
The waitForSync loop in getSubscribedStoreData() could block indefinitely if a store never finishes syncing. Add a 10-minute deadline (matching the timeout used in the v2 getRegistryStoreIdFromSingleton) so a stuck store throws instead of causing an infinite blocking loop.
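The deadline pattern looks roughly like this; the function signature and the injectable sleep are illustrative stand-ins for the real datalayer wrappers:

```javascript
// Sketch of adding a hard deadline to a sync-wait loop.
const SYNC_WAIT_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, matching the v2 timeout

async function waitForStoreSync(
  isSynced,
  { timeoutMs = SYNC_WAIT_TIMEOUT_MS, pollMs = 5000, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) } = {}
) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isSynced()) return true;
    await sleep(pollMs);
  }
  // A stuck store now throws instead of blocking forever
  throw new Error(`store did not sync within ${timeoutMs / 60000} minutes`);
}
```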
fix: add 10-minute timeout to getSubscribedStoreData sync wait loop
Update Managed Files
…r checks
The case-sensitive includes('wallet') check didn't match any of the
specific Chia error messages (which all use uppercase 'Wallet') and
instead acted as a catch-all for unrelated errors containing the
substring, causing unnecessary retries of up to 5 minutes.
docs: add reclaim-home endpoint to table of contents
COIN_SIZE was hardcoded to 1,000,000 mojos while CADT operations require DEFAULT_COIN_AMOUNT + DEFAULT_FEE (typically 600,000,000 mojos). This caused a perpetual splitting loop where coins were created 600x too small to be usable, wasting fees and temporarily draining spendable balance. Set COIN_SIZE = DEFAULT_COIN_AMOUNT + DEFAULT_FEE so each split coin can independently fund one full DataLayer operation. Add a splitInProgress flag so mirror-check tasks log a warning instead of an error when balance is temporarily reduced during a split.
fix: size split coins to match operational requirements
…etry check
The isTransient check in upgradeFromV1 was missing this error string
after the overly broad includes('wallet') was removed, causing the
v1-to-v2 upgrade path to fail permanently on this transient error
instead of retrying.
…nParallel If the for loop was somehow exhausted without returning (e.g. maxRetries changed to 0), the async callback would return undefined, causing a TypeError when downstream code accessed result.success.
Chia's default xch_spam_amount is 1,000,000 mojos. Coins smaller than this may be filtered out by the wallet's spam filter. Since this setting isn't available via RPC and CADT may run on a different host, use the default as a floor so split coins are never below the dust threshold.
Test used COIN_SIZE = MIN_USABLE_COIN_SIZE (3,300) but production computes COIN_SIZE = Math.max(MIN_USABLE_COIN_SIZE, DUST_FILTER_FLOOR) which equals 1,000,000. Updated constants and assertions to match the actual coin-splitting arithmetic.
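The coin-size floor described above is just a max over two constants. The constant names and values below are the ones quoted in these commits, shown as a sketch rather than the actual module:

```javascript
// Illustrative arithmetic for the dust-safe coin size.
const DUST_FILTER_FLOOR = 1_000_000;  // Chia's default xch_spam_amount, in mojos
const MIN_USABLE_COIN_SIZE = 3_300;   // minimum usable coin size from the test

// Never split coins below the dust threshold, or the wallet's spam filter
// may hide them from the spendable balance.
const COIN_SIZE = Math.max(MIN_USABLE_COIN_SIZE, DUST_FILTER_FLOOR);
```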
Both _createStoresInParallel and upgradeFromV1 maintained independent copies of the transient wallet error list. This duplication already caused a drift bug caught during this PR. Extract to a single module-level helper.
…n-retry fix: add retry logic for transient wallet errors in parallel store creation
When a data layer store is stuck syncing (e.g., a delta file missing from all mirrors), the sync-registries task would detect the root history mismatch every 5 seconds, log a warning, and return, repeating identically forever. This caused massive log noise, unnecessary CPU from repeated root history + sync status queries, and no useful signal after the first log entry. Add per-org mismatch tracking with exponential backoff (30s initial, 2x multiplier, 10min cap) and log throttling (warn on first hit, debug on retries, info summary every 5 minutes). When the mismatch resolves, the tracker clears and normal 5s polling resumes immediately.
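The per-org backoff can be sketched as a small class. The class and method names here are illustrative (the real extraction is called SyncMismatchBackoff in a later commit); the 30s/2x/10min parameters are the ones stated above:

```javascript
// Hedged sketch of per-org exponential backoff for sync mismatches.
class MismatchBackoff {
  constructor({ initialMs = 30_000, multiplier = 2, maxMs = 600_000 } = {}) {
    this.initialMs = initialMs;
    this.multiplier = multiplier;
    this.maxMs = maxMs;
    this.delays = new Map(); // orgUid -> next delay to use
  }

  // Returns the delay before re-checking this org, doubling on each call
  // until the cap is reached.
  next(orgUid) {
    const current = this.delays.get(orgUid) ?? this.initialMs;
    this.delays.set(orgUid, Math.min(current * this.multiplier, this.maxMs));
    return current;
  }

  // Mismatch resolved: clear state so normal 5s polling resumes immediately.
  clear(orgUid) {
    this.delays.delete(orgUid);
  }
}
```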
Update Managed Files
Prevent simultaneous execution of organization create, upgrade, and reclaim operations with a process-level in-memory lock. Conflicting requests now receive 409 Conflict with live operation status including the operation name, current step, start time, and elapsed seconds.
- Add org-operation-lock module with tryAcquire/release/status tracking
- Guard V1 and V2 create, upgrade, and reclaim controller endpoints
- Add updateOrgLockStatus milestone calls in model creation/upgrade flows
- Integrate lock acquire/release into V1 and V2 startup recovery tasks
- Enhance GET /v2/organizations/creation-status with liveStatus field
- Add unit tests for lock module and integration tests for endpoint guards
- Document 409 behavior and liveStatus in cadt_rpc_api_v2.md
Move duplicated per-org exponential backoff logic from both sync-registries.js and sync-registries-v2.js into a shared SyncMismatchBackoff class. Algorithm and log output are identical; future tuning only needs to change one place.
Add a 1-hour staleness TTL to the org operation lock so that a hung background operation (promise that never resolves/rejects) cannot permanently block all future organization operations. When a lock exceeds MAX_LOCK_AGE_MS, tryAcquireOrgLock force-releases it with a console warning and allows the new acquisition. Also add the missing releaseOrgLock() call in the V2 create endpoint's outer catch block, which could permanently hold the lock if a database query threw between lock acquisition and the async creation branches.
The outer catch blocks in all 6 guarded handlers unconditionally called releaseOrgLock(), but the lock is acquired partway through the try block (after assertions like assertWalletIsSynced). If a pre-lock assertion throws while another operation holds the lock, the catch would release that other operation's lock, defeating the concurrency guard. Add a lockAcquired flag to each handler, set after successful tryAcquireOrgLock(), and gate the outer catch release on it.
Replace all bare releaseOrgLock() calls in handler try blocks with a local releaseLock() closure that atomically checks and clears the lockAcquired flag before releasing. This eliminates two classes of bugs:
1. Early-release paths (e.g. V1 org detected) left lockAcquired=true. Any subsequent await could yield, letting another request acquire the lock, and if an error then reached the outer catch it would release that other operation's lock.
2. Reclaim handlers had an inner try/finally that released the lock, then the outer catch also released: a double-release that could free another operation's lock if one was acquired in between.
The only remaining bare releaseOrgLock() calls are inside async .finally() callbacks on background operations, where lockAcquired is explicitly set to false at handoff time.
releaseOrgLock() had no concept of who held the lock. When a background .finally() ran after its operation's lock was force-released due to TTL expiry, it silently released a different operation's lock, defeating the concurrency guard. tryAcquireOrgLock() now returns an opaque ownership token (or null on failure). releaseOrgLock(token) only releases if the token matches the current holder, making stale releases a safe no-op. All controllers and recovery tasks pass captured tokens through their async .finally() and try/finally blocks.
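The token-ownership and TTL mechanics can be sketched as below. The function names mirror the descriptions above, but this is an illustrative single-lock model, not the actual module:

```javascript
// Hedged sketch of a token-owned, TTL-guarded in-memory operation lock.
const MAX_LOCK_AGE_MS = 60 * 60 * 1000; // 1-hour staleness TTL
let lock = null; // { token, operation, startedAt, currentStatus }

function tryAcquireOrgLock(operation, now = Date.now()) {
  if (lock && now - lock.startedAt > MAX_LOCK_AGE_MS) {
    console.warn(`force-releasing stale ${lock.operation} lock`);
    lock = null;
  }
  if (lock) return null; // another operation holds the lock
  lock = { token: Symbol(operation), operation, startedAt: now, currentStatus: null };
  return lock.token; // opaque ownership token
}

function releaseOrgLock(token) {
  // Stale tokens (e.g. from a force-released operation) are a safe no-op.
  if (lock && lock.token === token) lock = null;
}

function updateOrgLockStatus(token, status) {
  // Only the current holder may update progress.
  if (lock && lock.token === token) lock.currentStatus = status;
}

function getOrgLockStatus() {
  return lock
    ? { inProgress: true, operation: lock.operation, currentStatus: lock.currentStatus }
    : { inProgress: false };
}
```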
…k release updateOrgLockStatus() unconditionally wrote to currentStatus without verifying the caller owns the lock. After a stale lock force-release (1-hour TTL), background code from the old operation could overwrite the new holder's status on the creation-status endpoint. Change updateOrgLockStatus(status) to updateOrgLockStatus(token, status) so only the current lock holder can update progress. Thread the lock token through createHomeOrganization, _resumeOrganizationCreation, _executeOrganizationCreation, and upgradeFromV1 in both V1 and V2 models, controllers, and recovery tasks. Also move releaseLock() in the V2 create handler's V1-org check block from before the async singleton data fetch to just before each return, closing a window where a concurrent request could acquire the lock while async I/O was still in flight. Remove getOrgLockOperation and isOrgLocked exports (only consumed by tests, not production code). Refactor tests to use getOrgLockStatus() and add coverage for stale-token rejection.
When a lock-held operation (e.g. upgrade) was running but no Meta-based creation existed, the creation-status response had inProgress: true alongside message: "No organization creation in progress" because the lock-status spread overrode inProgress but not message. Include a coherent message derived from the lock operation name and current step. Update the API docs example and add a test assertion for the message field.
…ency-guard feat: add in-memory concurrency guard for organization operations
V1 _createStoresInParallel called createDataLayerStore without checking wallet sync status, causing permanent failure when the wallet temporarily desyncs. The V2 equivalent already has this protection. Add waitForSpendableCoins() before V1 store creation and resume paths, and per-store retry with exponential backoff for transient wallet errors (matching the V2 implementation). Extract TRANSIENT_WALLET_ERRORS and isTransientWalletError into wallet.js so both V1 and V2 models share a single definition. Replace hardcoded [v2]: log prefix in wallet.js with [wallet]: since the function is now called from both V1 and V2 paths. Use actual needed coin count from getStoresToCreate(state) in resume paths instead of hardcoded 4.
The v1 importOrganization had a chicken-and-egg bug where it checked datalayer sync status before subscribing to the store. For stores that were never subscribed, get_sync_status returns an error (store unknown to datalayer), which was interpreted as "not synced" causing an early return. The subscription call was never reached, so the store was never subscribed, and every retry hit the same wall. Port the subscribe-first pattern already used in v2 importOrganization: subscribe to the store first (no-op if already subscribed), then check sync status. This ensures new stores begin syncing on the first pass and subsequent retries find them progressively more synced.
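The subscribe-first ordering is small but easy to get backwards; a sketch, where subscribe and getSyncStatus are stand-ins for the real datalayer RPC wrappers:

```javascript
// Hedged sketch of the subscribe-first import pattern.
async function importStoreSubscribeFirst(storeId, { subscribe, getSyncStatus }) {
  // Subscribing is a no-op when already subscribed, and guarantees that the
  // sync-status call knows about the store on the very first attempt.
  await subscribe(storeId);
  return getSyncStatus(storeId); // first pass typically reports not-yet-synced
}
```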
…et-retry fix: add wallet readiness check and retry logic to V1 store creation
…e-first fix: subscribe to org store before sync check in v1 importOrganization
The mirror-check task only mirrored governance stores for governance body owners (nodes with governanceBodyId/mainGoveranceBodyId in meta). Subscriber nodes subscribe to governance stores via GOVERNANCE_BODY_ID in config but the mirror task had no code path to mirror them. Read GOVERNANCE_BODY_ID from config when meta entries are absent, mirror the main governance store, then resolve and mirror its version store via the data model version key.
…ware The global middleware asserted wallet sync status on every non-health request, including read-only GETs. When the wallet transiently desyncs while processing new datalayer block confirmations, all GET endpoints return 400 even though they only read from the local database. Skip assertWalletIsAvailable() for GET requests, matching the existing pattern where the home org sync check already allows GETs through. Write operations still require the wallet to be synced.
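The shape of the change, in Express middleware style; assertWalletIsAvailable is a stand-in for the real assertion helper and the 400 response format is illustrative:

```javascript
// Hedged sketch: skip the wallet-sync assertion for read-only GETs.
function walletSyncGuard(assertWalletIsAvailable) {
  return async (req, res, next) => {
    try {
      // GETs serve from the local database, so a transient wallet desync
      // should not turn them into 400s. Writes still require a synced wallet.
      if (req.method !== 'GET') {
        await assertWalletIsAvailable();
      }
      next();
    } catch (error) {
      res.status(400).json({ error: error.message });
    }
  };
}
```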
…-on-get fix: skip wallet availability check for GET requests in global middleware
Replace manual subscribe/import flow in test-v2-remote-sync with automatic governance-based discovery. Add faucet funding, mirror creation validation, and mirror cleanup. Create new parallel test-v1-remote-sync job with identical structure using V1 config, governance, and data validation against the V1 participant instance.
Mirrors created on shared governance stores during CI runs leave orphaned coins that can never be deleted (the CI wallet key is destroyed after each run). These accumulate over time and contribute to wallet sync delays in subsequent runs. Decouple inline store mirroring from the background mirror-check task: createDataLayerStore now mirrors based on DATALAYER_FILE_SERVER_URL being configured rather than AUTO_MIRROR_EXTERNAL_STORES. This lets CI create mirrors on ephemeral org stores (safe) while keeping the background task disabled so governance stores are never mirrored.
- Set AUTO_MIRROR_EXTERNAL_STORES=false in all CI jobs
- Set DATALAYER_FILE_SERVER_URL=null for sync, upgrade, and governance jobs (no mirrors at all)
- Keep DATALAYER_FILE_SERVER_URL set for v1/v2 live-api jobs (mirrors only on org-created stores, validated by existing test helpers)
- Remove mirror verify and delete steps from sync jobs
- Remove mirror delete step from v2 live-api job
Replace nested try/finally inside try/catch with a flat try/catch/finally in both V1 and V2 reclaimHome handlers. The inner finally nulled the lock token, making the outer catch release a no-op. A single finally block handles lock release on all exit paths.
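The resulting control flow is the classic flat try/catch/finally shape; reclaim and releaseOrgLock below are illustrative stand-ins for the real handler internals:

```javascript
// Hedged sketch: a single finally releases the lock on every exit path,
// so the catch no longer double-releases or depends on an inner finally.
async function reclaimHomeHandler(token, reclaim, releaseOrgLock) {
  try {
    return await reclaim();
  } catch (error) {
    return { error: error.message }; // the lock is NOT released here
  } finally {
    releaseOrgLock(token); // runs exactly once on success or error
  }
}
```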
subscribeToStoreOnDataLayer unconditionally created a mirror on every subscribe, ignoring AUTO_MIRROR_EXTERNAL_STORES. Gate the addMirror call on the config setting (defaulting to true) so users who set it to false do not get mirrors for external stores. Also update the README to clarify that both AUTO_MIRROR_EXTERNAL_STORES and DATALAYER_FILE_SERVER_URL must be configured for external store mirroring to function.
…-backoff fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```javascript
// Default AUTO_MIRROR_EXTERNAL_STORES to true if it is null or undefined
// This makes sure this runs by default even if the config param is missing
const shouldMirror = AUTO_MIRROR_EXTERNAL_STORES ?? true;
```
AUTO_MIRROR_EXTERNAL_STORES config ignored in store creation
Medium Severity
The AUTO_MIRROR_EXTERNAL_STORES config setting is no longer consulted when creating new DataLayer stores. The old code explicitly checked AUTO_MIRROR_EXTERNAL_STORES ?? true before calling addMirror; the new code checks only if (mirrorUrl), which is true whenever DATALAYER_FILE_SERVER_URL is configured. A user who sets AUTO_MIRROR_EXTERNAL_STORES = false while also having a DATALAYER_FILE_SERVER_URL configured will now get mirrors created for every new store against their stated preference. The subscription path in persistance.js correctly retained the shouldMirror check, creating an inconsistency between the two code paths.


Note
Medium Risk
Touches organization lifecycle (create/upgrade/reclaim) and wallet/coin-splitting behaviors, which can impact availability and long-running blockchain operations; changes are guarded but could introduce new edge-case deadlocks/timeouts or CI flakiness.
Overview
Organization operations are now serialized across V1/V2. Adds a process-wide org-operation-lock and wires it into V1/V2 org create, upgrade, reclaim, and startup recovery tasks, returning 409 Conflict with live operation status when another org operation is running, and surfacing lock status via /v2/organizations/creation-status.

Improves reliability of long-running blockchain workflows. Organization creation/resume now waits for sufficient spendable coins, updates live progress messages during phases, and adds retry logic for transient wallet errors during parallel store creation; datalayer store subscription waiting gains a hard timeout.

Adjusts mirroring and coin-management behavior. Mirror creation is skipped when AUTO_MIRROR_EXTERNAL_STORES is false or when DATALAYER_FILE_SERVER_URL / the mirror URL is unset; adds governance-store mirroring for subscriber nodes with a configured GOVERNANCE_BODY_ID; and coin management now uses a dust-safe coin size plus exposes a splitInProgress flag to avoid false "insufficient funds" errors during splits.

CI/build updates. Updates pkg/esbuild deps and packaging to explicitly include the sqlite3 native addon, fixes Windows sqlite path handling, adds addon verification and a binary smoke test in build.yaml, modifies test config defaults (disables auto-mirroring), and adds a new test-v1-remote-sync workflow job with faucet funding and API-level data validation.

Written by Cursor Bugbot for commit 9abd65c.