Skip to content

E2E Test Suite Rewrite: Page Object Model, Setup Projects, and Flakiness Elimination#487

Open
Legend101Zz wants to merge 7 commits intocaravan-bitcoin:mainfrom
Legend101Zz:fix/e2e
Open

E2E Test Suite Rewrite: Page Object Model, Setup Projects, and Flakiness Elimination#487
Legend101Zz wants to merge 7 commits intocaravan-bitcoin:mainfrom
Legend101Zz:fix/e2e

Conversation

@Legend101Zz
Copy link
Contributor

Summary

This PR is a complete rewrite of the E2E test infrastructure. It addresses two CI-blocking test failures, eliminates all sources of nondeterministic flakiness, and restructures the suite so that future contributions are guided into correct patterns by default.

The rewrite introduces the Page Object Model, Playwright Setup Projects with explicit dependency management, custom fixtures for dependency injection, precondition validators as circuit breakers, and a strict separation between infrastructure setup and behavioral assertions.

Zero waitForTimeout calls remain in the codebase. Every wait is now condition-based.


The Problem

Two tests fail consistently in CI while passing locally:

Failure 102-wallet_regtest.spec.ts: Address extraction returns an empty string.

Expected pattern: /^2[MN]/
Received string:  ""

Failure 203-transaction_flow.spec.ts: Download button times out.

TimeoutError: page.waitForSelector: Timeout 10000ms exceeded.
waiting for locator('button:has-text("Download Unsigned PSBT")') to be visible

These are not isolated bugs. They are symptoms of structural problems that would keep producing new failures as the test suite grows.


Root Cause Analysis

Failure 1: Empty Address Extraction

The failure chain:

flowchart TD
    A["extractReceiveTableData calls<br/>waitForSelector('table tbody tr')"] --> B["Waits for ROW to exist in DOM"]
    B --> C["React renders empty skeleton row"]
    C --> D["waitForSelector returns: row exists"]
    D --> E["page.evaluate reads cells[4].textContent"]
    E --> F["Cell is empty: returns ''"]
    F --> G["getCurrentReceiveAddress returns ''"]
    G --> H["Test uses '' as a real address"]
    H --> I["expect('').toMatch(/^2[MN]/) -- FAIL"]

    style A fill:#f5f5f5,stroke:#333
    style F fill:#ffcccc,stroke:#cc0000
    style I fill:#ffcccc,stroke:#cc0000
Loading

The core issue is that waitForSelector("table tbody tr") checks for DOM structure, not data content. React frequently renders table rows before populating them with data. In CI, where container resources are shared and slower, the gap between "row exists" and "row has data" is wider than on local machines.

Additionally, there is an off-by-one bug in the address collection loop. The condition if (i < 4) is always true when i ranges from 0 to 3, causing the loop to click "Next Address" four times instead of three, navigating past the intended range.

Failure 2: Download Button Timeout

The failure chain:

flowchart TD
    A["page.waitForEvent('download') registered"] --> B["10-second timer starts counting"]
    B --> C["page.waitForSelector('Download button')"]
    C --> D["PSBT construction takes 6+ seconds in CI"]
    D --> E["Button appears at second 7"]
    E --> F["Only 3 seconds remain on download timer"]
    F --> G["Click triggers download"]
    G --> H["Download takes > 3 seconds"]
    H --> I["Timer expires -- TIMEOUT"]

    style A fill:#ffeecc,stroke:#cc8800
    style B fill:#ffeecc,stroke:#cc8800
    style I fill:#ffcccc,stroke:#cc0000
Loading

The download event listener is registered before confirming the button exists. The listener's internal timeout starts immediately, but the button may not appear for several seconds while the PSBT is being constructed. By the time the button appears and the click happens, most of the timeout budget has been consumed.

The correct pattern is: confirm button exists, then register listener, then click. This ensures the full timeout is available for the actual download.

Structural Issues

Beyond the two failures, our current e2e system had these systemic problems:

Issue Severity Instances
waitForTimeout used as readiness signal Critical 8+ locations
No validation of extracted data before use Critical All table extractors
Sequential state coupling with no recovery High Tests 01, 02, 03
Download event listener race conditions High Tests 01, 03
Duplicated wallet import logic Medium Tests 02, 03 beforeEach
try/catch wrappers destroying stack traces Medium All test bodies
God object page abstraction Medium Single WalletPage class
No CI retry strategy for transient failures Medium Config level
Implicit test ordering via filename prefixes Low All test files

Architectural Changes

The rewrite touches every layer of the test infrastructure. This diagram shows the before and after:

flowchart LR
    subgraph BEFORE["Before: Flat structure"]
        direction TB
        T1["01-multisig_setup.spec.ts"]
        T2["02-wallet_regtest.spec.ts"]
        T3["03-transaction_flow.spec.ts"]
        TH["testhelpers/"]
        UT["utils/"]

        T1 -->|"implicit ordering"| T2
        T2 -->|"implicit ordering"| T3
        T1 --> TH
        T2 --> TH
        T3 --> TH
        T1 --> UT
        T2 --> UT
        T3 --> UT
    end

    subgraph AFTER["After: Layered architecture"]
        direction TB
        subgraph Projects
            SM["smoke.spec.ts"]
            WS["wallet.setup.ts"]
            WD["wallet-display.spec.ts"]
            TF["transaction-flow.spec.ts"]
        end
        subgraph PageObjects["pages/"]
            PO1["WalletSetupPage"]
            PO2["WalletImportPage"]
            PO3["WalletNavigation"]
            PO4["ReceiveTab / SendTab / SignTab"]
        end
        subgraph Infrastructure["fixtures/ + services/ + state/"]
            FX["caravan.fixture.ts"]
            PC["preconditions.ts"]
            SV["bitcoinClient / bitcoinServices"]
            ST["testState"]
        end

        WS -->|"Playwright dependencies"| WD
        WS -->|"Playwright dependencies"| TF
        WD --> PO2
        WD --> PO3
        TF --> PO2
        TF --> PO4
        PO2 --> FX
        FX --> SV
        FX --> ST
        WD --> PC
        TF --> PC
    end
Loading

Key decisions:

  1. Setup Projects replace implicit filename ordering with Playwright-native dependency management.
  2. Page Object Model replaces raw selectors in test files with encapsulated UI interaction classes.
  3. Custom Fixtures replace manual object creation with Playwright-managed dependency injection.
  4. Precondition Validators replace silent cascading failures with immediate, descriptive circuit breakers.
  5. WalletPage split into three focused classes (Setup, Import, Navigation) eliminates the god object.

Directory Structure

e2e/
  pages/                          Page Object Models
    HomePage.ts                   Landing page interactions
    WalletSetupPage.ts            Wallet creation: xpub entry, download config
    WalletImportPage.ts           Config upload, password, address import
    WalletNavigation.ts           Tabs, refresh, balance (shared chrome)
    ReceiveTab.ts                 Address display, navigation, collection
    AddressesTab.ts               Address table with balance data
    SendTab.ts                    Transaction form, UTXO selection, PSBT download
    SignTab.ts                    Signature import, broadcast

  fixtures/
    caravan.fixture.ts            Extended Playwright test with typed fixtures
    preconditions.ts              Circuit breakers for cross-project dependencies

  services/                       Bitcoin RPC (no browser interaction)
    bitcoinClient.ts              Client factory and config
    bitcoinServices.ts            BitcoinCoreService class
    descriptors.ts                Wallet descriptor extraction
    psbtHelpers.ts                PSBT signing via RPC

  state/                          Shared state (file-based IPC)
    types.ts                      All type definitions
    testState.ts                  TestStateManager class
    testStateSetup.ts             State file creation

  setup/                          Global lifecycle
    globalSetup.ts                Docker containers, Bitcoin Core, wallet creation
    globalTeardown.ts             Cleanup
    testFixtures.ts               createTestWalletsAndFund

  tests/
    wallet.setup.ts               SETUP: create wallet, modify config, fund addresses
    smoke.spec.ts                 SMOKE: homepage, navigation, connection tests
    wallet-display.spec.ts        BEHAVIORAL: address format, balance verification
    transaction-flow.spec.ts      BEHAVIORAL: tx creation, signing, broadcast

  playwright.config.ts
  CONTRIBUTING.md

The separation between directories enforces a rule: pages/ files import from state/types.ts but never from services/. tests/ files import from fixtures/ but never directly from pages/. services/ files never import from pages/. These boundaries prevent the kind of entanglement that made the old structure difficult to modify.


Page Object Model

The Page Object Model encapsulates three concerns into each class: selector knowledge, wait conditions, and data validation. Tests interact with page objects rather than raw selectors.

flowchart TD
    subgraph Test["Test File (WHAT to verify)"]
        T1["await receiveTab.collectAddresses(4)"]
        T2["await sendTab.downloadUnsignedPsbt(path)"]
        T3["await signTab.broadcastAndVerify()"]
    end

    subgraph PageObject["Page Object (HOW to interact)"]
        direction TB
        P1["Selectors:<br/>button:has-text('Next Address')"]
        P2["Wait Contract:<br/>waitForTableDataReady(cellIndex)"]
        P3["Validation:<br/>address must match /^(2[MN]|bcrt1)/"]
    end

    T1 --> PageObject
    T2 --> PageObject
    T3 --> PageObject

    style Test fill:#e8f5e9,stroke:#2e7d32
    style PageObject fill:#e3f2fd,stroke:#1565c0
Loading

The critical concept is the wait contract: every public method in a page object guarantees that the UI is in a stable state when it returns. A contributor calling receiveTab.getCurrentAddress() cannot receive an empty string, because the method internally polls until the cell has content and validates the format before returning.

The WalletPage Split

The old WalletPage handled seven distinct responsibilities. The new structure separates them into focused classes:

flowchart LR
    subgraph Old["Old: WalletPage (god object)"]
        R1["Private client setup"]
        R2["Connection testing"]
        R3["Xpub entry"]
        R4["Config download"]
        R5["Config import"]
        R6["Address import"]
        R7["Tab navigation"]
        R8["Balance check"]
    end

    subgraph New["New: Three focused classes"]
        subgraph WS["WalletSetupPage"]
            N1["Private client setup"]
            N2["Connection testing"]
            N3["Xpub entry"]
            N4["Config download"]
        end
        subgraph WI["WalletImportPage"]
            N5["Config import"]
            N6["Address import"]
        end
        subgraph WN["WalletNavigation"]
            N7["Tab navigation"]
            N8["Balance check"]
        end
    end

    Old --> New
Loading

Each class maps to a distinct UI surface. A contributor writing transaction tests only needs WalletImportPage and WalletNavigation. They never encounter the xpub entry selectors from WalletSetupPage.


Setup Projects and Dependency Management

The old suite used numeric filename prefixes (01-, 02-, 03-) to imply execution order. This ordering was implicit and fragile. If a file was renamed or a new file was added between existing ones, the chain could break silently.

The new configuration uses Playwright's native project dependency system:

flowchart TD
    GS["globalSetup<br/>(Docker, Bitcoin Core, wallets)"]
    GS --> SMOKE["smoke project<br/>smoke.spec.ts"]
    GS --> SETUP["wallet-setup project<br/>wallet.setup.ts"]
    SETUP -->|"dependencies: ['wallet-setup']"| TESTS["wallet-tests project<br/>wallet-display.spec.ts<br/>transaction-flow.spec.ts"]
    TESTS --> GT["globalTeardown<br/>(cleanup)"]

    SMOKE --> GT

    style GS fill:#fff3e0,stroke:#e65100
    style SETUP fill:#e3f2fd,stroke:#1565c0
    style TESTS fill:#e8f5e9,stroke:#2e7d32
    style GT fill:#fff3e0,stroke:#e65100
Loading

The wallet-tests project declares dependencies: ["wallet-setup"] in the Playwright config. Playwright enforces this at the runner level: if any test in wallet-setup fails, all tests in wallet-tests are skipped with a "dependency failed" message in the HTML report. This is more robust than the preconditions system alone because it operates at the project level, before any test code runs.

The three phases serve different purposes:

Phase Project Purpose Can Run in Parallel
Smoke smoke Verify the app loads, connections work Yes, independent
Setup wallet-setup Create wallet, modify config, fund addresses Sequential internally
Tests wallet-tests Verify UI displays correct data, transactions work Potentially parallelizable

The smoke project has no dependencies. It can run concurrently with wallet-setup because it does not need any wallet state. This is the first step toward parallel execution.

Within wallet-tests, the behavioral tests (wallet-display.spec.ts and transaction-flow.spec.ts) are currently run sequentially because they share a funded wallet. However, each test within those files is self-contained through beforeEach wallet re-import, which means future work could isolate them further with per-test wallet prefixes.


Fixture System

Fixtures replace manual object creation with Playwright-managed dependency injection. The difference:

flowchart LR
    subgraph Without["Without Fixtures"]
        direction TB
        I1["import { WalletImportPage } from '../pages/...'"]
        I2["import bitcoinClient from '../services/...'"]
        C1["const walletImport = new WalletImportPage(page)"]
        C2["const client = bitcoinClient()"]
        I1 --> C1
        I2 --> C2
    end

    subgraph With["With Fixtures"]
        direction TB
        I3["import { test } from '../fixtures/caravan.fixture'"]
        I3 --> U1["test('...', async ({ walletImport, btcClient }) => { })"]
    end

    style Without fill:#ffebee,stroke:#c62828
    style With fill:#e8f5e9,stroke:#2e7d32
Loading

Fixture benefits:

  1. Typed autocomplete. Contributors discover available fixtures through IDE suggestions on the destructured parameter object.
  2. Lifecycle management. Playwright creates fixtures before each test and can run teardown logic after. This is the foundation for future per-test wallet isolation.
  3. Single import. Test files import test and expect from one location rather than from multiple modules.
  4. Bitcoin RPC as a fixture. The btcClient fixture makes the service dependency explicit. Every test that uses Bitcoin Core RPC declares it in its parameter list, making the dependency visible in the function signature.

Precondition Validators

Preconditions are circuit breakers that fail immediately with descriptive messages when upstream dependencies are missing.

flowchart TD
    subgraph Without["Without Preconditions"]
        S1["wallet.setup.ts fails at step 2"]
        S2["wallet-display.spec.ts starts"]
        S3["Imports wallet config"]
        S4["Navigates to Receive tab"]
        S5["Extracts addresses"]
        S6["Gets empty data"]
        S7["FAILS with: expect('').toMatch(/^2[MN]/)"]
        S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7
        T1["Elapsed: 2+ minutes"]
        S7 --- T1
    end

    subgraph With["With Preconditions"]
        P1["wallet.setup.ts fails at step 2"]
        P2["wallet-display.spec.ts starts"]
        P3["assertModifiedWalletConfig()"]
        P4["FAILS with:<br/>PRECONDITION FAILED: Wallet network is 'testnet',<br/>expected 'regtest'. Depends on wallet.setup.ts"]
        P1 --> P2 --> P3 --> P4
        T2["Elapsed: < 1 second"]
        P4 --- T2
    end

    style S7 fill:#ffcccc,stroke:#cc0000
    style P4 fill:#fff3e0,stroke:#e65100
    style T1 fill:#ffcccc,stroke:#cc0000
    style T2 fill:#e8f5e9,stroke:#2e7d32
Loading

Four preconditions are provided:

Function What It Validates Used By
assertWalletFileDownloaded File exists, is valid JSON, has xpubs wallet-display, transaction-flow
assertModifiedWalletConfig Network is regtest, all xfps are 8 chars wallet-display, transaction-flow
assertWalletAddressesCollected At least 4 addresses saved in state wallet-display
assertUnsignedPsbtExists PSBT file exists and has content Future fee-bumping tests

Preconditions work as defense-in-depth alongside Playwright's native dependencies. The dependency system prevents tests from running if the setup project failed entirely. Preconditions catch partial failures where the setup project ran but did not complete all steps.


Flakiness Elimination

Every waitForTimeout has been replaced with a condition-based wait. The following table documents each replacement:

Location Old Code Problem New Code
02 beforeEach waitForTimeout(1000) No purpose Removed entirely
02 after confirm click waitForTimeout(1000) Waiting for wallet to load expect(locator).toBeVisible({ timeout: 15000 }) in WalletImportPage.importConfig
02 before table extraction waitForTimeout(2000) Waiting for data to render waitForTableDataReady in ReceiveTab
03 after filling amount waitForTimeout(2000) Waiting for auto coin selection expect(previewBtn).toBeEnabled() in SendTab.previewTransaction
03 after adding signatures waitForTimeout(5000) Waiting for signature processing expect(broadcastBtn).toBeEnabled() in SignTab.broadcastAndVerify
03 after broadcast waitForTimeout(1000) Waiting for success message expect(successMessage).toBeVisible() in SignTab.broadcastAndVerify
selectUTXOs loop waitForTimeout(500) x2 per row Waiting for checkbox to be interactive expect(checkbox).toBeVisible() in SendTab.selectUTXOs

The key fix is waitForTableDataReady in ReceiveTab, which solves the fundamental problem of DOM structure existing before data content:

flowchart TD
    subgraph Old["Old: waitForSelector"]
        O1["waitForSelector('table tbody tr')"]
        O2["Row exists in DOM"]
        O3["page.evaluate reads cells"]
        O4["cells may be empty"]
        O1 --> O2 --> O3 --> O4
    end

    subgraph New["New: waitForTableDataReady"]
        N1["waitForSelector('table tbody tr')"]
        N2["Row exists in DOM"]
        N3["waitForFunction: poll cell content"]
        N4["Cell has text AND length > 0"]
        N5["page.evaluate reads cells"]
        N6["Cells guaranteed to have data"]
        N1 --> N2 --> N3 --> N4 --> N5 --> N6
    end

    style O4 fill:#ffcccc,stroke:#cc0000
    style N6 fill:#e8f5e9,stroke:#2e7d32
Loading

The CI configuration also adds retries: process.env.CI ? 1 : 0. This provides one retry for transient Docker and RPC flakiness in CI while keeping local runs strict with zero retries. A retry is not hiding bugs; it acknowledges that E2E tests involving Docker containers, Bitcoin Core RPC, and browser automation have more moving parts than unit tests, and infrastructure-level transients are a different category from application bugs.


Arrange-Act-Assert Discipline

The old test 02 was a single 100-line test function that performed wallet import, address collection, funding, tab switching, data extraction, balance assertion, block mining, refresh, and final balance verification. When it failed, the error message did not indicate which behavior was broken.

The rewrite separates these into focused tests, each following the Arrange-Act-Assert pattern:

flowchart TD
    subgraph Old["Old: test 02 (single monolithic test)"]
        direction TB
        A1["Import wallet config"] --> A2["Navigate to Receive tab"]
        A2 --> A3["Collect 4 addresses"]
        A3 --> A4["Fund each address"]
        A4 --> A5["Wait for path update"]
        A5 --> A6["Switch to Transactions tab"]
        A6 --> A7["Refresh"]
        A7 --> A8["Switch to Addresses tab"]
        A8 --> A9["Extract address table"]
        A9 --> A10["Assert address formats"]
        A10 --> A11["Assert total balance"]
        A11 --> A12["Mine confirmation blocks"]
        A12 --> A13["Refresh"]
        A13 --> A14["Assert confirmed balance"]
    end

    subgraph New["New: wallet.setup.ts + wallet-display.spec.ts"]
        direction TB
        subgraph Setup["wallet.setup.ts (infrastructure)"]
            S1["Import wallet"]
            S2["Collect addresses"]
            S3["Fund addresses"]
            S4["Mine blocks"]
            S5["Save state"]
            S1 --> S2 --> S3 --> S4 --> S5
        end
        subgraph Tests["wallet-display.spec.ts (behavioral)"]
            T1["Test: addresses show correct format<br/>Arrange: import wallet<br/>Act: switch to Receive, read address<br/>Assert: matches /^2[MN]/"]
            T2["Test: path index advances after deposits<br/>Arrange: import wallet<br/>Act: switch to Receive, poll path suffix<br/>Assert: index equals '4'"]
            T3["Test: funded addresses show correct balance<br/>Arrange: import wallet<br/>Act: switch to Addresses, extract table<br/>Assert: total balance equals 8"]
            T4["Test: balance display shows confirmed total<br/>Arrange: import wallet<br/>Act: refresh<br/>Assert: display shows '8 BTC'"]
        end
        Setup --> Tests
    end

    style Old fill:#ffebee,stroke:#c62828
    style Setup fill:#e3f2fd,stroke:#1565c0
    style Tests fill:#e8f5e9,stroke:#2e7d32
Loading

When a focused test fails, the test name itself tells you what broke. "addresses show correct format" failing means the address rendering is wrong. "balance display shows confirmed total" failing means the balance calculation or display is wrong. There is no need to read through 100 lines of test code to locate the failure point.


How to Run

No changes to the test commands:

# Full suite (all three projects in dependency order)
npm run test:e2e

# With browser visible
npm run test:e2e:headed

# Interactive UI mode
npm run test:e2e:ui

# Run only smoke tests
npx playwright test --config=e2e/playwright.config.ts --project=smoke

# Run only behavioral tests (will trigger wallet-setup first)
npx playwright test --config=e2e/playwright.config.ts --project=wallet-tests

Future Work

These improvements are intentionally deferred from this PR to keep the scope focused:

  1. data-testid attributes. Adding data-testid="{scope}-{name}-{type}" to React components that E2E tests interact with. This decouples tests from button text and CSS structure. The CONTRIBUTING.md documents the naming convention for when contributors modify those components.

  2. Per-test wallet isolation. The btcClient fixture opens the door to creating isolated wallet namespaces per test, eliminating the remaining shared state between behavioral tests.

  3. Test data factories. A createFundedMultisigWallet(options) factory that encapsulates the entire wallet creation, funding, and state setup flow. This would let new tests create custom environments without understanding the setup chain.

  4. API-level seeding for address verification. The address collection step in wallet.setup.ts currently navigates the Caravan UI. Since addresses are deterministic from the xpubs, they could be derived directly from the wallet config, using the browser only for verification.


cc : @bucko13

@changeset-bot
Copy link

changeset-bot bot commented Feb 23, 2026

⚠️ No Changeset found

Latest commit: e79c550

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link

vercel bot commented Feb 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
caravan-coordinator Ready Ready Preview, Comment Feb 24, 2026 3:42pm

Request Review


async goto() {
await this.page.goto("/");
await this.page.waitForLoadState("networkidle");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me for the most part. there is one instance of a page.waitforloadstate("networkidle") on the HomePage page object which could cause an issue if there is polling. other than that pages look good and tests are structured under a spec per suite type (ie: smoke ) with simple tests per type so I don't see a need for test.steps

* Does NOT wait for tab content — individual tab page objects
* handle their own content readiness in their methods.
*/
async switchToTab(tab: "Receive" | "Send" | "Transactions" | "Addresses") {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nit that could be beneficial is the switchtoTab Wallet navigation function could use constants for the tabs. Making it 1 location where tab names would need to be updated rather than per test, if they are updated

import { clientConfig } from "../services/bitcoinClient";
import fs from "fs";

test.describe("Wallet Setup", () => {
Copy link

@jgiggler jgiggler Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if these tests are based off the previous test passing this should be .serial, basing this off of the comment of WHAT IT DOES. I know serial tests are not recommended as tests should be isolated but if they are this would be helpful. https://playwright.dev/docs/test-parallel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants