E2E Test Suite Rewrite: Page Object Model, Setup Projects, and Flakiness Elimination#487
E2E Test Suite Rewrite: Page Object Model, Setup Projects, and Flakiness Elimination#487Legend101Zz wants to merge 7 commits intocaravan-bitcoin:mainfrom
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
||
| async goto() { | ||
| await this.page.goto("/"); | ||
| await this.page.waitForLoadState("networkidle"); |
There was a problem hiding this comment.
this looks good to me for the most part. there is one instance of a page.waitforloadstate("networkidle") on the HomePage page object which could cause an issue if there is polling. other than that pages look good and tests are structured under a spec per suite type (ie: smoke ) with simple tests per type so I don't see a need for test.steps
| * Does NOT wait for tab content — individual tab page objects | ||
| * handle their own content readiness in their methods. | ||
| */ | ||
| async switchToTab(tab: "Receive" | "Send" | "Transactions" | "Addresses") { |
There was a problem hiding this comment.
a nit that could be beneficial is the switchtoTab Wallet navigation function could use constants for the tabs. Making it 1 location where tab names would need to be updated rather than per test, if they are updated
| import { clientConfig } from "../services/bitcoinClient"; | ||
| import fs from "fs"; | ||
|
|
||
| test.describe("Wallet Setup", () => { |
There was a problem hiding this comment.
if these tests are based off the previous test passing this should be .serial, basing this off of the comment of WHAT IT DOES. I know serial tests are not recommended as tests should be isolated but if they are this would be helpful. https://playwright.dev/docs/test-parallel
Summary
This PR is a complete rewrite of the E2E test infrastructure. It addresses two CI-blocking test failures, eliminates all sources of nondeterministic flakiness, and restructures the suite so that future contributions are guided into correct patterns by default.
The rewrite introduces the Page Object Model, Playwright Setup Projects with explicit dependency management, custom fixtures for dependency injection, precondition validators as circuit breakers, and a strict separation between infrastructure setup and behavioral assertions.
Zero
waitForTimeoutcalls remain in the codebase. Every wait is now condition-based.The Problem
Two tests fail consistently in CI while passing locally:
Failure 1 —
02-wallet_regtest.spec.ts: Address extraction returns an empty string.Failure 2 —
03-transaction_flow.spec.ts: Download button times out.These are not isolated bugs. They are symptoms of structural problems that would keep producing new failures as the test suite grows.
Root Cause Analysis
Failure 1: Empty Address Extraction
The failure chain:
flowchart TD A["extractReceiveTableData calls<br/>waitForSelector('table tbody tr')"] --> B["Waits for ROW to exist in DOM"] B --> C["React renders empty skeleton row"] C --> D["waitForSelector returns: row exists"] D --> E["page.evaluate reads cells[4].textContent"] E --> F["Cell is empty: returns ''"] F --> G["getCurrentReceiveAddress returns ''"] G --> H["Test uses '' as a real address"] H --> I["expect('').toMatch(/^2[MN]/) -- FAIL"] style A fill:#f5f5f5,stroke:#333 style F fill:#ffcccc,stroke:#cc0000 style I fill:#ffcccc,stroke:#cc0000The core issue is that
waitForSelector("table tbody tr")checks for DOM structure, not data content. React frequently renders table rows before populating them with data. In CI, where container resources are shared and slower, the gap between "row exists" and "row has data" is wider than on local machines.Additionally, there is an off-by-one bug in the address collection loop. The condition
if (i < 4)is always true wheniranges from 0 to 3, causing the loop to click "Next Address" four times instead of three, navigating past the intended range.Failure 2: Download Button Timeout
The failure chain:
flowchart TD A["page.waitForEvent('download') registered"] --> B["10-second timer starts counting"] B --> C["page.waitForSelector('Download button')"] C --> D["PSBT construction takes 6+ seconds in CI"] D --> E["Button appears at second 7"] E --> F["Only 3 seconds remain on download timer"] F --> G["Click triggers download"] G --> H["Download takes > 3 seconds"] H --> I["Timer expires -- TIMEOUT"] style A fill:#ffeecc,stroke:#cc8800 style B fill:#ffeecc,stroke:#cc8800 style I fill:#ffcccc,stroke:#cc0000The download event listener is registered before confirming the button exists. The listener's internal timeout starts immediately, but the button may not appear for several seconds while the PSBT is being constructed. By the time the button appears and the click happens, most of the timeout budget has been consumed.
The correct pattern is: confirm button exists, then register listener, then click. This ensures the full timeout is available for the actual download.
Structural Issues
Beyond the two failures, our current e2e system had these systemic problems:
waitForTimeoutused as readiness signalArchitectural Changes
The rewrite touches every layer of the test infrastructure. This diagram shows the before and after:
flowchart LR subgraph BEFORE["Before: Flat structure"] direction TB T1["01-multisig_setup.spec.ts"] T2["02-wallet_regtest.spec.ts"] T3["03-transaction_flow.spec.ts"] TH["testhelpers/"] UT["utils/"] T1 -->|"implicit ordering"| T2 T2 -->|"implicit ordering"| T3 T1 --> TH T2 --> TH T3 --> TH T1 --> UT T2 --> UT T3 --> UT end subgraph AFTER["After: Layered architecture"] direction TB subgraph Projects SM["smoke.spec.ts"] WS["wallet.setup.ts"] WD["wallet-display.spec.ts"] TF["transaction-flow.spec.ts"] end subgraph PageObjects["pages/"] PO1["WalletSetupPage"] PO2["WalletImportPage"] PO3["WalletNavigation"] PO4["ReceiveTab / SendTab / SignTab"] end subgraph Infrastructure["fixtures/ + services/ + state/"] FX["caravan.fixture.ts"] PC["preconditions.ts"] SV["bitcoinClient / bitcoinServices"] ST["testState"] end WS -->|"Playwright dependencies"| WD WS -->|"Playwright dependencies"| TF WD --> PO2 WD --> PO3 TF --> PO2 TF --> PO4 PO2 --> FX FX --> SV FX --> ST WD --> PC TF --> PC endKey decisions:
Directory Structure
The separation between directories enforces a rule:
pages/files import fromstate/types.tsbut never fromservices/.tests/files import fromfixtures/but never directly frompages/.services/files never import frompages/. These boundaries prevent the kind of entanglement that made the old structure difficult to modify.Page Object Model
The Page Object Model encapsulates three concerns into each class: selector knowledge, wait conditions, and data validation. Tests interact with page objects rather than raw selectors.
flowchart TD subgraph Test["Test File (WHAT to verify)"] T1["await receiveTab.collectAddresses(4)"] T2["await sendTab.downloadUnsignedPsbt(path)"] T3["await signTab.broadcastAndVerify()"] end subgraph PageObject["Page Object (HOW to interact)"] direction TB P1["Selectors:<br/>button:has-text('Next Address')"] P2["Wait Contract:<br/>waitForTableDataReady(cellIndex)"] P3["Validation:<br/>address must match /^(2[MN]|bcrt1)/"] end T1 --> PageObject T2 --> PageObject T3 --> PageObject style Test fill:#e8f5e9,stroke:#2e7d32 style PageObject fill:#e3f2fd,stroke:#1565c0The critical concept is the wait contract: every public method in a page object guarantees that the UI is in a stable state when it returns. A contributor calling
receiveTab.getCurrentAddress()cannot receive an empty string, because the method internally polls until the cell has content and validates the format before returning.The WalletPage Split
The old
WalletPagehandled seven distinct responsibilities. The new structure separates them into focused classes:flowchart LR subgraph Old["Old: WalletPage (god object)"] R1["Private client setup"] R2["Connection testing"] R3["Xpub entry"] R4["Config download"] R5["Config import"] R6["Address import"] R7["Tab navigation"] R8["Balance check"] end subgraph New["New: Three focused classes"] subgraph WS["WalletSetupPage"] N1["Private client setup"] N2["Connection testing"] N3["Xpub entry"] N4["Config download"] end subgraph WI["WalletImportPage"] N5["Config import"] N6["Address import"] end subgraph WN["WalletNavigation"] N7["Tab navigation"] N8["Balance check"] end end Old --> NewEach class maps to a distinct UI surface. A contributor writing transaction tests only needs
WalletImportPageandWalletNavigation. They never encounter the xpub entry selectors fromWalletSetupPage.Setup Projects and Dependency Management
The old suite used numeric filename prefixes (
01-,02-,03-) to imply execution order. This ordering was implicit and fragile. If a file was renamed or a new file was added between existing ones, the chain could break silently.The new configuration uses Playwright's native project dependency system:
flowchart TD GS["globalSetup<br/>(Docker, Bitcoin Core, wallets)"] GS --> SMOKE["smoke project<br/>smoke.spec.ts"] GS --> SETUP["wallet-setup project<br/>wallet.setup.ts"] SETUP -->|"dependencies: ['wallet-setup']"| TESTS["wallet-tests project<br/>wallet-display.spec.ts<br/>transaction-flow.spec.ts"] TESTS --> GT["globalTeardown<br/>(cleanup)"] SMOKE --> GT style GS fill:#fff3e0,stroke:#e65100 style SETUP fill:#e3f2fd,stroke:#1565c0 style TESTS fill:#e8f5e9,stroke:#2e7d32 style GT fill:#fff3e0,stroke:#e65100The
wallet-testsproject declaresdependencies: ["wallet-setup"]in the Playwright config. Playwright enforces this at the runner level: if any test inwallet-setupfails, all tests inwallet-testsare skipped with a "dependency failed" message in the HTML report. This is more robust than the preconditions system alone because it operates at the project level, before any test code runs.The three phases serve different purposes:
smokewallet-setupwallet-testsThe
smokeproject has no dependencies. It can run concurrently withwallet-setupbecause it does not need any wallet state. This is the first step toward parallel execution.Within
wallet-tests, the behavioral tests (wallet-display.spec.tsandtransaction-flow.spec.ts) are currently run sequentially because they share a funded wallet. However, each test within those files is self-contained throughbeforeEachwallet re-import, which means future work could isolate them further with per-test wallet prefixes.Fixture System
Fixtures replace manual object creation with Playwright-managed dependency injection. The difference:
flowchart LR subgraph Without["Without Fixtures"] direction TB I1["import { WalletImportPage } from '../pages/...'"] I2["import bitcoinClient from '../services/...'"] C1["const walletImport = new WalletImportPage(page)"] C2["const client = bitcoinClient()"] I1 --> C1 I2 --> C2 end subgraph With["With Fixtures"] direction TB I3["import { test } from '../fixtures/caravan.fixture'"] I3 --> U1["test('...', async ({ walletImport, btcClient }) => { })"] end style Without fill:#ffebee,stroke:#c62828 style With fill:#e8f5e9,stroke:#2e7d32Fixture benefits:
testandexpectfrom one location rather than from multiple modules.btcClientfixture makes the service dependency explicit. Every test that uses Bitcoin Core RPC declares it in its parameter list, making the dependency visible in the function signature.Precondition Validators
Preconditions are circuit breakers that fail immediately with descriptive messages when upstream dependencies are missing.
flowchart TD subgraph Without["Without Preconditions"] S1["wallet.setup.ts fails at step 2"] S2["wallet-display.spec.ts starts"] S3["Imports wallet config"] S4["Navigates to Receive tab"] S5["Extracts addresses"] S6["Gets empty data"] S7["FAILS with: expect('').toMatch(/^2[MN]/)"] S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 T1["Elapsed: 2+ minutes"] S7 --- T1 end subgraph With["With Preconditions"] P1["wallet.setup.ts fails at step 2"] P2["wallet-display.spec.ts starts"] P3["assertModifiedWalletConfig()"] P4["FAILS with:<br/>PRECONDITION FAILED: Wallet network is 'testnet',<br/>expected 'regtest'. Depends on wallet.setup.ts"] P1 --> P2 --> P3 --> P4 T2["Elapsed: < 1 second"] P4 --- T2 end style S7 fill:#ffcccc,stroke:#cc0000 style P4 fill:#fff3e0,stroke:#e65100 style T1 fill:#ffcccc,stroke:#cc0000 style T2 fill:#e8f5e9,stroke:#2e7d32Four preconditions are provided:
assertWalletFileDownloadedwallet-display,transaction-flowassertModifiedWalletConfigwallet-display,transaction-flowassertWalletAddressesCollectedwallet-displayassertUnsignedPsbtExistsPreconditions work as defense-in-depth alongside Playwright's native
dependencies. The dependency system prevents tests from running if the setup project failed entirely. Preconditions catch partial failures where the setup project ran but did not complete all steps.Flakiness Elimination
Every
waitForTimeouthas been replaced with a condition-based wait. The following table documents each replacement:02beforeEachwaitForTimeout(1000)02after confirm clickwaitForTimeout(1000)expect(locator).toBeVisible({ timeout: 15000 })inWalletImportPage.importConfig02before table extractionwaitForTimeout(2000)waitForTableDataReadyinReceiveTab03after filling amountwaitForTimeout(2000)expect(previewBtn).toBeEnabled()inSendTab.previewTransaction03after adding signatureswaitForTimeout(5000)expect(broadcastBtn).toBeEnabled()inSignTab.broadcastAndVerify03after broadcastwaitForTimeout(1000)expect(successMessage).toBeVisible()inSignTab.broadcastAndVerifyselectUTXOsloopwaitForTimeout(500)x2 per rowexpect(checkbox).toBeVisible()inSendTab.selectUTXOsThe key fix is
waitForTableDataReadyinReceiveTab, which solves the fundamental problem of DOM structure existing before data content:flowchart TD subgraph Old["Old: waitForSelector"] O1["waitForSelector('table tbody tr')"] O2["Row exists in DOM"] O3["page.evaluate reads cells"] O4["cells may be empty"] O1 --> O2 --> O3 --> O4 end subgraph New["New: waitForTableDataReady"] N1["waitForSelector('table tbody tr')"] N2["Row exists in DOM"] N3["waitForFunction: poll cell content"] N4["Cell has text AND length > 0"] N5["page.evaluate reads cells"] N6["Cells guaranteed to have data"] N1 --> N2 --> N3 --> N4 --> N5 --> N6 end style O4 fill:#ffcccc,stroke:#cc0000 style N6 fill:#e8f5e9,stroke:#2e7d32The CI configuration also adds
retries: process.env.CI ? 1 : 0. This provides one retry for transient Docker and RPC flakiness in CI while keeping local runs strict with zero retries. A retry is not hiding bugs; it acknowledges that E2E tests involving Docker containers, Bitcoin Core RPC, and browser automation have more moving parts than unit tests, and infrastructure-level transients are a different category from application bugs.Arrange-Act-Assert Discipline
The old test 02 was a single 100-line test function that performed wallet import, address collection, funding, tab switching, data extraction, balance assertion, block mining, refresh, and final balance verification. When it failed, the error message did not indicate which behavior was broken.
The rewrite separates these into focused tests, each following the Arrange-Act-Assert pattern:
flowchart TD subgraph Old["Old: test 02 (single monolithic test)"] direction TB A1["Import wallet config"] --> A2["Navigate to Receive tab"] A2 --> A3["Collect 4 addresses"] A3 --> A4["Fund each address"] A4 --> A5["Wait for path update"] A5 --> A6["Switch to Transactions tab"] A6 --> A7["Refresh"] A7 --> A8["Switch to Addresses tab"] A8 --> A9["Extract address table"] A9 --> A10["Assert address formats"] A10 --> A11["Assert total balance"] A11 --> A12["Mine confirmation blocks"] A12 --> A13["Refresh"] A13 --> A14["Assert confirmed balance"] end subgraph New["New: wallet.setup.ts + wallet-display.spec.ts"] direction TB subgraph Setup["wallet.setup.ts (infrastructure)"] S1["Import wallet"] S2["Collect addresses"] S3["Fund addresses"] S4["Mine blocks"] S5["Save state"] S1 --> S2 --> S3 --> S4 --> S5 end subgraph Tests["wallet-display.spec.ts (behavioral)"] T1["Test: addresses show correct format<br/>Arrange: import wallet<br/>Act: switch to Receive, read address<br/>Assert: matches /^2[MN]/"] T2["Test: path index advances after deposits<br/>Arrange: import wallet<br/>Act: switch to Receive, poll path suffix<br/>Assert: index equals '4'"] T3["Test: funded addresses show correct balance<br/>Arrange: import wallet<br/>Act: switch to Addresses, extract table<br/>Assert: total balance equals 8"] T4["Test: balance display shows confirmed total<br/>Arrange: import wallet<br/>Act: refresh<br/>Assert: display shows '8 BTC'"] end Setup --> Tests end style Old fill:#ffebee,stroke:#c62828 style Setup fill:#e3f2fd,stroke:#1565c0 style Tests fill:#e8f5e9,stroke:#2e7d32When a focused test fails, the test name itself tells you what broke. "addresses show correct format" failing means the address rendering is wrong. "balance display shows confirmed total" failing means the balance calculation or display is wrong. There is no need to read through 100 lines of test code to locate the failure point.
How to Run
No changes to the test commands:
Future Work
These improvements are intentionally deferred from this PR to keep the scope focused:
data-testid attributes. Adding
data-testid="{scope}-{name}-{type}"to React components that E2E tests interact with. This decouples tests from button text and CSS structure. The CONTRIBUTING.md documents the naming convention for when contributors modify those components.Per-test wallet isolation. The
btcClientfixture opens the door to creating isolated wallet namespaces per test, eliminating the remaining shared state between behavioral tests.Test data factories. A
createFundedMultisigWallet(options)factory that encapsulates the entire wallet creation, funding, and state setup flow. This would let new tests create custom environments without understanding the setup chain.API-level seeding for address verification. The address collection step in
wallet.setup.tscurrently navigates the Caravan UI. Since addresses are deterministic from the xpubs, they could be derived directly from the wallet config, using the browser only for verification.cc : @bucko13