Skip to content

Conversation

@amcaplan
Copy link
Contributor

@amcaplan amcaplan commented Oct 30, 2025

Summary

Addresses shop/issues-liquid-storefronts#10783

Fixes 6,941+ production errors that were incorrectly classified as BugError when they should be user-facing AbortError.

The main issue was TLS certificate validation failures (e.g., "Hostname/IP does not match certificate's altnames") being treated as bugs rather than network connectivity issues.

Changes

1. Renamed isARetryableNetworkErrorisTransientNetworkError

  • Old name was confusing in contexts where retries had already been exhausted
  • New name accurately describes what these errors are: transient network issues
  • Added comprehensive JSDoc explaining the purpose

2. Exported function for reuse

  • Made function available to other modules (previously private)
  • Keeps error detection logic DRY

3. Enhanced network error detection

Added patterns for production errors that were being missed:

  • 'timeout' - Generic network timeouts
  • 'premature close' - Connection closed prematurely
  • 'certificate', 'cert', 'tls', 'ssl', 'altnames' - TLS/certificate errors
  • 'getaddrinfo' - DNS resolution failures

4. Updated fetchApiVersions error handling

Now checks for transient network errors and classifies them as AbortError with a user-friendly message instead of BugError.

Production Impact

Errors now properly classified as network issues (from Observe data):

  • ✅ TLS certificate validation failures (6,941 occurrences - the main issue)
  • ✅ ECONNRESET (connection resets)
  • ✅ ENOTFOUND (DNS failures)
  • ✅ ETIMEDOUT (timeouts)
  • ✅ Socket hang ups
  • ✅ Premature connection closes
  • ✅ TLS handshake failures

Test Plan

  • All existing tests pass
  • TypeScript compilation succeeds
  • Verified all 8 production error patterns are now correctly detected

🤖 Generated with Claude Code

@amcaplan amcaplan requested review from a team as code owners October 30, 2025 16:46
@github-actions
Copy link
Contributor

github-actions bot commented Oct 30, 2025

Coverage report

St.
Category Percentage Covered / Total
🟡 Statements
79.11% (+0.01% 🔼)
13517/17087
🟡 Branches
72.89% (+0.01% 🔼)
6594/9047
🟡 Functions
79.23% (+0.01% 🔼)
3480/4392
🟡 Lines
79.46% (+0.01% 🔼)
12768/16068
Show files with reduced coverage 🔻
St.
File Statements Branches Functions Lines
🟡
... / admin.ts
72.22% (-2.78% 🔻)
40% (-5.16% 🔻)
90.91%
73.58% (-2.89% 🔻)

Test suite run success

3330 tests passing in 1361 suites.

Report generated by 🧪jest coverage report action from 771d001

Copy link
Contributor

@EvilGenius13 EvilGenius13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @amcaplan!

I like the new error wording explicitly stating it's a network error.

I do want to point out that some of the changes will catch more errors than just transient. For instance, certificate errors can be permanent as well and we will retry for no reason if we only look for the word itself. However, the trade-off might be worth it if we are helping the majority of users having blips in their network.

@amcaplan
Copy link
Contributor Author

@EvilGenius13 There's definitely a conversation to be had here. Maybe there's a wider category of network errors, and a narrower category of retryable network errors. Cert errors are unlikely to just recover, unless caused by a momentary blip in system clock accuracy. But some others in the list are worth retrying. And I think in all cases, it's correct to treat these as Handled errors since there isn't much we can do.

By the way, I suspect many of these are caused by users who set up SSL certs incorrectly for custom domains.

Copy link
Contributor

@plvaldes plvaldes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these files covered by unit tests that we may want to extend with new checks?

@amcaplan amcaplan force-pushed the treat-network-errors-as-handled branch from da44b47 to 25e6449 Compare November 2, 2025 13:46
amcaplan and others added 4 commits November 2, 2025 15:51
Renames isARetryableNetworkError to isTransientNetworkError for clarity
and exports it for reuse in fetchApiVersions error handling. This fixes
6,941+ production errors that were incorrectly classified as bugs.

Changes:
- Renamed isARetryableNetworkError -> isTransientNetworkError
- Added JSDoc explaining these are transient network errors
- Exported function for reuse across the codebase
- Enhanced detection patterns to include TLS/cert errors, premature close, timeouts
- Updated fetchApiVersions to classify network errors as AbortError

Network errors now properly classified include:
- TLS certificate validation failures (main issue with 6,941 occurrences)
- Connection timeouts (ETIMEDOUT)
- Connection resets (ECONNRESET)
- DNS failures (ENOTFOUND, getaddrinfo)
- Socket disconnects and premature closes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Addresses PR review feedback about retrying permanent network errors.

Changes:
- Split isTransientNetworkError (for retry logic) from isNetworkError (for error classification)
- isTransientNetworkError: Only includes retryable errors (timeouts, DNS failures, connection issues)
- isNetworkError: Includes all network errors (transient + permanent SSL/TLS/certificate errors)
- Updated fetchApiVersions to use isNetworkError for error classification
- Eliminated code duplication by having isNetworkError call isTransientNetworkError

Impact:
- Certificate validation failures are still reported as user-facing AbortErrors (not bugs)
- But they are no longer retried unnecessarily
- All existing tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Tests verify the key behaviors from the PR:
1. Certificate/TLS/SSL errors are detected as network errors (not bugs)
2. Certificate errors are NOT retried (they're permanent)
3. Transient errors (timeouts, DNS failures) ARE retried
4. All production error patterns from the PR are covered

New test coverage:
- isTransientNetworkError: 14 transient error patterns, blank reasons
- isNetworkError: All transient errors + permanent cert/TLS/SSL errors
- Retry behavior: Verifies cert errors fail immediately without retries
- Edge cases: Non-Error objects, non-network errors

This ensures the fix for 6,941+ production certificate errors works correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Refactored the certificate error test to use Promise.all() instead of
awaiting in a loop, which is more efficient and satisfies the
no-await-in-loop eslint rule.

The test still verifies that all 4 certificate error types are not retried,
but now runs them in parallel instead of sequentially.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@amcaplan amcaplan force-pushed the treat-network-errors-as-handled branch from 25e6449 to 0342814 Compare November 2, 2025 13:57
@amcaplan
Copy link
Contributor Author

amcaplan commented Nov 2, 2025

I've updated the PR with a lengthy series of tests, as well as distinguishing between transient network errors (should be retried) vs network errors generally (which might not fix themselves but aren't a CLI bug - so should be AbortError not BugError).

@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2025

Differences in type declarations

We detected differences in the type declarations generated by Typescript for this branch compared to the baseline ('main' branch). Please, review them to ensure they are backward-compatible. Here are some important things to keep in mind:

  • Some seemingly private modules might be re-exported through public modules.
  • If the branch is behind main you might see odd diffs, rebase main into this branch.

New type declarations

We found no new type declarations in this PR

Existing type declarations

packages/cli-kit/dist/private/node/api.d.ts
@@ -11,6 +11,30 @@ type RequestOptions<T> = {
     request: () => Promise<T>;
     url: string;
 } & NetworkRetryBehaviour;
+/**
+ * Checks if an error is a transient network error that is likely to recover with retries.
+ *
+ * Use this function for retry logic. Use isNetworkError for error classification.
+ *
+ * Examples of transient errors (worth retrying):
+ * - Connection timeouts, resets, and aborts
+ * - DNS failures (enotfound, getaddrinfo, eai_again) - can be temporary
+ * - Socket disconnects and hang ups
+ * - Premature connection closes
+ */
+export declare function isTransientNetworkError(error: unknown): boolean;
+/**
+ * Checks if an error is any kind of network-related error (connection issues, timeouts, DNS failures,
+ * TLS/certificate errors, etc.) rather than an application logic error.
+ *
+ * These errors should be reported as user-facing errors (AbortError) rather than bugs (BugError),
+ * regardless of whether they are transient or permanent.
+ *
+ * Examples include:
+ * - Transient: connection timeouts, socket hang ups, temporary DNS failures
+ * - Permanent: certificate validation failures, misconfigured SSL
+ */
+export declare function isNetworkError(error: unknown): boolean;
 export declare function simpleRequestWithDebugLog<T extends {
     headers: Headers;
     status: number;

Copy link
Contributor

@EvilGenius13 EvilGenius13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks @amcaplan! 🚀

@amcaplan amcaplan added this pull request to the merge queue Nov 3, 2025
Merged via the queue into main with commit 115f677 Nov 3, 2025
30 checks passed
@amcaplan amcaplan deleted the treat-network-errors-as-handled branch November 3, 2025 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants