fix(sitemap-extractor): retry discovery once without proxy #242

Merged
nikitachapovskii-dev merged 4 commits into master from fix/retry-discover-sitemaps-wo-proxy
Feb 27, 2026
Contributor

@nikitachapovskii-dev commented Feb 25, 2026

The fix adds two things:

  1. A small fallback: discovery still runs through the proxy first. Only if that attempt errors, times out, or returns no sitemaps do we retry once without the proxy. If the second attempt also finds nothing, the actor fails as before.

  2. Handling for follow-up proxy-related failures: if we rerun discoverValidSitemaps without the proxy, discovery succeeds, but the rest of the crawl still runs through the proxy. As a result, sitemap responses can come back as non-XML content (likely an anti-bot HTML page), which leads to Unencoded < parsing errors. To avoid shipping a fix that is only partially useful, I decided to include handling for these failures in the same PR.
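
A minimal sketch of the flow described in the two points above, not the code from this PR. The names Discover, DiscoveryOutcome, attempt, and discoverWithFallback are hypothetical, and the sketch assumes a timeout surfaces either as a rejected promise or as an empty result:

```typescript
// Hypothetical signature: runs sitemap discovery with or without the proxy.
type Discover = (useProxy: boolean) => Promise<string[]>;

interface DiscoveryOutcome {
    sitemaps: string[];
    // True only when the no-proxy retry found sitemaps after the proxy attempt failed,
    // so the rest of the run should skip the proxy as well.
    disableProxyForRun: boolean;
}

// Runs one attempt and treats any error as "nothing found".
async function attempt(discover: Discover, useProxy: boolean): Promise<string[]> {
    try {
        return await discover(useProxy);
    } catch {
        return [];
    }
}

async function discoverWithFallback(discover: Discover): Promise<DiscoveryOutcome> {
    // First attempt always goes through the proxy.
    const viaProxy = await attempt(discover, true);
    if (viaProxy.length > 0) {
        return { sitemaps: viaProxy, disableProxyForRun: false };
    }
    // Exactly one retry without the proxy.
    const direct = await attempt(discover, false);
    return { sitemaps: direct, disableProxyForRun: direct.length > 0 };
}
```

If both attempts return nothing, the caller sees an empty sitemaps array and can fail the run the same way it did before the fallback existed.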

Closes #240

@nikitachapovskii-dev
Contributor Author

a68c3ec
This commit makes the rest of the run skip the proxy if discoverValidSitemaps succeeded only without the proxy.

this.proxyConfiguration = undefined;
}

if (!discovered && !discoveryError) {
Collaborator

Take this with a grain of salt because it's possible that I misunderstood discoverWithTimeout.

There is an edge case where the no-proxy attempt returns discovered = [], which neither activates disableProxyForRun nor triggers this check (!discovered && !discoveryError). What happens then?

Contributor Author

discoverWithTimeout can return either undefined (timeout) or string[] (including an empty array).
In the case you described (discovered = [] on the no-proxy retry), we proceed to discoveredSitemaps, which becomes an empty set, and we fail with the existing Actor.fail("No valid sitemaps were discovered...").

So this path is handled and still ends in an explicit failure.
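
To make the distinction concrete, here is a small sketch of the three outcomes a string[] | undefined result can represent. The names Discovered, describeOutcome, and isTimeout are hypothetical and not taken from the PR. The point is only that an empty array is truthy in JavaScript, so a !discovered check matches the timeout case but not the empty-result case:

```typescript
// Result shape described in the reply above: undefined on timeout, otherwise a list.
type Discovered = string[] | undefined;

// Hypothetical helper that names the three possible outcomes.
function describeOutcome(discovered: Discovered): 'timeout' | 'empty' | 'found' {
    if (discovered === undefined) return 'timeout';
    return discovered.length === 0 ? 'empty' : 'found';
}

// An empty array is truthy, so this is true only for undefined (the timeout case).
function isTimeout(discovered: Discovered): boolean {
    return !discovered;
}
```

Under this reading, discovered = [] skips the timeout branch and falls through to the later "no valid sitemaps" check, which is where the explicit failure happens.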

Collaborator

All right, it makes sense now. Thank you for explaining it.

Member

@barjin left a comment

Thank you @nikitachapovskii-dev !

I'll admit I got lost in reassigning the local variables and the overall logic flow. Can we please extract some of the logic into reusable methods?

Otherwise, the logic seems sound to me - if @ruocco-l is fine with the code (edit: it seems he got confused too 😅 ), feel free to merge even now.

@nikitachapovskii-dev
Contributor Author

nikitachapovskii-dev commented Feb 26, 2026

Thanks for the review, I really appreciate it!
@ruocco-l I've just updated the logic in an attempt to make it more readable. Please check when you have time.

The new version has been tested locally.

Collaborator

@ruocco-l left a comment

Thank you for spending time making this more readable for us 🙏 Just a nit

return {
...proxyAttempt,
disableProxyForRun: false,
} satisfies SitemapDiscoveryResult;
Collaborator

What's with the satisfies? Can't you just declare the return type when declaring the function?

Contributor Author

fair enough, replaced 😄
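
For context on the nit, a short sketch of the two styles. Only the SitemapDiscoveryResult name and the disableProxyForRun field appear in the diff; the discovered field, the ProxyAttempt type, and both function names are assumptions made for illustration. The satisfies operator requires TypeScript 4.9 or later:

```typescript
// Hypothetical shape; only the type name and disableProxyForRun come from the diff.
interface SitemapDiscoveryResult {
    discovered: string[] | undefined;
    disableProxyForRun: boolean;
}

type ProxyAttempt = { discovered: string[] | undefined };

// Style from the original snippet: the object literal is checked against the type,
// but the function's return type is still inferred from the expression.
function keepProxyWithSatisfies(proxyAttempt: ProxyAttempt) {
    return {
        ...proxyAttempt,
        disableProxyForRun: false,
    } satisfies SitemapDiscoveryResult;
}

// Style suggested in the review: the declared return type checks every return
// statement and documents the contract in the signature.
function keepProxyWithReturnType(proxyAttempt: ProxyAttempt): SitemapDiscoveryResult {
    return {
        ...proxyAttempt,
        disableProxyForRun: false,
    };
}
```

Both functions return the same value at runtime; the difference is only in where the type check is attached.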

@nikitachapovskii-dev merged commit bff9c69 into master Feb 27, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

Sitemap Extractor: Sitemap discovery fails when proxy is blocked; add fallback without proxy

3 participants