fix(sitemap-extractor): retry discovery once without proxy#242
fix(sitemap-extractor): retry discovery once without proxy#242nikitachapovskii-dev merged 4 commits intomasterfrom
Conversation
|
a68c3ec |
| this.proxyConfiguration = undefined; | ||
| } | ||
|
|
||
| if (!discovered && !discoveryError) { |
There was a problem hiding this comment.
Take this with a grain of salt because it's possible that I misunderstood discoverWithTimeout.
There is an edgecase where at the attempt with no proxy we get discovered = [] which fails to activate disableProxyForRun and this check !discovered && !discoveryError. What happens then?
There was a problem hiding this comment.
discoverWithTimeout can return either undefined (timeout) or string[] (including an empty array).
In the case you described (discovered = [] on the no-proxy retry), we proceed to discoveredSitemaps; it becomes an empty set and we fail with the existing Actor.fail("No valid sitemaps were discovered...")
So this path is handled and still ends in an explicit failure
There was a problem hiding this comment.
All right it makes sense now, thank you for explaining it.
There was a problem hiding this comment.
Thank you @nikitachapovskii-dev !
I'll admit I got lost in reassigning the local variables and the overall logic flow. Can we please extract some of the logic into reusable methods?
Otherwise, the logic seems sound to me - if @ruocco-l is fine with the code (edit: it seems he got confused too 😅 ), feel free to merge even now.
|
Thanks for providing the review - really appreciate! the new version is tested on local. |
ruocco-l
left a comment
There was a problem hiding this comment.
Thank you for spending time making this more readable for us 🙏 Just a nit
| return { | ||
| ...proxyAttempt, | ||
| disableProxyForRun: false, | ||
| } satisfies SitemapDiscoveryResult; |
There was a problem hiding this comment.
What's with the satisfies? Can't you just declare the return type when declaring the function?
There was a problem hiding this comment.
fair enough, replaced 😄
The fix adds
a small fallback: discovery still runs through proxy first, and only if that attempt errors, times out, or returns no sitemaps, we retry once without proxy. If the second attempt also finds nothing, the actor fails as before.
If we rerun discoverValidSitemaps without proxy, discovery succeeds, but the rest of the crawl still runs through proxy. As a result, sitemap responses can come back as non-XML content (likely an anti-bot HTML page), which leads to Unencoded < parsing errors. To avoid shipping a fix that is only partially useful, decided to include handling for these follow-up proxy-related failures in the same PR.
Closes #240