@Mr0grog Mr0grog commented Oct 7, 2025

It turns out we've been missing some records because redirect URLs (in the 3xx response's `Location` header) do not always exactly match the requested target URL as recorded in the WARC record. This solves the problem by normalizing URLs before trying to match them.

For example, `https://www.heat.gov/` recently stopped getting recorded because it started redirecting to `https://heat.gov`, which got recorded in the WARC as `https://heat.gov/`. An empty path is exactly equivalent to `/` (note this is not the case for other trailing slashes in paths!), but because we were looking for a URL without a path, we didn't find a record, even though the correct matching record was there, just with a `/` for its path.

While working on this particular case, I noted some similar cases arising from variations on this same issue, like redundant ports (e.g. the `443` in `https://whatever.com:443/` is redundant because 443 is the default port for the `https` scheme).
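
As a rough illustration only (this is not the code from this change), the kind of normalization described above could be sketched in Python as follows. The names `normalize_url` and `DEFAULT_PORTS` are hypothetical, and the sketch only covers the two cases mentioned here: an empty path versus `/`, and a port that is the default for its scheme.

```python
from urllib.parse import urlsplit, urlunsplit

# Ports that are redundant because they are the default for the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a form of `url` suitable for equality checks (hypothetical helper).

    Limitation of this sketch: any userinfo (`user:pass@`) is dropped.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # `hostname` is already lowercased and has any port removed.
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets restored.
        host = f"[{host}]"
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    # An empty path is equivalent to "/"; other trailing slashes are
    # significant and are left untouched.
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))
```

With a helper like this, a redirect's `Location` value and a WARC record's target URL would both be normalized before comparison, so `https://heat.gov` and `https://heat.gov/` match while `https://heat.gov/a` and `https://heat.gov/a/` still do not.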
@Mr0grog Mr0grog merged commit 15bbc66 into main Oct 7, 2025
11 checks passed
@Mr0grog Mr0grog deleted the heat-gov-turned-invisible-to-us branch October 7, 2025 05:05
@github-project-automation github-project-automation bot moved this from Inbox to Done in Web Monitoring Oct 7, 2025
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-crawler that referenced this pull request Oct 7, 2025
Use this actions workflow to manually trigger a crawl's data to be re-imported into web-monitoring-db. Needed to remediate bad imports with the fixes from edgi-govdata-archiving/web-monitoring-processing#886.