
Add 429 handling, HTTP timeouts, and per-record exception isolation #5

Open
rdhyee wants to merge 2 commits into main from fix/harvest-reliability


@rdhyee rdhyee commented Mar 12, 2026

Summary

Ports the reliability fixes from Gluejar/regluit (issue #1096, PRs #1097–#1108) to doab-check. Same root causes identified during the regluit DOAB harvester investigation.

Changes

doab_oai.py — OAI harvest loop:

  • limit is not None guard before the count comparison (in Python 3, int >= None raises TypeError)
  • Per-record try/except Exception with logger.exception() — one bad record no longer kills the whole harvest
  • urllib.error.HTTPError handler for 429 — logs Retry-After header and returns gracefully
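
The three doab_oai.py changes above might fit together roughly as in the following sketch. It is not the actual harvest loop, which is not shown in this description: harvest_sketch, records, and load_record are hypothetical names, and the sketch assumes the HTTPError surfaces while iterating over the records.

```python
import logging
import urllib.error

logger = logging.getLogger(__name__)


def harvest_sketch(records, load_record, limit=None):
    """Load records until exhausted, `limit` is reached, or a 429 arrives.

    `records` (hypothetical) is any iterable of OAI records; iterating it may
    raise urllib.error.HTTPError. `load_record` (hypothetical) handles one
    record. Returns the number of records loaded successfully.
    """
    num_loaded = 0
    try:
        for record in records:
            # Guard: in Python 3, `num_loaded >= None` raises TypeError.
            if limit is not None and num_loaded >= limit:
                break
            try:
                load_record(record)
                num_loaded += 1
            except Exception:
                # Log the traceback and move on to the next record.
                logger.exception("failed to load record %r", record)
    except urllib.error.HTTPError as e:
        if e.code != 429:
            raise
        # Rate limited: log the server's hint and stop gracefully.
        logger.warning("429 from server; Retry-After: %s",
                       e.headers.get("Retry-After"))
    return num_loaded
```

With this shape, a record that raises is logged and skipped, while a 429 ends the run without a traceback and still reports how many records were loaded.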

doab_utils.py — bitstream API:

  • timeout=(5, 60) on requests.get() in get_streamdata()
  • 429 status check before calling .json() (429 returns HTML, causing misleading ValueError)
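
A minimal sketch of the two doab_utils.py changes, assuming a simplified signature. The real get_streamdata() is not shown here, so the session parameter, the None return on 429, and the function name are illustrative assumptions. The session argument stands in for a requests.Session or any object with a compatible get().

```python
import logging

logger = logging.getLogger(__name__)


def get_streamdata_sketch(session, url):
    """Fetch JSON from `url`; return None when the server rate-limits us.

    `session` is assumed to behave like requests.Session (hypothetical
    parameter, used here so the sketch can be exercised without a network).
    """
    # (connect timeout, read timeout) in seconds, as in this PR.
    response = session.get(url, timeout=(5, 60))
    if response.status_code == 429:
        # A 429 body is HTML, so calling .json() would raise a misleading
        # ValueError; check the status first.
        logger.warning("429 from %s; Retry-After: %s",
                       url, response.headers.get("Retry-After"))
        return None
    return response.json()
```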

check.py — link checker / ContentTyper:

  • timeout=(5, 60) on all 6 requests.get() calls
  • requests.exceptions.Timeout handler returning (524, '', '') (consistent with existing timeout handling)
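
The timeout handling in check.py could look roughly like the sketch below, which also covers the SSL retry path described in the second commit message. This is an illustration under assumptions, not the ContentTyper code: the function name, the http_get parameter, the meaning of the second and third tuple fields, and the (0, '', '') placeholder for non-timeout failures are all hypothetical. Only the (524, '', '') timeout result and timeout=(5, 60) come from this PR.

```python
import requests

TIMEOUT = (5, 60)  # (connect, read) seconds, as in this PR


def check_url_sketch(url, http_get=requests.get):
    """Return a 3-tuple whose first field is an HTTP-style status code.

    `http_get` is a hypothetical injection point (defaults to requests.get).
    The second field is assumed to be the content type; the third is unused.
    """
    try:
        response = http_get(url, timeout=TIMEOUT)
    except requests.exceptions.SSLError:
        # Retry without certificate verification. The retry needs its own
        # handlers, otherwise a timeout here would propagate uncaught.
        try:
            response = http_get(url, timeout=TIMEOUT, verify=False)
        except requests.exceptions.Timeout:
            return (524, '', '')
        except Exception:
            # Placeholder: the real value for other failures is not stated.
            return (0, '', '')
    except requests.exceptions.Timeout:
        return (524, '', '')
    return (response.status_code,
            response.headers.get('content-type', ''), '')
```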

Context

These are the same bugs that caused regluit's DOAB harvester to miss hundreds of books over several months. doab-check's codebase shares the same patterns and was vulnerable to the same failures.

Closes #2, closes #3, closes #4.

Ports reliability fixes from Gluejar/regluit (PRs #1097, #1099, #1101,
#1106, #1107, #1108) to doab-check. Same root causes, same fixes:

- HTTP 429 rate-limit handling in OAI harvest loop (urllib.error.HTTPError)
  and in get_streamdata (response.status_code check before .json())
- timeout=(5, 60) on all requests.get() calls (doab_utils, check.py)
- Per-record try/except with logger.exception() in load_doab_oai
- limit=None guard (num_doabs >= None is TypeError in Python 3)
- Timeout exception handler in ContentTyper (returns 524 tuple)

Closes #2, closes #3, closes #4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When an SSL error triggered a retry with verify=False, a timeout on
that retry would propagate uncaught instead of returning (524, '', '').
Wrap the SSL fallback in its own try/except to handle both Timeout
and other exceptions gracefully.

Found by Codex code review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rdhyee rdhyee moved this to In Review in Unglue.it Modernization Mar 18, 2026
@rdhyee rdhyee added the ry:approved (Raymond has reviewed, understands, and approves) and needs-eric-review (On agenda for Eric's review/decision) labels Mar 19, 2026

rdhyee commented Mar 19, 2026

Eric approves.

