jobs: factorise common job logic into BaseJob (bug 1983208) #475

Merged
shtrom merged 17 commits into main from bug1983208/job-model
Aug 25, 2025

@shtrom (Member) commented Aug 15, 2025

  • Create a new BaseJob model and move all common attributes from LandingJob and AutomationJob to it, as well as queue-management logic.
  • Also add a to_dict method for various JSON representation needs.
  • Update imports throughout.
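
As a rough illustration of the shape described above, here is a minimal sketch that uses plain dataclasses in place of the project's actual models. The field names (`id`, `created_at`, `target_repo`) and the `JobStatus` enum are assumptions made for the example, not the code from this PR; only the names `BaseJob`, `LandingJob`, `to_dict` and the three statuses come from the conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class JobStatus(str, Enum):
    # Only these three statuses are named in this PR's conversation.
    SUBMITTED = "SUBMITTED"
    IN_PROGRESS = "IN_PROGRESS"
    FAILED = "FAILED"


@dataclass
class BaseJob:
    """Attributes common to every job type (hypothetical field set)."""

    id: int
    status: JobStatus = JobStatus.SUBMITTED
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def to_dict(self) -> dict:
        """Return a JSON-serialisable view of the shared attributes."""
        return {
            "id": self.id,
            "status": self.status.value,
            "created_at": self.created_at.isoformat(),
        }


@dataclass
class LandingJob(BaseJob):
    target_repo: str = ""  # hypothetical job-specific attribute

    def to_dict(self) -> dict:
        # Extend, rather than duplicate, the base representation.
        return {**super().to_dict(), "target_repo": self.target_repo}
```

In this sketch a subclass only adds what is specific to it, so `LandingJob(id=1, target_repo="example").to_dict()` carries both the shared keys and `target_repo`.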

@shtrom requested review from cgsheeh and zzzeid August 15, 2025 05:47
@shtrom force-pushed the bug1983208/job-model branch from 20883c3 to 73ee347 August 15, 2025 06:14
@shtrom force-pushed the bug1983208/job-model branch from 73ee347 to 4116cbe August 15, 2025 06:23
@shtrom marked this pull request as ready for review August 15, 2025 06:34
@cgsheeh (Member) left a comment

Looks great, thanks for cleaning this up!

Please see my comments about tuples in to_dict, but otherwise this LGTM.

Comment on lines +41 to +42
When(status=cls.SUBMITTED, then=1),
When(status=cls.IN_PROGRESS, then=2),
@cgsheeh (Member) commented:

I wonder if we should process IN_PROGRESS jobs first, since those would have been at the start of the queue when the worker crashed. Probably out of scope for this PR.
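
For readers without the diff at hand, the ordering in question can be sketched in plain Python. This is an illustration only: the `Job` class, the `queue_order` helper, and the tie-break on creation time are assumptions, and the real code expresses the priority with the `When(...)` clauses quoted above.

```python
from dataclasses import dataclass
from datetime import datetime

# Lower number is processed first, mirroring the then=1 / then=2 values above.
PRIORITY = {"SUBMITTED": 1, "IN_PROGRESS": 2}


@dataclass
class Job:
    id: int
    status: str
    created_at: datetime


def queue_order(jobs, priority=PRIORITY):
    """Return the processable jobs, ordered by status priority.

    Jobs whose status is not in `priority` are left out of the queue.
    Ties are broken by creation time (an assumption of this sketch).
    """
    eligible = [job for job in jobs if job.status in priority]
    return sorted(eligible, key=lambda j: (priority[j.status], j.created_at))
```

Passing `{"IN_PROGRESS": 1, "SUBMITTED": 2}` as `priority` gives the alternative raised here, where interrupted jobs are retried before anything else.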

@zzzeid (Contributor) replied:

I think it would be better not to treat IN_PROGRESS jobs as ones we need to process; rather, they should be marked as FAILED if there's a crash. We should be able to detect that automatically when starting a landing worker. This would prevent a loop in which a job that crashes the worker gets attempted over and over.

@shtrom (Member, Author) replied:

I think the case for putting IN_PROGRESS jobs second is that if such a job kills the worker badly, processing another SUBMITTED job next allows us to inch forward, rather than being stuck in a tight fail-loop that would block the processing of any message. Not great, but it turns a critical failure into a slowdown.

There are two cases I can think of where we'd end up in this situation: either a poison-pill job, in which case marking it as FAILED would make sense, or the worker was killed for other reasons (restarts, a new deploy, cosmic rays, ...), in which case we may not want to set the job to FAILED (though it would be such a rare occurrence that, as long as there are notifications about the failure, we could probably take the hit without too much disruption).

In any case, I'd rather not change this as part of the refactor, so as not to muddy the waters, but I think it's worth considering our options.
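
The alternative proposed earlier in this thread, failing interrupted jobs when a worker starts, could look roughly like the sketch below. It is not part of this PR: the `Job` class and the `fail_interrupted_jobs` helper are hypothetical, and the sketch does not distinguish a poison-pill job from a worker killed for unrelated reasons.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: int
    status: str


def fail_interrupted_jobs(jobs):
    """Mark jobs left IN_PROGRESS by a previous worker as FAILED.

    Intended to run once at worker start-up, before any job is picked up.
    Returns the affected jobs so the caller can send failure notifications.
    """
    interrupted = [job for job in jobs if job.status == "IN_PROGRESS"]
    for job in interrupted:
        job.status = "FAILED"
    return interrupted
```

Returning the affected jobs keeps the notification step, which the comment above treats as the condition for this trade-off being acceptable, in the caller's hands.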

zzzeid previously requested changes Aug 21, 2025

@zzzeid (Contributor) left a comment

Mostly nits; the one exception is BaseJob.__str__, which I think is technically incorrect right now.

@shtrom dismissed zzzeid’s stale review August 22, 2025 03:17

Mostly addressed, except for IN_PROGRESS management, which is left for later.

@shtrom merged commit 18dfe29 into main Aug 25, 2025
1 check passed
@shtrom deleted the bug1983208/job-model branch August 25, 2025 05:12