Conversation

@jfongatyelp (Contributor) commented:

We ran into an infinite loop in our current TimeSpecification.get_match when it was asked for the next 30th of February, for example.

Other similar impossible job schedules, such as April 31, would have run into the same issue. I considered changing this in the schedule parsing logic instead, but the approach here also catches the (extremely slim) chance that someone specifies an impossible date using our complex scheduling format (groc daily, value: 30th day in feb).

In this PR, we no longer allow an invalid TimeSpecification to be created, and when initially scheduling the next job runs, we check whether we get a valid TimeSpecification back; if not, we simply print an error in our logs and skip that job in our JobCollection.
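For illustration, here's a minimal sketch of the kind of day-in-month check this implies; `validate_monthdays` is a hypothetical helper for this writeup, not the actual TimeSpecification code:

```python
import calendar

def validate_monthdays(months, monthdays):
    """Raise if a requested monthday can never occur in any of the given months.

    Hypothetical helper for illustration. `months` are 1-12 and `monthdays` are
    1-31; 2023/2024 are checked as a non-leap/leap pair so that Feb 29 still
    counts as reachable.
    """
    for day in monthdays:
        possible = any(
            day <= max(calendar.monthrange(year, month)[1] for year in (2023, 2024))
            for month in months
        )
        if not possible:
            raise ValueError(f"Day {day} does not exist in any specified month ({sorted(months)})")

# A "30th day in feb" schedule would now fail at construction time:
# validate_monthdays(months={2}, monthdays={30})  -> raises ValueError
```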

In general, I do think our yelpsoa-configs validation should prevent us from even getting here again, and it will provide a better interface for job owners by showing them the invalid config. But if config on disk/in memory ever got into a bad state like this, these changes would allow tron to continue gracefully, simply skipping the job with the invalid schedule. In addition, config change requests would succeed if provided with a new, valid schedule.

In addition to the TimeSpecification tests, I've tested locally and confirmed that if tron starts w/ an invalid schedule, it simply skips the job:

2025-08-28 13:12:22,169 tron.core.job_collection ERROR Failed to build job 'MASTER.testjob0': Day 30 does not exist in any specified month ([2]). Skipping job.

One unintuitive behavior resulting from this change is that the job will also no longer be shown in tronweb, even though it persists in config, which could potentially lead to confusion. That said, I think trying to still add the job while ensuring everything down the line can properly handle + display appropriate warnings for an impossible schedule may be more trouble than this edge case is worth, tbh. I'm happy to give it a shot though if ppl think otherwise.

@jfongatyelp requested a review from a team as a code owner on August 28, 2025 at 20:21

@nemacysts (Member) left a comment:

generally lgtm, just two questions

```python
try:
    return factory.build(config)
except (ValueError, Exception) as e:
    log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
```
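A minimal sketch of how a collection-level build loop could use this to drop a job whose schedule can't be built; `build_safely`, `build_all`, and `job_configs` are illustrative names, not the actual JobCollection API:

```python
import logging

log = logging.getLogger(__name__)

def build_safely(factory, config):
    """Return the built job, or None if the config is invalid (e.g. an impossible schedule)."""
    try:
        return factory.build(config)
    except Exception as e:
        log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
        return None

def build_all(factory, job_configs):
    """Build every job config, dropping (and logging) the ones that fail to build."""
    jobs = {}
    for config in job_configs:
        job = build_safely(factory, config)
        if job is not None:
            jobs[config.name] = job
    return jobs
```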

@nemacysts (Member):

i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p

that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?

@nemacysts (Member):

i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration

@jfongatyelp (Contributor, Author):

> i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p

So when I attempt to upload a config w/ a bad schedule, it does fail and the API returns an error (though it doesn't seem to generate any logs about the failure on the server side 🤔), so at least setup_tron_namespace should alert us if this happens. I'll try to see how I can make sure the error gets logged, tho.

> that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?

Great point, will run some more tests on my local instance to verify.

> i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration

For this one, I suspect we'll be ok, since I thought the way we do inter-job 'dependencies' was really just based on trigger names, rather than having any actual tie between the jobs/jobruns? But I will definitely also verify this.

@jfongatyelp (Contributor, Author):

Drat, good call -- if the tron daemon starts with a bad config (since we can't send a bad config thru the api anymore while it's running) and skips the job on startup, then if someone pushes a correction, the job will re-start from run 0. I'm guessing that's because recovery from state is only done at startup time, and not when we re-initialize the job scheduler.

I'll dive into the scheduler to see whether recovery at scheduling time is easy to achieve; otherwise, I've half a mind to leave this as Expected Behavior if you've sent an impossible schedule.

@jfongatyelp marked this pull request as a draft on October 1, 2025 at 18:19