Conversation

@jfongatyelp (Contributor) commented:

We ran into an infinite loop in our current TimeSpecification.get_match when it was asked for the next 30th of February, for example.

Other similar impossible job schedules, such as April 31, would have run into the same issue. I considered changing this in the schedule parsing logic instead, but the approach here also catches the (extremely slim) chance that someone specifies an impossible date using our complex scheduling format (groc daily, value: 30th day in feb).

In this PR, we no longer allow an invalid TimeSpecification to be created, and when initially scheduling the next job runs, we check whether we get a valid TimeSpecification back; if not, we simply print an error in our logs and skip that job in our JobCollection.
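For illustration, here's a minimal sketch of the kind of day-in-month check this implies; `validate_monthdays` is a hypothetical helper for this writeup, not the actual TimeSpecification code:

```python
import calendar

def validate_monthdays(months, monthdays):
    """Raise if a requested monthday can never occur in any of the given months.

    Hypothetical helper for illustration. `months` are 1-12 and `monthdays` are
    1-31; 2023/2024 are checked as a non-leap/leap pair so that Feb 29 still
    counts as reachable.
    """
    for day in monthdays:
        possible = any(
            day <= max(calendar.monthrange(year, month)[1] for year in (2023, 2024))
            for month in months
        )
        if not possible:
            raise ValueError(f"Day {day} does not exist in any specified month ({sorted(months)})")

# A "30th day in feb" schedule would now fail at construction time:
# validate_monthdays(months={2}, monthdays={30})  -> raises ValueError
```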

In general, I do think our yelpsoa-configs validation should prevent us from even getting here again, and it will provide a better interface for job owners by showing them the invalid config. But if config on disk/in memory ever got into a bad state like this, these changes would allow tron to continue gracefully, simply skipping the job with the invalid schedule. In addition, config change requests would succeed if provided with a new, valid schedule.

In addition to the TimeSpecification tests, I've tested locally and confirmed that if tron starts w/ an invalid schedule, it simply skips the job:

2025-08-28 13:12:22,169 tron.core.job_collection ERROR Failed to build job 'MASTER.testjob0': Day 30 does not exist in any specified month ([2]). Skipping job.

One unintuitive behavior resulting from this change is that the job will also no longer be shown in tronweb, even though it persists in config, which could potentially lead to confusion. That said, I think trying to still add the job while ensuring everything down the line can properly handle + display appropriate warnings for an impossible schedule may be more trouble than this edge case is worth, tbh. I'm happy to give it a shot though if ppl think otherwise.

@jfongatyelp requested a review from a team as a code owner on August 28, 2025 at 20:21

@nemacysts (Member) left a comment:

generally lgtm, just two questions

```python
try:
    return factory.build(config)
except (ValueError, Exception) as e:
    log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
```
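A minimal sketch of how a collection-level build loop could use this to drop a job whose schedule can't be built; `build_safely`, `build_all`, and `job_configs` are illustrative names, not the actual JobCollection API:

```python
import logging

log = logging.getLogger(__name__)

def build_safely(factory, config):
    """Return the built job, or None if the config is invalid (e.g. an impossible schedule)."""
    try:
        return factory.build(config)
    except Exception as e:
        log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
        return None

def build_all(factory, job_configs):
    """Build every job config, dropping (and logging) the ones that fail to build."""
    jobs = {}
    for config in job_configs:
        job = build_safely(factory, config)
        if job is not None:
            jobs[config.name] = job
    return jobs
```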

@nemacysts (Member):

i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p

that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?

@nemacysts (Member):

i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration

@jfongatyelp (Contributor, Author):

> i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p

So when I attempt to upload a config w/ a bad schedule, it does fail and the API returns an error (though it doesn't seem to generate any logs about the failure on the server side 🤔), so at least setup_tron_namespace should alert us if this happens. I'll try to see how I can make sure the error gets logged, tho.

> that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?

Great point, will run some more tests on my local instance to verify.

> i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration

For this one, I suspect we'll be ok, since I thought the way we do inter-job 'dependencies' was really just based on trigger names, rather than having any actual tie between the jobs/jobruns? But I will definitely also verify this.

@jfongatyelp (Contributor, Author):

Drat, good call -- if the tron daemon starts with a bad config (since we can't send a bad config thru the api anymore while it's running) and skips the job on startup, then if someone pushes a correction, the job will re-start from run 0. I'm guessing that's because recovery from state is only done at startup time, and not when we re-initialize the job scheduler.

I'll dive into the scheduler to see whether recovery at scheduling time is easy to achieve; otherwise, I've half a mind to leave this as Expected Behavior if you've sent an impossible schedule.

@jfongatyelp marked this pull request as a draft on October 1, 2025 at 18:19