TRON-2477: Do some more validation of TimeSpecs #1072
base: master
Conversation
nemacysts left a comment
generally lgtm, just two questions
```python
try:
    return factory.build(config)
except (ValueError, Exception) as e:
    log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
```
i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p
that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?
i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration
> i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p
So when I attempt to upload a config with a bad schedule, this does fail and the API returns an error (though it doesn't seem to generate any logs about the failure on the server side 🤔), so at least setup_tron_namespace should alert us if this happens. I'll try to see how I can make sure the error is logged, though.
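(For reference, the kind of fix I have in mind is just logging before the error propagates out of the update path; a hypothetical sketch with assumed names, not Tron's actual config-update code:)

```python
import logging

log = logging.getLogger(__name__)

def apply_config_update(name, content, validate_config):
    """Hypothetical sketch: validate_config and its signature are assumed."""
    try:
        return validate_config(name, content)
    except Exception as e:
        # Log server-side before the API layer turns this into an error response.
        log.error(f"Rejected config update for '{name}': {e}")
        raise
```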
> that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?
Great point, will run some more tests on my local run to verify
> i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration
For this one, I suspect we'll be OK, since I thought the way we do inter-job 'dependencies' is really just based on trigger names, rather than having any actual tie between the jobs/jobruns? But I will definitely also verify this.
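To illustrate what I mean (a minimal sketch with made-up names, not Tron's actual trigger code): a downstream job only records the names of the triggers it waits on, so skipping a misconfigured upstream job shouldn't break anything structurally; its trigger just never gets published and the downstream run keeps waiting.

```python
# Minimal sketch of name-based triggers (illustrative only, not Tron's code).
published = set()

def publish(trigger_name: str) -> None:
    """Record that an upstream job run has emitted its trigger."""
    published.add(trigger_name)

def ready_to_run(required_triggers) -> bool:
    """A downstream run is eligible only once every named trigger was published."""
    return all(name in published for name in required_triggers)

# If the upstream job was skipped, publish() is never called for its trigger and
# the downstream run just stays pending -- no reference to a missing Job object.
```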
Drat, good call -- if the tron daemon starts with a bad config (because we can't send a bad config through the API anymore while it's running) and skips the job on startup, then if someone pushes a correction, tron will re-start the job from run 0. I'm guessing that's because recovery from state is only done at startup time, and not when we re-initialize the job scheduler.
I'll dive into the scheduler to see whether recovery at scheduling time is easy to achieve; otherwise I'm half-minded to leave this as expected behavior if you've sent an impossible schedule.
We ran into an infinite loop in our current TimeSpecification.get_match when asked for the next 30th of February, for example.
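The failure mode is easy to reproduce with a stripped-down matcher (an illustration only, not the actual TimeSpecification.get_match implementation): a search that walks forward day by day looking for a (month, day) pair that no calendar date can ever satisfy simply never returns.

```python
import datetime

def next_match(start: datetime.date, month: int, day: int) -> datetime.date:
    """Naive forward search for the next date with the given month and day."""
    current = start
    while True:  # no upper bound, so an impossible (month, day) pair spins forever
        current += datetime.timedelta(days=1)
        if current.month == month and current.day == day:
            return current

# next_match(datetime.date(2024, 1, 1), month=2, day=30) never returns.
```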
Other similar impossible job schedules, such as April 31, would have run into the same issue. I considered changing this in the schedule parsing logic, but this will also catch the (extremely slim) chance that someone specifies an impossible date using our more complex groc scheduling format (e.g. a daily spec with a value like the 30th day in feb).

In this PR, we no longer allow an invalid TimeSpecification to be created, and when initially scheduling the next job runs, we check whether we get a valid TimeSpecification back; if not, we simply log an error and skip that job in our JobCollection.
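For a rough sense of the kind of check involved (a simplified sketch, not the actual validation added in this PR), a (month, day) pair can be rejected when no month/year combination can ever contain it; using a leap year gives February its maximum length, so Feb 29 stays legal while Feb 30 and April 31 are rejected.

```python
import calendar

def is_possible_monthday(month: int, day: int) -> bool:
    """Return True if some calendar year contains this (month, day) combination."""
    if not 1 <= month <= 12:
        return False
    # 2024 is a leap year, so this yields the maximum number of days per month.
    return 1 <= day <= calendar.monthrange(2024, month)[1]

assert is_possible_monthday(2, 29)      # valid in leap years
assert not is_possible_monthday(2, 30)  # the impossible schedule from this PR's example
assert not is_possible_monthday(4, 31)  # another impossible schedule
```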
In general, I do think our yelpsoa-configs validation should prevent us from even getting here again, and it will provide a better interface for job owners by showing the invalid config. But if config on disk/in memory ever got into a bad state like this, these changes allow tron to continue gracefully, simply skipping the job with the invalid schedule. In addition, config change requests will succeed once provided with a new, valid schedule.
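In code terms, the graceful path is just a skip-on-failure loop when building the job collection (a simplified sketch with illustrative names, not the exact code touched by this PR):

```python
import logging

log = logging.getLogger(__name__)

def build_jobs(job_configs, factory):
    """Build every job we can; log and skip any whose schedule is invalid."""
    jobs = {}
    for config in job_configs:
        try:
            jobs[config.name] = factory.build(config)
        except ValueError as e:
            log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
    return jobs
```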
In addition to the TimeSpecification tests, I've tested locally and ensured that if tron starts with an invalid schedule, it simply skips the job:
One unintuitive behavior resulting from this change is that the job will also no longer be shown in tronweb even though it persists in config, which could potentially lead to confusion. That said, I think trying to add the job anyway and ensuring everything down the line can properly handle it and display appropriate warnings when it has an impossible schedule may be more trouble than this edge case is worth, tbh. I'm happy to give it a shot, though, if people think otherwise.