healthcheck feature #599
base: main
|
I think this won't work for the reasons #598 doesn't - no support for swapping timers to transition startup healthchecks to regular healthchecks. |
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
Thanks for taking a look, mheon! It's really appreciated. I've made the changes to handle healthchecks in the starting phase. Let me know your feedback! Thanks |
|
I've opened a PR on the podman side that reads from conmon's pipe: containers/podman#27067 |
|
I'm a bit hesitant about this one because we are strongly contemplating a Conmon rewrite to be released alongside Podman 6 in May - containers/podman#27053 - so additional work on the existing, C conmon seems a bit wasted in light of that. But if this can be done for our November release of 5.7 and that's of use to you, I'm not entirely opposed? |
src/healthcheck.c
Outdated
| } | ||
| if (!cJSON_IsArray(test_array) || cJSON_GetArraySize(test_array) < 2) { | ||
| nwarn("Healthcheck configuration missing required 'test' command"); | ||
| cJSON_Delete(json); |
healthcheck_config_free(config); is missing here?
config is allocated on the stack in conmon.c and later released with healthcheck_config_free, because some of its inner attributes are allocated on the heap. The free call was not in the right place, though, so I moved it.
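The ownership pattern described in this reply can be sketched as follows. This is only an illustration: the struct layout and the body of healthcheck_config_free are assumptions, not the PR's actual code. The point is that the free function releases heap-owned members only, never the struct itself, and that it is safe to call on a partially filled or already-freed config.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout; the real struct in the PR may differ. */
typedef struct {
    char **test;      /* NULL-terminated command array, heap-allocated */
    int interval;     /* seconds */
    int timeout;      /* seconds */
    int start_period; /* seconds */
} healthcheck_config_t;

/* Release heap-owned members only. The struct itself lives on the
 * caller's stack, so it must not be passed to free(). */
static void healthcheck_config_free(healthcheck_config_t *config)
{
    if (config == NULL)
        return;
    if (config->test != NULL) {
        for (char **p = config->test; *p != NULL; p++)
            free(*p);
        free(config->test);
        config->test = NULL; /* makes a second call harmless */
    }
}
```

With this shape, the parser's error paths do not need to free anything themselves as long as the caller zero-initializes the struct and always calls healthcheck_config_free once at the end.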
src/healthcheck.c
Outdated
| if (!cJSON_IsString(cmd_type) || !cJSON_IsString(cmd_value)) { | ||
| nwarn("Healthcheck test command must be an array of strings"); | ||
| cJSON_Delete(json); |
healthcheck_config_free(config); is missing here too?
It shouldn't be freed there AFAIU, for the same reason as above.
src/healthcheck.c
Outdated
| /* Parse Interval (now in seconds) */ | ||
| if (strcmp(cmd_type->valuestring, "CMD") == 0 || strcmp(cmd_type->valuestring, "CMD-SHELL") == 0) { | ||
| /* Create test command array */ |
Please validate the command string here, e.g. strlen(cmd_value->valuestring) == 0 || strlen(cmd_value->valuestring) > 4096 ?
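A minimal sketch of the length check suggested here, assuming the command arrives as a NUL-terminated string (as cJSON's valuestring does). The helper name and the HEALTHCHECK_CMD_MAX macro are hypothetical; 4096 is just the limit proposed in the comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limit taken from the review suggestion; not an existing conmon constant. */
#define HEALTHCHECK_CMD_MAX 4096

/* Returns true when cmd is non-NULL, non-empty, and at most
 * HEALTHCHECK_CMD_MAX bytes long (excluding the terminating NUL). */
static bool healthcheck_cmd_is_valid(const char *cmd)
{
    if (cmd == NULL)
        return false;
    size_t len = strlen(cmd);
    return len > 0 && len <= HEALTHCHECK_CMD_MAX;
}
```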
src/healthcheck.c
Outdated
| if (strcmp(cmd_type->valuestring, "CMD") == 0 || strcmp(cmd_type->valuestring, "CMD-SHELL") == 0) { | ||
| /* Create test command array */ | ||
| config->test = calloc(2, sizeof(char*)); | ||
| config->test[0] = strdup(cmd_value->valuestring); |
Test calloc and strdup for failure here.
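One way to act on this comment, sketched under the assumption that the test array holds a single command followed by a NULL terminator, as in the diff above. The helper name healthcheck_make_test is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Build a NULL-terminated, single-entry command array.
 * Returns NULL if either allocation fails, leaving nothing leaked. */
static char **healthcheck_make_test(const char *cmd)
{
    char **test = calloc(2, sizeof(char *));
    if (test == NULL)
        return NULL;

    test[0] = strdup(cmd);
    if (test[0] == NULL) {
        free(test); /* undo the first allocation before reporting failure */
        return NULL;
    }

    /* test[1] is already NULL because calloc zero-fills the array. */
    return test;
}
```

The caller then only has one failure case to handle: a NULL return.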
src/healthcheck.c
Outdated
| config->test[1] = NULL; | ||
| } else { | ||
| nwarnf("Unsupported healthcheck command type: %s (only CMD and CMD-SHELL supported)", cmd_type->valuestring); | ||
| cJSON_Delete(json); |
missing healthcheck_config_free(config);
Same reason as above.
src/healthcheck.c
Outdated
| cJSON *interval = cJSON_GetObjectItem(json, "interval"); | ||
| if (cJSON_IsNumber(interval)) { | ||
| config->interval = (int)interval->valuedouble; | ||
| if (!cJSON_IsNumber(interval) || interval->valuedouble <= 0) { |
Maybe also add a check for the max value here?
src/healthcheck.c
Outdated
| cJSON *timeout = cJSON_GetObjectItem(json, "timeout"); | ||
| if (cJSON_IsNumber(timeout)) { | ||
| config->timeout = (int)timeout->valuedouble; | ||
| if (!cJSON_IsNumber(timeout) || timeout->valuedouble <= 0) { |
Also max value check here please + missing healthcheck_config_free(config); again
src/healthcheck.c
Outdated
| cJSON *start_period = cJSON_GetObjectItem(json, "start_period"); | ||
| if (cJSON_IsNumber(start_period)) { | ||
| config->start_period = (int)start_period->valuedouble; | ||
| if (!cJSON_IsNumber(start_period) || start_period->valuedouble < 0) { |
Also max value check here please.
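The max-value checks requested for interval, timeout and start_period could share one helper. The sketch below is an assumption about how that might look; the helper name and the HEALTHCHECK_MAX_SECONDS bound are hypothetical. The upper bound matters for more than policy: in C, converting a double whose value does not fit in an int is undefined behavior, so the range test has to happen before the cast.

```c
#include <stdbool.h>

/* Hypothetical upper bound for any healthcheck duration, in seconds. */
#define HEALTHCHECK_MAX_SECONDS 3600

/* Convert a JSON number to whole seconds if it lies in [min, max].
 * On failure, *out is left untouched. */
static bool healthcheck_seconds_from_double(double value, int min, int max, int *out)
{
    /* Written as a negated conjunction so that NaN, for which every
     * comparison is false, is rejected as well. */
    if (!(value >= (double)min && value <= (double)max))
        return false;
    *out = (int)value; /* safe: value is known to fit in int */
    return true;
}
```

interval and timeout would be checked with min set to 1, while start_period would use min set to 0, matching the `<= 0` and `< 0` conditions in the diff.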
999af70 to
040f3e7
Compare
|
Integration tests failed for cri-o. Could it be a race condition? I don't see how this is related to the changes in this PR 🤔 |
|
I was able to run the "crio restore with missing config.json" test locally. It's unclear what happened, but I have the feeling that it could pass if run again. Do we sometimes see flaky tests in CI? @jnovy Sorry to bother you again, could you rerun the integration tests if you get a chance? |
|
You are right @samifruit514 - the integration failures don't seem related to the PR. Now let's simplify the code, as it is super-bloated in its current state - we can remove about two thirds of the code while still maintaining the functionality, and also drop the JSON parsing completely:
// Use GLib timeout source instead of pthread
healthcheck_timer_id = g_timeout_add_seconds(healthcheck_interval,
healthcheck_timer_callback, NULL);
Then error handling and memory management can be significantly simplified. Also, there is no test coverage but let's cover it once the above is done - and code simplified. Does it make sense @samifruit514 ? |
|
Awesome, makes sense! I will make the changes.
f165d2f to
efb4380
Compare
|
Thanks, there are merge conflicts in |
0cbd85f to
65eba53
Compare
|
Indeed, conflicts 😅. I still need to get more comfortable with PRs against upstreams. They are now resolved, thanks. As for the hash_table function, I cleaned it up. It was just leftover code from when I mistakenly thought conmon handled multiple containers 🤦. I also reviewed the rest of the code as best I could (with my somewhat limited C knowledge, plus a little AI help 😉) to make sure everything is as clean as possible. |
|
Sorry to bother you again, @jnovy. Do you have time to review the changes? |
jnovy
left a comment
The BATS tests primarily cover CLI parsing but miss:
- Timeout enforcement testing
- Start period transition logic
- Error handling scenarios
- Resource cleanup verification
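For the start period item in particular, a test has to pin down what the transition is supposed to do. The sketch below is not taken from the PR; it assumes Docker-style semantics (failures inside the start period are not counted towards retries, and the first success ends the start period early), and all type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { HC_STARTING, HC_HEALTHY, HC_UNHEALTHY } hc_status_t;

/* Hypothetical per-container healthcheck state. */
typedef struct {
    int start_period;   /* seconds after start during which failures are ignored */
    int retries;        /* consecutive failures needed to become unhealthy */
    int failures;       /* consecutive failures counted so far */
    bool started;       /* true once any check has passed */
    hc_status_t status;
} hc_state_t;

/* Record one check result taken `elapsed` seconds after container start
 * and return the resulting status. */
static hc_status_t hc_record_result(hc_state_t *s, int elapsed, bool passed)
{
    if (passed) {
        s->started = true; /* a success ends the start period early */
        s->failures = 0;
        s->status = HC_HEALTHY;
        return s->status;
    }

    if (!s->started && elapsed < s->start_period)
        return s->status;  /* failure inside the start period: not counted */

    s->failures++;
    if (s->failures >= s->retries)
        s->status = HC_UNHEALTHY;
    return s->status;
}
```

Keeping the decision in a pure function like this means the transition can be unit-tested without starting a container or waiting on real timers.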
9e63335 to
c68bc82
Compare
jnovy
left a comment
Looking much better now. FYI - I will be AFK for the next two weeks..
Signed-off-by: Samuel Archambault <samuel.archambault@getmaintainx.com>
|
@jnovy: Sorry to bother you once again! Do you have time to review the changes? |
|
Personally I don't see the point of this. We are in the process of rewriting conmon in Rust, so I'd rather not add large features here that increase the scope of the rewrite. But more importantly, I don't see how this can help podman here: from a quick look, this is calling the OCI runtime directly instead of the proper
Podman is not a daemon. If we want to add healthchecks to conmon, then I think we first need a proper design discussion. |
|
Hey @Luap99! Thanks for your feedback, it's really appreciated! How about this: @jnovy, it's a bit different from what we've done so far, but it is decoupled from podman. What do you think? |
|
I'm back @samifruit514 - I agree with @Luap99 that this feature deserves proper design discussion so that conmon and podman interaction is well thought through and functional. On the other hand I don't necessarily think that discussion and implementation can't happen here - involving @jankaluza - as the feature can land sooner than the time allows in the Rust reimplementation. Shall we start the design discussion now? |
context:
We are running a multi-container application with docker-compose, but through podman (podman system service --log-level debug unix:///tmp/podman.sock). The app definitions inside the docker-compose.yml file contain a number of health checks and dependencies. Since we run this in CI, WITHOUT systemd, no healthchecks run (no units are created for them). Because of that, the multi-container app doesn't start.
According to the podman people, conmon should handle health checks, or would at least be a great candidate to do so.
This PR accepts --enable-healthcheck in conmon's args, enabling healthchecks that are reported from conmon to podman through the unix pipe (the same one that sends the PID to link).
For more info on healthcheck handling by podman: containers/podman#27033