[✨ Triage] dotnet/runtime#126251 by mus65 - Runtime logs to stderr if libgssapi_krb5 is not available #176
Description
Triage for dotnet/runtime#126251.
Repo filter: All networking issues.
MihuBot version: 246635.
Ping MihaZupan for any issues.
This is a test triage report generated by AI, aimed at helping the triage team quickly identify past issues/PRs that may be related.
Take any conclusions with a large grain of salt.
Tool logs
dotnet/runtime#126251: Runtime logs to stderr if libgssapi_krb5 is not available by mus65
Extracted 5 search queries:
- Cannot load library libgssapi_krb5.so.2 Error: libgssapi_krb5.so.2: cannot open shared object file: No such file or directory
- dotnet runtime writes native library load errors to stderr when libgssapi_krb5 is missing
- pal_gssapi.c System.Net.Security.Native logs libgssapi_krb5 load failure to stderr
- Npgsql GSSAPI fallback causes spurious stderr messages when libgssapi_krb5 not installed
- TypeInitializationException from missing libgssapi_krb5 but runtime still prints native load error to stderr
Found 21 candidate issues
Below are the most relevant prior PRs / issues I found and a short summary of each (what was discussed and any conclusions that matter for the new report about pal_gssapi.c writing to stderr when libgssapi_krb5 is missing).
PR #55037 (July 2021) - "Shim gss api on Linux to delay loading libgssapi_krb5.so"
- Summary: Introduced a native shim + on-demand dlopen() so the runtime no longer has a static link-time dependency on libgssapi_krb5. The goal was to tolerate containers that don't have krb5 installed and to avoid single-file regressions where an app would fail at process start even if it never used GSSAPI. The PR used a managed-side static constructor pattern (so initialization is delayed until the managed API is touched) and discussed concurrency/initialization approaches. This is the main change that removed the requirement for the library to be present at process startup; it is directly relevant because it’s the code-path that now does dynamic load attempts when the API is used.
- Conclusion relevant to new issue: the runtime moved to lazy-loading, so missing libgssapi_krb5 is expected in many container scenarios. However, the PR focused on deferring the load so that apps that never use GSSAPI no longer fail at startup; its discussion did not address the diagnostic text that the native shim prints to stderr when dlopen fails.
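To make the lazy-loading idea concrete, here is a minimal sketch of an on-demand dlopen() that publishes the handle once and stays silent on failure. It is not the code from PR #55037 or pal_gssapi.c: the names `s_lib` and `ensure_lib_loaded` are hypothetical, and the compare-and-swap publication is just one of several possible initialization approaches.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is an illustration, not the runtime's shim. */
static _Atomic(void*) s_lib = NULL;

/* Loads `soname` on first use. Returns 0 if a handle is available, -1 otherwise.
 * Nothing is loaded at process start, and nothing is printed on failure. */
int ensure_lib_loaded(const char* soname)
{
    assert(soname != NULL);

    if (atomic_load(&s_lib) != NULL)
        return 0; /* an earlier call already published a handle */

    void* lib = dlopen(soname, RTLD_LAZY);
    if (lib == NULL)
        return -1; /* the caller decides whether and how to report this */

    void* expected = NULL;
    if (!atomic_compare_exchange_strong(&s_lib, &expected, lib))
        dlclose(lib); /* another thread published first; drop our extra reference */

    return 0;
}
```

A caller would invoke `ensure_lib_loaded("libgssapi_krb5.so.2")` only when a GSSAPI entry point is first needed, and treat -1 as "GSSAPI not available on this system".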
PR #59526 (Sept 2021) - "Fix krb5 library SO name in the gss api shim"
- Summary: Fixed the library name probed by the shim to use the runtime SONAME (libgssapi_krb5.so.2) rather than the versionless build-time .so. The discussion covered probe order and distro differences (MIT vs Heimdal), and trade-offs of probing the versionless .so.
- Conclusion relevant to new issue: changed which filenames the shim attempts, which affects when/why dlopen fails and thus what error text appears. The change was merged.
Issue #45720 (Dec 2020) - "Publishing release as single file does not include all libraries (libgssapi_krb5)"
- Summary: Users saw single-file apps failing at start with "libgssapi_krb5.so.2: cannot open shared object file" because single-file superhost linked native shims statically and pulled in dependencies eagerly. Discussion: single-file behavior vs. non-single-file; the resolution path was to delay-load such native dependencies. PR #55037 was later referenced as the fix for this class of single-file problem.
- Conclusion relevant to new issue: the single-file/startup failure was addressed by the lazy-loading shim; but the underlying dlopen failure message (or stderr output) remained a point of UX friction.
Issue #45682 (Dec 2020) - "Is it possible to remove unused native dll references from the published executable?"
- Summary: Same root cause area — native shims statically linked into single-file executables caused unexpected native dependencies (libgssapi_krb5). Discussion suggested delaying native initialization or making managed code trigger native init (static constructor pattern used in other shims).
- Conclusion relevant to new issue: team had the same intent (delay init) — PR #55037 implemented that for GSSAPI — but the logging/noise behavior when dlopen fails was not explicitly removed here.
Issue #11891 (Jan 2019) - "Better error message when not loading native shared library"
- Summary: Long discussion about native load diagnostics on Unix/macOS and improving error messages (use dlerror, recommend LD_DEBUG / DYLD_PRINT_LIBRARIES). The runtime was changed to include better diagnostic guidance (e.g., a suggestion to set LD_DEBUG) and to incorporate dlerror output into the managed DllNotFoundException message on the Unix path.
- Conclusion relevant to new issue: the runtime already improved the managed exception messages to include OS loader diagnostics. That helps callers, but it does not directly address native code printing to stderr (fprintf) — using dlerror in exception messages is a different mechanism than suppressing native stderr output.
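The difference between the two mechanisms can be sketched as follows. This hypothetical helper (`try_load` is not a runtime function) copies the dlerror() text into a caller-supplied buffer instead of calling fprintf(stderr, ...), so a higher layer could place the text in an exception message. How the string would actually be marshalled to managed code is outside this sketch.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: tries to load `soname`. On failure it returns NULL and
 * copies the loader's message into `err`; it never writes to stderr itself. */
void* try_load(const char* soname, char* err, size_t err_len)
{
    assert(soname != NULL && err != NULL && err_len > 0);

    err[0] = '\0';
    dlerror(); /* clear any stale error from an earlier call */

    void* lib = dlopen(soname, RTLD_LAZY);
    if (lib == NULL)
    {
        const char* msg = dlerror();
        /* snprintf truncates safely if the message is longer than the buffer */
        snprintf(err, err_len, "%s", msg != NULL ? msg : "unknown loader error");
    }
    return lib;
}
```

With this shape the loader's message is still available to whoever needs it, but it only appears in logs if the caller chooses to surface it.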
Issue #82945 (Mar 2023) - "Alpine System.Net.Security.Tests failing because of 'Cannot load library libgssapi_krb5.so.2'"
- Summary: CI failure on Alpine where tests/logs contained "Cannot load library libgssapi_krb5.so.2". The test/CI noise was caused by missing libs in the test image; the team determined a prereq image update fixed the problem. This shows that the same stderr message (or similar) surfaces in CI logs and can be confusing/noisy.
- Conclusion relevant to new issue: there is precedent for the environment producing that exact message in logs and that it can cause CI/test noise/confusion.
Issue #109236 (Oct 2024) - "The type initializer for 'NetSecurityNative' throws meaningless exception rendering corrective action impossible."
- Summary: User hit a TypeInitializationException coming from NetSecurityNative/GssInitializer on Unix; the stack showed an invalid state but the message lacked actionable guidance. Discussion pointed to platform-specific causes and noted that GSS is OS-provided, so troubleshooting depends on the distro and its packages. That issue highlights the developer confusion when the managed exception is generic and the native diagnostics are sparse or ambiguous.
- Conclusion relevant to new issue: emphasizes that noisy stderr output + an unhelpful TypeInitializationException is a poor DX. It supports the request in the new issue to avoid printing confusing stderr messages when the absence of the library is an expected fallback condition.
PR #68253 (Apr 2022) - "Delete libkrb5-dev from NativeAOT prereqs"
- Summary: Because the runtime moved to dynamic loading of libgssapi_krb5 (PR #55037), the build prereq libkrb5-dev was removed from NativeAOT prerequisites. This confirms the dynamic-loading design decision is used to reduce hard build-time/runtime dependencies.
- Conclusion relevant to new issue: reinforces that the runtime now expects the library to possibly be missing on many systems, strengthening the case that noisily printing to stderr about it being missing is undesirable.
PR #70723 (June 2022) - "Fix compilation without HAVE_GSS_KRB5_CRED_NO_CI_FLAGS_X"
- Summary: Native build fixes / defensive code for platforms lacking a particular GSS/KRB5 feature macro. The PR and subsequent discussion also surfaced some platform-specific crashes in gss/ntlm stacks on some distros (not directly about logging).
- Conclusion relevant to new issue: shows ongoing platform-specific GSSAPI fragility and that native side can be sensitive to distro-specific behavior — another reason to avoid emitting confusing diagnostic noise in common scenarios.
EFCore issue / discussion: microsoft/efcore#33271 (Mar 2024) - "Running on kubernetes: Cannot load library libgssapi_krb5.so.2 ..."
- Summary: User saw the same "Cannot load library libgssapi_krb5.so.2" log lines in a Kubernetes pod which led to confusion while debugging their app; the root cause was unrelated (appsettings/cascading config or integrated-security setting). The issue was closed after the reporter clarified root cause. This is an example of how those stderr messages can mislead users during diagnosis.
- Conclusion relevant to new issue: stderr noise from GSSAPI can mislead users — another data point for suppressing or demoting such messages.
Overall findings and takeaways from the above:
- The runtime intentionally moved to lazy dlopen() for libgssapi_krb5 (PR #55037) so that missing GSSAPI libraries are a normal, expected condition on many minimal/container systems (rather than a hard startup failure). This is already in place.
- The shim probes the versioned SONAME (libgssapi_krb5.so.2) (PR #59526), so dlopen failures commonly report that exact filename.
- The runtime improved managed-side error messages to include dlerror output and to suggest diagnostic env vars (LD_DEBUG / DYLD_PRINT_LIBRARIES) when a native load fails (issue #11891), but that does not suppress native code printing to stderr.
- Multiple user reports / CI hits show the exact stderr output ("Cannot load library libgssapi_krb5.so.2\nError: ... cannot open shared object file: No such file or directory") appears in logs and causes confusion (issues #45720, #82945, EFCore #33271).
- I did not find an issue/PR that specifically removed or changed the native shim's fprintf-to-stderr behavior in pal_gssapi.c. The prior discussions focused on lazy loading, filename probing, and improving managed exception text; none explicitly resolved the "noisy stderr print on dlopen fail" question.
If you want next steps for triage:
- The new issue points to pal_gssapi.c line(s) that print to stderr on dlopen failure — since the team already expects dlopen to commonly fail, it’s reasonable to change the native shim to avoid unconditional fprintf(stderr). Two obvious alternatives discussed in the historical threads are:
  - Do not write to stderr from the native shim; instead capture dlerror() and surface it in the managed exception (like other DllNotFound changes did), or
  - Make the fprintf a debug-only log (only when a debug/tracing flag is set) so normal containers aren't polluted.
- The lazy-loading PRs (PR #55037 and PR #59526) are the best reference for how the shim is structured; use them to implement/locate change points to suppress stderr writes while still preserving useful diagnostics for developers (e.g., include dlerror in the managed exception text).
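As an illustration of the debug-only alternative mentioned above, the sketch below gates the message behind an environment variable. The variable name `EXAMPLE_GSS_SHIM_TRACE` and both function names are invented for this example; the runtime defines no such switch, and a real change would presumably use whatever tracing mechanism the team prefers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical switch name; the runtime does not define this variable. */
#define TRACE_ENV_VAR "EXAMPLE_GSS_SHIM_TRACE"

/* Returns 1 only when the hypothetical trace switch is set to "1". */
int trace_enabled(void)
{
    const char* v = getenv(TRACE_ENV_VAR);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Writes the load failure to `out` only when tracing is enabled.
 * Returns the number of messages written (0 or 1). */
int report_load_failure(FILE* out, const char* soname, const char* loader_msg)
{
    assert(out != NULL && soname != NULL);

    if (!trace_enabled())
        return 0; /* default: stay quiet, a missing library is an expected condition */

    fprintf(out, "Cannot load library %s\nError: %s\n",
            soname, loader_msg != NULL ? loader_msg : "(none)");
    return 1;
}
```

By default nothing is printed, so containers without krb5 stay quiet; a developer who sets the switch gets the same two-line message that is printed unconditionally today.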
If you'd like, I can:
- point to the exact lines in pal_gssapi.c that call fprintf / write to stderr (the issue referenced line ~125 already) and draft a minimal patch that uses dlerror() and returns an error string to managed code instead of printing, or
- open a short follow-up PR suggestion referencing the above prior work (PR #55037 / PR #59526 / issue #11891) recommending removing the stderr writes and propagating dlerror into the managed exception.