@supervacuus supervacuus commented Nov 10, 2025

This is a more elaborate, long-term fix to getsentry/sentry-java#4830 than #1444.

It also finishes the work done here: #1088
And fixes the issues raised here: #1353
and here: #906

So, while the driver for this PR is a downstream issue that exposes the signal-unsafety of some parts of the current inproc implementation, it also addresses a much broader range of concerns that regularly affect inproc users on all platforms.

At a high level, it introduces a separate handler thread for inproc, which the signal handler (or UEF on Windows) wakes after it exchanges crash context data.

The idea is to limit the signal handler/UEF to the bare minimum of system calls (ideally only the subset documented as async-signal-safe in the signal-safety man page), while the handler thread can execute functions outside that set (with limitations, since thread synchronization and heap allocations are still problematic). This allows us to reuse stdio functionality like formatters without running squarely into UB territory or having to rewrite all utilities as async-signal-safe versions, as in #1444.
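To illustrate the hand-off described above, here is a minimal sketch, not the PR's implementation. It assumes a POSIX system and uses two pipes instead of the semaphores mentioned in the commits, because `write()` and `read()` are both on the async-signal-safe list. All names (`g_wake`, `g_done`, `on_signal`, `run_handler_thread_demo`) are hypothetical, and `SIGUSR1` stands in for a real crash signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int g_wake[2]; /* signal handler -> handler thread */
static int g_done[2]; /* handler thread -> signal handler */
static volatile sig_atomic_t g_signum; /* stand-in for the crash context */
static char g_report[64];

/* Blocks until one byte arrives; read() is async-signal-safe. */
static int wait_for_token(int fd)
{
    char token;
    ssize_t n;
    do {
        n = read(fd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Sends one byte; write() is async-signal-safe. */
static int send_token(int fd)
{
    char token = 'x';
    ssize_t n;
    do {
        n = write(fd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

static void *handler_thread(void *arg)
{
    (void)arg;
    if (wait_for_token(g_wake[0]) == 0) {
        /* snprintf() is not async-signal-safe, but this is a regular thread. */
        snprintf(g_report, sizeof(g_report), "caught signal %d", (int)g_signum);
    }
    send_token(g_done[1]);
    return NULL;
}

static void on_signal(int signum)
{
    int saved_errno = errno;
    g_signum = signum;              /* 1. exchange the crash context */
    if (send_token(g_wake[1]) == 0) /* 2. wake the handler thread */
        wait_for_token(g_done[0]);  /* 3. block until it has finished */
    errno = saved_errno;
}

int run_handler_thread_demo(void)
{
    pthread_t thread;
    struct sigaction sa;

    if (pipe(g_wake) != 0 || pipe(g_done) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (pthread_create(&thread, NULL, handler_thread, NULL) != 0)
        return -1;

    raise(SIGUSR1); /* harmless stand-in for a real crash signal */
    pthread_join(thread, NULL);
    return strncmp(g_report, "caught signal ", 14) == 0 ? 0 : 1;
}
```

The signal handler itself only stores a value and calls `write()`/`read()`; the formatting happens on the handler thread, which is the split the description argues for.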

There are a few considerable changes to mention:

  • since we run the event construction in a separate handler thread, the use of backtrace() or any unwinder that runs from the "current" instruction address is entirely useless (ignoring the fact that backtrace() was always signal-unsafe to begin with, which itself was the source of crashes, hangs or just empty stack traces).
  • this means we require a "user context"-based stack walker in inproc, which we already partially acknowledged in "Using libunwind for mac, since backtrace do not expect thread context…" (#1088) and "fix: support musl on Linux" (#1233).
  • on Linux, this PR requires libunwind (the nongnu implementation, not the LLVM one, which is a pure C++ exception unwinder). This is a breaking change, at least in the sense that users now need an additional dependency at build time and at runtime. It also means that "general" Linux usage is now the same as in musl libc environments.
  • on macOS, we provide a user context stack-walker based on frame pointer records for arm64 and x86-64, and use the system-provided libunwind for the default stack-trace from a call-site. It turned out that the system-provided libunwind wasn't safe enough to use in the context of the signal handler (either led to hangs or had issues with escaping the trampoline). This means users can now use inproc on macOS again (if their code is compiled without omitting frame pointers, which is always the case by default on macOS).
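For readers unfamiliar with frame-pointer walking, the following is a rough sketch of the idea, not the walker in this PR. It assumes the usual arm64/x86-64 frame record layout (saved caller frame pointer followed by the return address) and takes the starting frame pointer and the stack bounds as plain arguments; in a real handler these would come from the crashed thread's user context. The names `frame_record_t`, `walk_frame_records` and `frame_walk_demo` are made up for illustration, and the demo walks a synthetic chain rather than a live stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Frame record when frame pointers are kept: the saved caller frame pointer
 * followed by the return address. */
typedef struct frame_record {
    const struct frame_record *caller;
    uintptr_t return_address;
} frame_record_t;

/* Walks frame records starting at `fp`. Every record must lie inside
 * [stack_lo, stack_hi), be pointer-aligned, and sit at a higher address
 * than the previous one, so a corrupted chain cannot loop or escape. */
size_t walk_frame_records(uintptr_t fp, uintptr_t stack_lo, uintptr_t stack_hi,
    uintptr_t *out, size_t max_frames)
{
    size_t count = 0;
    while (count < max_frames) {
        const frame_record_t *record;
        uintptr_t next;

        if (fp < stack_lo || fp >= stack_hi
            || stack_hi - fp < sizeof(frame_record_t)) {
            break; /* record would not fit inside the stack range */
        }
        if (fp % sizeof(void *) != 0) {
            break; /* misaligned frame pointer */
        }
        record = (const frame_record_t *)fp;
        if (record->return_address == 0) {
            break; /* end of the chain */
        }
        out[count++] = record->return_address;

        next = (uintptr_t)record->caller;
        if (next <= fp) {
            break; /* callers must live at higher addresses */
        }
        fp = next;
    }
    return count;
}

/* Walks a synthetic three-frame chain and returns the number of frames. */
size_t frame_walk_demo(void)
{
    frame_record_t stack[3];
    uintptr_t frames[8];

    stack[0].caller = &stack[1];
    stack[0].return_address = 0x1000;
    stack[1].caller = &stack[2];
    stack[1].return_address = 0x2000;
    stack[2].caller = NULL;
    stack[2].return_address = 0x3000;

    return walk_frame_records((uintptr_t)&stack[0], (uintptr_t)&stack[0],
        (uintptr_t)(stack + 3), frames, 8);
}
```

The range check plays the role that the memory-mapping validation plays in the libunwind-based modules: a bad pointer ends the walk instead of faulting inside the crash handler.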

Further improvements/fixes (summarizing the 30 commits, which I didn't want to squash):

  • the libunwind-based unwinder modules now also validate retrieved ucontext pointers against memory mapping (for Linux and macOS)
  • got rid of all remaining __sync functions and replaced them with __atomic (especially the signal handler blocking logic and the spinlock)
  • rectified the inconsistent usage of C++ new with std::nothrow throughout the affected backend code (including the initialization of crashpad_state_t, which still used malloc and memset although it has std::atomic members)
  • cleaned up the CMake configure phase of the integration test suite
  • ensured that test fixtures do not end up in macOS bundles
  • fixed build issues with by-default PIE and LTO builds
  • musl is no longer a special-case "Linux" in the build script
  • fixed a couple of warnings and test-case instabilities
  • introduced a macos-26 build config
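As an illustration of the `__sync` to `__atomic` migration mentioned above, here is a small sketch of a spinlock built on the GCC/Clang `__atomic` builtins. It is not the spinlock from this PR; `demo_spinlock_t` and its functions are hypothetical names, and the comments name the legacy builtins each call would replace.

```c
#include <stdbool.h>

typedef struct {
    int locked; /* 0 = free, 1 = held */
} demo_spinlock_t;

/* Legacy equivalent: __sync_lock_test_and_set(&lock->locked, 1) == 0 */
bool demo_spinlock_try_lock(demo_spinlock_t *lock)
{
    return __atomic_exchange_n(&lock->locked, 1, __ATOMIC_ACQUIRE) == 0;
}

void demo_spinlock_lock(demo_spinlock_t *lock)
{
    while (!demo_spinlock_try_lock(lock)) {
        /* Spin on a plain load so waiters do not keep writing the line. */
        while (__atomic_load_n(&lock->locked, __ATOMIC_RELAXED)) {
        }
    }
}

/* Legacy equivalent: __sync_lock_release(&lock->locked) */
void demo_spinlock_unlock(demo_spinlock_t *lock)
{
    __atomic_store_n(&lock->locked, 0, __ATOMIC_RELEASE);
}
```

The main practical difference is that the `__atomic` builtins take an explicit memory order, so acquire/release intent is visible at the call site.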

TODOs:

  • finish the x86-64 stackwalker for macOS (and clean up the code)
  • Figure out if we need the libbacktrace fallback at all and how to handle it.
  • provide a module-level description of the new mechanism in inproc
  • Decide on having the change
  • Update documentation
    • Advanced usage might be outdated with respect to inproc signal handling
    • Remove mentions of inproc not working on macOS
    • Clarify the new libunwind dependency on Linux

* use `std::nothrow` `new` consistently to keep exception-free semantics for allocation
* rename static crashpad_handler to have no module-public prefix
* use `nullptr` for arguments where we previously used 0 to clarify that those are pointers
* eliminate the `memset()` in the `crashpad_state_t` initialization, since the struct now contains non-trivially constructible fields (`std::atomic`), and replace it with `new` and an empty value initializer.
…ld, since libraries like libunwind.a might be packaged without PIC.
…ms with architecture prefixes (32-bit Linux)
…stack

also ensure to get the first frame
harmonize libunwind usage
…eader in the libunwind walker for Linux and log as much as possible to understand where the actual crash happens
…nd running the deferred code directly inside the signal handler. Nothing changes for them.
…phore on the return channel and let the OS block and wait.

Also check the return value of startup_handler_thread in the initialization and propagate the failure.
…rancy guard

* up to now, we've been serializing signal handling even though we didn't know whether it was a runtime signal or one we should be handling
* this meant that we blocked all our critical sections during a managed exception
* it also meant that we blocked any concurrent managed exceptions
* it also meant that we introduced a race window while chaining, because incoming signals on other threads would have been next in line before we had even completed the current signal handler

by moving it completely outside our synchronization we truly chain at start and don't interfere until we know we must.
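A minimal sketch of what "chain at start" means in terms of handler ordering, under the assumption of a POSIX `sigaction` setup. This is not the code from the commit: `runtime_handler` is a hypothetical stand-in for a managed runtime's handler, `native_handler` for ours, and `SIGUSR2` for a real crash signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static struct sigaction g_previous; /* handler that was installed before ours */
static volatile sig_atomic_t g_step;
static volatile sig_atomic_t g_runtime_step;
static volatile sig_atomic_t g_native_step;

/* Hypothetical stand-in for a managed runtime's signal handler. */
static void runtime_handler(int signum, siginfo_t *info, void *uctx)
{
    (void)signum;
    (void)info;
    (void)uctx;
    g_runtime_step = ++g_step;
}

static void invoke_previous(int signum, siginfo_t *info, void *uctx)
{
    if (g_previous.sa_flags & SA_SIGINFO) {
        g_previous.sa_sigaction(signum, info, uctx);
    } else if (g_previous.sa_handler != SIG_DFL
        && g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(signum);
    }
}

static void native_handler(int signum, siginfo_t *info, void *uctx)
{
    /* Chain first, before touching any lock or shared state. */
    invoke_previous(signum, info, uctx);

    /* Only past this point would a real handler enter its synchronized
     * section; here we just record the order. */
    g_native_step = ++g_step;
}

int run_chain_at_start_demo(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;

    sa.sa_sigaction = runtime_handler; /* the "runtime" installs first */
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;

    sa.sa_sigaction = native_handler; /* we install second and keep the old one */
    if (sigaction(SIGUSR2, &sa, &g_previous) != 0)
        return -1;

    raise(SIGUSR2);
    return (g_runtime_step == 1 && g_native_step == 2) ? 0 : 1;
}
```

Because the previous handler runs before any of our synchronization, a signal the runtime handles itself never holds our critical sections.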
@supervacuus supervacuus marked this pull request as ready for review November 20, 2025 10:52
}
if (sentry__atomic_fetch(&g_handler_should_exit)) {
break;
}
Bug: Handler thread exits without processing crash

The handler thread checks g_handler_should_exit immediately after waking from the semaphore, before checking g_handler_has_work. If shutdown is initiated after the signal handler signals the semaphore but before the handler thread processes the work flag, the crash event will be lost because the thread exits without processing it. The same issue exists on UNIX at lines 833-835. The check for g_handler_should_exit needs to happen after verifying and processing any pending work to ensure crashes are never dropped during shutdown.

Collaborator Author

I am open to discussion about this. I am a big fan of letting the shutdown request overrule any others. In this case, it is unlikely they will happen at the same time, but it would have to be either-or. As such, it isn't really a bug, but rather a policy decision.
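To make the two policies concrete, here is a small sketch of both check orders. It is only an illustration of the trade-off discussed in this thread, not the handler loop from the PR; `g_demo_should_exit` and `g_demo_has_work` are hypothetical counterparts of the flags named in the review comment, and C11 `<stdatomic.h>` is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef enum {
    HANDLER_WAIT,
    HANDLER_PROCESS,
    HANDLER_EXIT
} handler_action_t;

static atomic_bool g_demo_should_exit;
static atomic_bool g_demo_has_work;

/* Policy in the current code: a shutdown request overrules pending work. */
handler_action_t next_action_shutdown_first(void)
{
    if (atomic_load(&g_demo_should_exit))
        return HANDLER_EXIT;
    if (atomic_exchange(&g_demo_has_work, false))
        return HANDLER_PROCESS;
    return HANDLER_WAIT;
}

/* Policy suggested by the review: drain pending work, then honor shutdown. */
handler_action_t next_action_work_first(void)
{
    if (atomic_exchange(&g_demo_has_work, false))
        return HANDLER_PROCESS;
    if (atomic_load(&g_demo_should_exit))
        return HANDLER_EXIT;
    return HANDLER_WAIT;
}
```

With both flags set, the first variant returns `HANDLER_EXIT` immediately, while the second returns `HANDLER_PROCESS` once and `HANDLER_EXIT` on the next iteration. Which one is right depends on whether shutdown or crash delivery should win the race.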

@supervacuus supervacuus requested a review from vaind November 20, 2025 11:19
# endif

# ifdef SENTRY_PLATFORM_UNIX
sentry__enter_signal_handler();
Bug: Concurrent signal handlers corrupt shared handler state

On UNIX, dispatch_ucontext calls sentry__leave_signal_handler() at line 1064 before waiting for the handler thread to complete. This allows a second signal to arrive and call sentry__enter_signal_handler() successfully, then proceed to overwrite the global g_handler_state structure at lines 1042-1061 while the handler thread is still reading from it at line 853. The single global g_handler_state variable has no synchronization protecting concurrent access between multiple signal handlers and the handler thread, leading to potential data corruption when multiple crashes occur in quick succession across different threads.

Collaborator Author

This is correct, and was a recurring topic during the development of the changes in this PR. I first wanted to have feedback on how to proceed with the rest. The solution here will be a two-stage blocking mechanism, which I have successfully experimented with in previous commits. However, since the signal handler blocking must also support the other backend handlers, I wanted to have a first review.
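For context on the race itself, the sketch below shows one generic way to guard a single shared state slot with a compare-and-swap, so that a second crashing thread cannot overwrite it while the handler thread is still reading. This is explicitly not the two-stage blocking mechanism mentioned above, whose details are not part of this conversation; `demo_handler_state_t` and the `demo_state_*` functions are hypothetical, and C11 `<stdatomic.h>` with a lock-free `atomic_int` is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { SLOT_IDLE, SLOT_FILLING, SLOT_READY };

typedef struct {
    atomic_int stage;
    int signum; /* stand-in for the full crash context */
} demo_handler_state_t;

static demo_handler_state_t g_demo_state;

/* Signal-handler side: only one crashing thread can claim the slot. */
bool demo_state_publish(int signum)
{
    int expected = SLOT_IDLE;
    if (!atomic_compare_exchange_strong(
            &g_demo_state.stage, &expected, SLOT_FILLING)) {
        return false; /* another crash is still in flight */
    }
    g_demo_state.signum = signum;
    atomic_store(&g_demo_state.stage, SLOT_READY);
    return true;
}

/* Handler-thread side: read the slot, then hand it back. */
bool demo_state_consume(int *signum_out)
{
    if (atomic_load(&g_demo_state.stage) != SLOT_READY)
        return false;
    *signum_out = g_demo_state.signum;
    atomic_store(&g_demo_state.stage, SLOT_IDLE);
    return true;
}
```

A losing thread would still have to decide what to do (wait, spin, or give up), which is where the interaction with the other backends' signal handler blocking comes in.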

@supervacuus
Collaborator Author

@jpnurmi: I primarily added you regarding the chain-at-start handler strategy. The most significant change in that regard is that we no longer block anything when chaining at the start (see 6b6e545 for details).

@vaind: I primarily added you here because I know you consume inproc downstream and may also be affected by changes to the unwinder-to-platform mapping in the root CMake script. I don't think that any of the Windows changes will cause particular issues downstream, but differing build configurations could cause some pain.
