Skip to content

Improve OTP shutdown behavior for consumers#103

Open
nsweeting wants to merge 4 commits intomasterfrom
cleanup-shutdown
Open

Improve OTP shutdown behavior for consumers#103
nsweeting wants to merge 4 commits intomasterfrom
cleanup-shutdown

Conversation

@nsweeting
Copy link
Copy Markdown
Owner

Summary

Improves the consumer shutdown sequence to follow proper OTP conventions. Previously, incorrect child spec types, missing consumption cancellation, and lack of graceful drain logic meant that a SIGTERM (e.g. k8s pod termination) would result in abrupt process kills rather than an orderly shutdown.

Problems Fixed

  • Consumer.Supervisor was typed as :worker with a 5s shutdown — the entire consumer subtree had only 5 seconds before being :killed. Now correctly typed as :supervisor with :infinity shutdown, allowing children to drain properly.
  • No consumption cancellation during shutdown — the broker continued delivering new messages while workers were shutting down. Now calls AMQP.Basic.cancel first.
  • Workers stopped sequentially — on an N-core machine with N workers, sequential DynamicSupervisor.stop calls could easily exceed the shutdown budget. Now stops all workers in parallel.
  • Executer had no terminate/2 — when shut down externally, the spawned message-processing process was killed with no opportunity to finish. Now uses Task.async + Task.shutdown/2 to give in-flight work a 5s grace period before escalating, and nacks with requeue: true as a safety net.
  • No explicit shutdown timeouts — all processes used OTP defaults (5s). Now uses a proper timeout chain: Consumer.Supervisor (:infinity) > Consumer.Server (30s) > Workers/Executers (25s).

Corrected Shutdown Sequence

SIGTERM
  -> Consumer.Supervisor gets :shutdown (timeout: :infinity)
    -> Each Consumer.Server gets :shutdown (timeout: 30s)
      -> terminate/2:
        1. AMQP.Basic.cancel — stop new message delivery
        2. Stop all Workers in parallel (25s timeout)
           -> Each Executer.terminate/2:
              a. Task.shutdown(task, 5_000) — grace period for in-flight work
              b. Safety-net nack with requeue: true if not completed
        3. AMQP.Channel.close — clean channel shutdown
  -> Producer.Pool, Topology.Server, Connection.Pool shut down after

Changes

  • lib/rabbit/broker/supervisor.ex — Consumer.Supervisor child spec gets type: :supervisor
  • lib/rabbit/consumer/supervisor.ex — Consumer.Server child specs get shutdown: 30_000
  • lib/rabbit/consumer/server.ex — Add cancel_consumer/1, parallel stop_workers/1
  • lib/rabbit/consumer/executer.ex — Refactor to Task.async, add terminate/2 with grace period and safety-net nack, shutdown: 25_000, completed state tracking
  • mix.exs — Version bump to 0.22.0

- Set Consumer.Supervisor child spec type to :supervisor so it gets
  shutdown: :infinity instead of being killed after 5 seconds
- Set explicit shutdown: 30_000 on each Consumer.Server child spec
- Cancel AMQP consumption (Basic.cancel) before stopping workers so
  no new messages arrive during drain
- Stop workers in parallel instead of sequentially to fit within
  the shutdown budget
- Add terminate/2 to Executer that nacks unfinished messages with
  requeue: true, preventing unacked messages from accumulating on
  quorum queues
- Set Executer child_spec shutdown: 25_000 to give in-flight messages
  time to complete before the safety-net nack
- Bump version to 0.22.0
Replace spawn_link with Task.async so terminate/2 can use
Task.shutdown/2 to give in-flight messages a 5s grace period
to complete before escalating to :kill and safety-net nacking.

Previously the spawned process was killed immediately on shutdown
with no chance to finish. Now the sequence is:
1. Task.shutdown(task, 5_000) - sends :shutdown, waits 5s
2. If task doesn't finish, brutally kills it
3. Safety-net nack with requeue: true
@nsweeting nsweeting changed the title fix: improve OTP shutdown behavior for consumers Improve OTP shutdown behavior for consumers Mar 25, 2026
Check Task.shutdown/2 return value in terminate/2. If it returns
{:ok, _}, the task finished within the grace period and already
acked/nacked inside its body - skip the redundant nack.
- Executer lifecycle: normal completion, crash error handling, timeout
- Executer terminate/2: shutdown with in-flight task, task-never-started
  edge case, skip nack when task completes before shutdown
- Child spec assertions: Executer shutdown/restart, Consumer.Server
  shutdown timeout of 30_000
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant