Improve OTP shutdown behavior for consumers by nsweeting · Pull Request #103 · nsweeting/rabbit

nsweeting · 2026-03-25T13:55:57Z

Summary

Improves the consumer shutdown sequence to follow proper OTP conventions. Previously, incorrect child spec types, missing consumption cancellation, and lack of graceful drain logic meant that a SIGTERM (e.g. k8s pod termination) would result in abrupt process kills rather than an orderly shutdown.

Problems Fixed

Consumer.Supervisor was typed as :worker with a 5s shutdown — the entire consumer subtree had only 5 seconds before being :killed. Now correctly typed as :supervisor with :infinity shutdown, allowing children to drain properly.
No consumption cancellation during shutdown — the broker continued delivering new messages while workers were shutting down. Now calls AMQP.Basic.cancel first.
Workers stopped sequentially — on an N-core machine with N workers, sequential DynamicSupervisor.stop calls could easily exceed the shutdown budget. Now stops all workers in parallel.
Executer had no terminate/2 — when shut down externally, the spawned message-processing process was killed with no opportunity to finish. Now uses Task.async + Task.shutdown/2 to give in-flight work a 5s grace period before escalating, and nacks with requeue: true as a safety net.
No explicit shutdown timeouts — all processes used OTP defaults (5s). Now uses a proper timeout chain: Consumer.Supervisor (:infinity) > Consumer.Server (30s) > Workers/Executers (25s).

Corrected Shutdown Sequence

SIGTERM
  -> Consumer.Supervisor gets :shutdown (timeout: :infinity)
    -> Each Consumer.Server gets :shutdown (timeout: 30s)
      -> terminate/2:
        1. AMQP.Basic.cancel — stop new message delivery
        2. Stop all Workers in parallel (25s timeout)
           -> Each Executer.terminate/2:
              a. Task.shutdown(task, 5_000) — grace period for in-flight work
              b. Safety-net nack with requeue: true if not completed
        3. AMQP.Channel.close — clean channel shutdown
  -> Producer.Pool, Topology.Server, Connection.Pool shut down after

Changes

lib/rabbit/broker/supervisor.ex — Consumer.Supervisor child spec gets type: :supervisor
lib/rabbit/consumer/supervisor.ex — Consumer.Server child specs get shutdown: 30_000
lib/rabbit/consumer/server.ex — Add cancel_consumer/1, parallel stop_workers/1
lib/rabbit/consumer/executer.ex — Refactor to Task.async, add terminate/2 with grace period and safety-net nack, shutdown: 25_000, completed state tracking
mix.exs — Version bump to 0.22.0

- Set Consumer.Supervisor child spec type to :supervisor so it gets shutdown: :infinity instead of being killed after 5 seconds - Set explicit shutdown: 30_000 on each Consumer.Server child spec - Cancel AMQP consumption (Basic.cancel) before stopping workers so no new messages arrive during drain - Stop workers in parallel instead of sequentially to fit within the shutdown budget - Add terminate/2 to Executer that nacks unfinished messages with requeue: true, preventing unacked messages from accumulating on quorum queues - Set Executer child_spec shutdown: 25_000 to give in-flight messages time to complete before the safety-net nack - Bump version to 0.22.0

Replace spawn_link with Task.async so terminate/2 can use Task.shutdown/2 to give in-flight messages a 5s grace period to complete before escalating to :kill and safety-net nacking. Previously the spawned process was killed immediately on shutdown with no chance to finish. Now the sequence is: 1. Task.shutdown(task, 5_000) - sends :shutdown, waits 5s 2. If task doesn't finish, brutally kills it 3. Safety-net nack with requeue: true

Check Task.shutdown/2 return value in terminate/2. If it returns {:ok, _}, the task finished within the grace period and already acked/nacked inside its body - skip the redundant nack.

- Executer lifecycle: normal completion, crash error handling, timeout - Executer terminate/2: shutdown with in-flight task, task-never-started edge case, skip nack when task completes before shutdown - Child spec assertions: Executer shutdown/restart, Consumer.Server shutdown timeout of 30_000

nsweeting added 2 commits March 25, 2026 09:27

nsweeting changed the title ~~fix: improve OTP shutdown behavior for consumers~~ Improve OTP shutdown behavior for consumers Mar 25, 2026

nsweeting added 2 commits March 25, 2026 10:00

fix: only safety-net nack when task did not complete

a4217e8

Check Task.shutdown/2 return value in terminate/2. If it returns {:ok, _}, the task finished within the grace period and already acked/nacked inside its body - skip the redundant nack.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve OTP shutdown behavior for consumers#103

Improve OTP shutdown behavior for consumers#103
nsweeting wants to merge 4 commits intomasterfrom
cleanup-shutdown

nsweeting commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nsweeting commented Mar 25, 2026

Summary

Problems Fixed

Corrected Shutdown Sequence

Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant