fix(topology): Fix stalling on shutdown for sinks configured with disk buffers (#24949)
Conversation
- This fixes a bug where Vector waited until `batch.timeout_secs` to complete a configuration reload, which can cause errors if this value is large, as buffers will start to overflow at the source.
- Fixes: vectordotdev#24948
- This fixes an issue observed after a shutdown signal (ctrl-c) is sent and Vector must ungracefully shut down after its 60s grace period.
- It was discovered that the issue was the S3 sink waiting until `batch.timeout_secs` to flush its buffer; only once the flush completes can shutdown proceed. Naturally, if this value is larger than the 60s graceful shutdown timeout, the timeout expires first.
- The solution is to send the shutdown trigger to the sinks first so that they flush their partial batches.
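The idea can be sketched with plain std channels as stand-ins (these are not Vector's actual types; the trigger and timeout names here are illustrative): a sink batches events and, without a shutdown trigger, would only flush when the batch timeout expires or the sender is dropped (EOF). The trigger forces an immediate flush instead.

```rust
// Minimal sketch: a sink that batches events from a channel and flushes
// its partial batch as soon as a shutdown trigger fires, instead of
// stalling until `batch.timeout_secs` or sender-drop EOF.
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

fn flush_on_shutdown() -> usize {
    let (event_tx, event_rx) = channel::<&'static str>();
    let (shutdown_tx, shutdown_rx) = channel::<()>();

    let sink = thread::spawn(move || {
        let mut batch = Vec::new();
        loop {
            // Shutdown trigger: flush the partial batch right away rather
            // than waiting for the batch timeout to expire.
            if shutdown_rx.try_recv().is_ok() {
                return batch.len();
            }
            match event_rx.recv_timeout(Duration::from_millis(10)) {
                Ok(event) => batch.push(event),
                Err(RecvTimeoutError::Timeout) => continue, // batch not full yet
                Err(RecvTimeoutError::Disconnected) => return batch.len(), // EOF
            }
        }
    });

    event_tx.send("a").unwrap();
    event_tx.send("b").unwrap();
    thread::sleep(Duration::from_millis(100));
    shutdown_tx.send(()).unwrap(); // fire the trigger instead of waiting
    sink.join().unwrap()
}

fn main() {
    println!("flushed {} events", flush_on_shutdown());
}
```

The sketch flushes both queued events within milliseconds of the trigger, where the pre-fix behavior would block shutdown for the full batch timeout.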
@codex review
```rust
// Signal all sinks to flush their partial batches and stop accepting
// input. Without this, sinks rely on the natural stream EOF which only
```
The bit about "stop accepting input" makes me more than a little nervous. Could this cause problems if upstream sources send more events?
In my manual testing before this change, Vector would stall on shutdown. With this change it takes a few seconds, but I believe that is the time it takes to complete a round-trip request of the payload to S3 (it was a little less than 100 MB).
I believe data will still be in the channel, orphaned. This data will be lost on shutdown; however, that already occurs if the 60s timeout expires. Thoughts?
In theory the best option would be to stop accepting input, drain all channels, wait until disk flushes have completed, and then proceed with the rest of shutdown.
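That ordering can be sketched with std channels as stand-ins (hypothetical, not Vector's shutdown code): stop accepting input, drain every queued event, then continue shutting down.

```rust
// Sketch of the ideal shutdown ordering: stop input, drain, then proceed.
use std::sync::mpsc::channel;

fn drain_then_shutdown() -> usize {
    let (tx, rx) = channel::<u32>();
    for i in 0..5 {
        tx.send(i).unwrap(); // events already queued at shutdown time
    }
    // 1. Stop accepting input: drop the sender so the stream reaches EOF.
    drop(tx);
    // 2. Drain all channels: nothing is left orphaned in the channel.
    let drained: Vec<u32> = rx.iter().collect();
    // 3. (Await disk flushes here, then continue with the rest of shutdown.)
    drained.len()
}

fn main() {
    println!("drained {} events", drain_then_shutdown());
}
```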
I'm considering dropping this commit as it's not really related to the originally reported issue anyway.
> I'm considering dropping this commit
I agree with this. We should keep this fix as short as possible.
Hmm, unfortunately this integration failed. This fix raises some concerns, which I believe are also captured by the Codex session below.

Codex session:

For issue #24948, I'd aim for a narrower fix than PR #24949. The issue context is: on reload, a changed sink with a disk buffer gets its fanout input paused, then Vector waits for that sink to shut down. Because the buffer sender is still alive, the sink never sees EOF, so it only flushes its partial batch when `batch.timeout_secs` expires. That's why the pipeline stalls for one batch timeout.

The PR fixes the stall by canceling the sink input stream, but that changes behavior to potentially drop queued events. A better fix is: during reload of changed sinks, stop using Pause for sinks we are about to wait on. So my concrete recommendation would be:

- Short-term safe fix: don't reuse disk-buffer channels across reload for changed sinks.
- Longer-term better design: add an explicit sink lifecycle signal like "flush_partial_batch_now", separate from "stop accepting input".

Why I prefer that over PR #24949: it addresses the actual stall described in #24948.
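The "don't reuse disk-buffer channels across reload" idea can be sketched with std channels as stand-ins (not Vector's buffer types; names here are illustrative): the reload builds a fresh channel for the new sink and drops the old sender, so the old sink sees EOF immediately and flushes without waiting on any timeout.

```rust
// Sketch: reload creates a new channel for the new sink rather than
// reusing the old one; dropping the old sender gives the old sink EOF.
use std::sync::mpsc::channel;
use std::thread;

fn old_sink_flush_count() -> usize {
    let (old_tx, old_rx) = channel::<&'static str>();
    old_tx.send("queued").unwrap(); // partial batch sitting in the old buffer

    let old_sink = thread::spawn(move || {
        // Drains until EOF: this ends as soon as every sender is dropped.
        old_rx.iter().count()
    });

    // Reload: the new sink gets a brand-new channel instead of the old one.
    let (new_tx, new_rx) = channel::<&'static str>();
    drop(old_tx); // old sink now sees EOF and flushes right away

    new_tx.send("routed to the new sink").unwrap();
    assert_eq!(new_rx.iter().next(), Some("routed to the new sink"));

    old_sink.join().unwrap()
}

fn main() {
    println!("old sink flushed {} event(s)", old_sink_flush_count());
}
```

Here the old sink drains its queued event naturally instead of being canceled, which is the draining behavior the review wants to preserve.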
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5630a0fd
```rust
for (_, trigger) in self.detach_triggers {
    trigger.into_inner().cancel();
```
Preserve graceful draining during global shutdown
Canceling every sink `detach_trigger` at the top of `RunningTopology::stop` forces all sink input streams (`take_until_if(tripwire)`) to end before `shutdown_coordinator.shutdown_all(...)` starts upstream shutdown, so sinks can terminate while events are still buffered or in flight. That changes graceful shutdown semantics into a lossy path for any pipeline with pending events, because those events are never drained by the sink.
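The concern can be demonstrated with std stand-ins (not Vector's actual stream code): a cancel flag that ends the sink's input loop early, in the spirit of `take_until_if(tripwire)`, abandons whatever is still queued.

```rust
// Sketch: a tripwire that fires before draining causes the input loop to
// end immediately, so events still buffered in the channel are lost.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::channel;

fn events_lost_on_cancel() -> usize {
    let (tx, rx) = channel::<u32>();
    for i in 0..5 {
        tx.send(i).unwrap(); // events still buffered at shutdown time
    }
    let tripwire = AtomicBool::new(false);
    tripwire.store(true, Ordering::SeqCst); // trigger fired before draining

    let mut processed = 0;
    for _event in rx.try_iter() {
        if tripwire.load(Ordering::SeqCst) {
            break; // input stream ends here; the rest is never drained
        }
        processed += 1;
    }
    5 - processed // queued events that are now lost
}

fn main() {
    println!("{} events lost", events_lost_on_cancel());
}
```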
```rust
if let Some(trigger) = self.detach_triggers.remove(key) {
    trigger.into_inner().cancel();
```
Let changed sinks drain instead of canceling their input
In `RunningTopology::shutdown_diff`, this new branch cancels the old sink trigger for `wait_for_sinks` before waiting for task completion, which makes the sink stop via `take_until_if(tripwire)` rather than EOF from drained inputs. For changed sinks whose buffers are not being reused, queued events in the old buffer are abandoned when the old component is torn down, so reloads under resource conflict can now drop data instead of draining naturally.
Summary
This PR includes two fixes for when sinks configured with buffers are shut down.

The first issue is observed when reloading a config that contains a sink with a disk buffer. Vector will wait until `batch.timeout_secs` has completed, which causes buffers to flush and unblocks the reload process. The fix for this is to send the `cancel()` signal to the sink, so it doesn't block on its buffer not being flushed downstream.

The second fix is for an issue with the same root cause. I noticed the same hang on ctrl-c with the `aws_s3` sink. I employed the same solution there in the `stop()` method: call the `cancel()` signal early in `stop()`.

Vector configuration
How did you test this PR?
By using the config above and a simple HTTP traffic generator. Modifying the sink buffer and re-saving the file, looking for errors from the source.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
If not, add the `no-changelog` label to this PR.

References
- Datadog and Prometheus and S3 sink hangs forever #17666
- Mention `@vectordotdev/vector` to reach out to us regarding this PR.
- `pre-push` hook, please see this template.
- `make fmt`
- `make check-clippy` (if there are failures, it's possible some of them can be fixed with `make clippy-fix`)
- `make test`
- `git merge origin master` and `git push`.
- If this PR changes dependencies (`Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details here.