
fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers#24949

Open
graphcareful wants to merge 2 commits intovectordotdev:masterfrom
graphcareful:rob/fix-disk-buffer-reload-bug

Conversation

@graphcareful (Contributor) commented Mar 17, 2026

Summary

This PR includes two fixes for how sinks configured with buffers are shut down. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector waits until batch.timeout_secs has elapsed, at which point the buffer flushes and the reload process is unblocked. The fix is to send the cancel() signal to the sink, so it does not block on its buffer being flushed downstream.

The second fix is for an issue with the same root cause. I noticed the same hang on Ctrl-C with the aws_s3 sink, and employed the same solution there: calling the cancel() signal early in the stop() method.

Vector configuration

sources:
  http-server-source:
    type: "http_server"
    address: "127.0.0.1:8001"

sinks:
  amazon-s3-sink:
    inputs:
      - "http-server-source"
    type: "aws_s3"
    bucket: e2e-tests
    region: us-east-1
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 30000000000
      when_full: block
    batch:
      timeout_secs: 500
      max_bytes: 100000000

How did you test this PR?

By using the config above and a simple HTTP traffic generator: modifying the sink buffer settings, re-saving the file, and watching for errors from the source.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector's dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

- This fixes a bug where Vector waited until `batch.timeout_secs` to
complete a configuration reload, which can cause errors if this value is
large, as buffers will start to overflow at the source.

- Fixes: vectordotdev#24948
- This fixes an issue observed after the shutdown signal (ctrl-c) is sent
and Vector must ungracefully shut down after its 60s grace period.
- It was discovered that the issue was the s3 sink waiting until
`batch.timeout_secs` to flush its buffer, and only once that completes
can shutdown proceed. Naturally, if this value is high, the 60s graceful
shutdown timeout will expire first.
- The solution is to send the shutdown trigger to the sinks first so
that they flush their partial batches.
@graphcareful graphcareful requested a review from a team as a code owner March 17, 2026 19:05
@graphcareful added the labels domain: topology, domain: config, and domain: buffers on Mar 17, 2026
@pront (Member) commented Mar 17, 2026

@codex review

@bruceg added the type: bug label on Mar 17, 2026
Comment on lines +152 to +153
// Signal all sinks to flush their partial batches and stop accepting
// input. Without this, sinks rely on the natural stream EOF which only
Member commented:
The bit about "stop accepting input" makes me more than a little nervous. Could this cause problems if upstream sources send more events?

graphcareful (Author) replied:

In my manual testing before this change, Vector would stall on shutdown. With this change it takes a few seconds, but I believe that is the time it takes to complete a round-trip request of the payload to S3 (it was a little less than 100MB).

I believe data will still be in the channel, orphaned. This data will be lost on shutdown; however, this already occurs if the 60s timeout expires. Thoughts?

graphcareful (Author) replied:

In theory the best option would be to stop accepting input, drain all channels, await until disk flushes have completed, and then proceed with the rest of shutdown.

graphcareful (Author) replied:

I'm considering dropping this commit as it's not really related to the originally reported issue anyway.

Member replied:

I'm considering dropping this commit

I agree with this. We should keep this fix as short as possible.

@pront (Member) commented Mar 18, 2026

@codex review

Hmm, unfortunately this integration failed. This fix raises some concerns which I believe are also captured by the session below.

Codex Session

For issue #24948, I’d aim for a narrower fix than PR 24949.

The issue context is: on reload, a changed sink with a disk buffer gets its fanout input Paused, then Vector waits for that sink to shut down. Because the buffer sender is still alive, the sink never sees EOF, so it only flushes its partial batch when batch.timeout_secs expires. That’s why the pipeline stalls for one batch timeout. The PR fixes the stall by canceling the sink input stream, but that changes behavior to potentially drop queued events.

A better fix is:

  • During reload of changed sinks, stop using Pause for sinks we are about to wait on. Instead, fully disconnect the old sink from fanout so it sees a real EOF and flushes immediately.
  • Route new traffic to the replacement sink as soon as it is ready, instead of holding the whole pipeline behind the old sink’s batch timeout.
  • Only wait synchronously when there is a true resource conflict.
  • For the disk-buffer reuse case, prefer correctness over channel reuse: either don’t reuse the live (tx, rx) pair for disk-buffered sinks on reload, or add a real “flush current batch” control path that does not sever input.

In plain terms: the old sink should drain and exit naturally; the new sink should start taking new events; and we should stop trying to make “reuse the exact same live buffer channel” carry the whole reload design.

So my concrete recommendation would be:

Short-term safe fix

  • Don’t reuse disk-buffer channels across reload for changed sinks.
  • Disconnect the old sink with EOF semantics, let it flush and finish, and start the new sink with a fresh buffer.

This should fix #24948 without introducing loss.

Longer-term better design

  • Add an explicit sink lifecycle signal like “flush_partial_batch_now”, separate from “stop accepting input”.
  • Then reload/shutdown can flush partial batches without using take_until_if(tripwire) as a blunt instrument.

Why I prefer that over PR 24949:

  • It addresses the actual stall described in #24948.
  • It preserves graceful semantics.
  • It avoids turning reload/shutdown into a lossy path.

If you want, I can sketch the exact shutdown_diff / remove_inputs changes I’d make in src/topology/running.rs.
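The short-term recommendation above — disconnect the old sink with EOF semantics and start the replacement on a fresh buffer — can be illustrated with another toy sketch. The spawn_sink helper and one-channel-per-sink layout are illustrative assumptions for this example, not Vector's real reload code:

```rust
use std::sync::mpsc;
use std::thread;

// Toy sink: drains its channel until EOF, returning everything it saw.
// recv() only returns Err once every sender is dropped, so no queued
// event is abandoned.
fn spawn_sink(rx: mpsc::Receiver<u32>) -> thread::JoinHandle<Vec<u32>> {
    thread::spawn(move || {
        let mut drained = Vec::new();
        while let Ok(event) = rx.recv() {
            drained.push(event);
        }
        drained
    })
}

fn main() {
    // Old sink with events already queued in its buffer.
    let (old_tx, old_rx) = mpsc::channel();
    let old_sink = spawn_sink(old_rx);
    old_tx.send(1).unwrap();
    old_tx.send(2).unwrap();

    // "Reload": disconnect the old sink (EOF semantics) and route new
    // traffic to a replacement sink on a fresh channel.
    drop(old_tx);
    let (new_tx, new_rx) = mpsc::channel();
    let new_sink = spawn_sink(new_rx);
    new_tx.send(3).unwrap();
    drop(new_tx);

    // The old sink drained everything it had queued; nothing was lost,
    // and the new sink picked up the new traffic.
    assert_eq!(old_sink.join().unwrap(), vec![1, 2]);
    assert_eq!(new_sink.join().unwrap(), vec![3]);
}
```

This is the non-lossy shape the review is asking for: the old sink exits because its input genuinely ended, not because its stream was severed mid-flight.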

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb5630a0fd


Comment on lines +156 to +157
for (_, trigger) in self.detach_triggers {
trigger.into_inner().cancel();


P1: Preserve graceful draining during global shutdown

Canceling every sink detach_trigger at the top of RunningTopology::stop forces all sink input streams (take_until_if(tripwire)) to end before shutdown_coordinator.shutdown_all(...) starts upstream shutdown, so sinks can terminate while events are still buffered or in flight. That changes graceful shutdown semantics into a lossy path for any pipeline with pending events, because those events are never drained by the sink.


Comment on lines +666 to +667
if let Some(trigger) = self.detach_triggers.remove(key) {
trigger.into_inner().cancel();


P1: Let changed sinks drain instead of canceling their input

In RunningTopology::shutdown_diff, this new branch cancels the old sink trigger for wait_for_sinks before waiting for task completion, which makes the sink stop via take_until_if(tripwire) rather than EOF from drained inputs. For changed sinks whose buffers are not being reused, queued events in the old buffer are abandoned when the old component is torn down, so reloads under resource conflict can now drop data instead of draining naturally.



Labels

  • domain: buffers (Anything related to Vector's memory/disk buffers)
  • domain: config (Anything related to configuring Vector)
  • domain: topology (Anything related to Vector's topology code)
  • type: bug (A code related bug.)

Projects

None yet

3 participants