Improve AWS S3 source error logging and diagnostics #24959

Draft
saliagadotcom wants to merge 8 commits into vectordotdev:master from saliagadotcom:shend/improve-logging

Conversation

@saliagadotcom

Summary

The AWS S3 source has opaque error logging across several failure modes. Users see generic messages like "Failed to process SQS message: service error" with no AWS error codes, no actionable guidance, and incorrect error type classification (PARSER_FAILED for network/auth errors). The retry backoff loop logs at trace! level, making persistent failures invisible in production.

Changes

  • Shared AWS error extraction (src/aws/error.rs — new file): Reusable extract_error_context() and classify_error() utilities that pull AWS error codes, HTTP status, request IDs, and dispatch failure kind from any SdkError. Classifies errors into Auth/NotFound/Throttling/Connectivity/Configuration/ServiceError.
  • S3 GetObject error classification & non-retryable message deletion (sqs.rs, aws_sqs.rs): New S3ObjectGetFailed internal event with actionable messages per error kind (NoSuchKey, AccessDenied, NoSuchBucket, etc.). Non-retryable errors now delete the SQS message when delete_failed_message = true, stopping infinite retry loops for deleted/inaccessible objects. S3ObjectProcessingFailed promoted from debug! to warn! with key and error fields added.
  • Startup diagnostic logging (mod.rs): Log client_concurrency, compression, multiline, delete_failed_message, and acknowledgements at startup alongside existing fields. Improved Snafu display strings for ConfigMissing and InvalidNumberOfMessages.
  • WrongRegion error enrichment (sqs.rs): Now includes both the event region and the configured region with remediation guidance.
  • Deserialization error context (sqs.rs): S3 object bucket/key added to deserialize error warnings. Detects when all frames fail deserialization (0 events produced) and emits a dedicated warning.
  • SQS operational logging (sqs.rs): Backoff retry log promoted from trace! to warn! with queue_url. Poison message detection via ApproximateReceiveCount > 5. S3 test event promoted from debug! to info!. SNS envelope unwrapping logged at debug!.
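
For illustration only, the classification and poison-message ideas above could be sketched as follows. This is not the code in src/aws/error.rs: the `ErrorClass` enum, the `classify` and `is_poison` functions, and the particular mapping of AWS error codes to classes are all assumptions made for the example. The class names follow the list in the first bullet, and the threshold of 5 follows the last bullet.

```rust
use std::collections::HashMap;

/// Hypothetical error classes, named after the categories in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    Auth,
    NotFound,
    Throttling,
    Connectivity,
    Configuration,
    ServiceError,
}

/// Classify a failure from its AWS error code and HTTP status (assumed inputs).
/// Both are `None` when the request never produced a service response,
/// for example on a dispatch or timeout failure.
pub fn classify(code: Option<&str>, http_status: Option<u16>) -> ErrorClass {
    match (code, http_status) {
        // Known error codes take precedence over the bare status code.
        (Some("AccessDenied" | "InvalidAccessKeyId" | "SignatureDoesNotMatch" | "ExpiredToken"), _) => {
            ErrorClass::Auth
        }
        (Some("NoSuchKey" | "NoSuchBucket"), _) => ErrorClass::NotFound,
        (Some("SlowDown" | "Throttling" | "ThrottlingException"), _) => ErrorClass::Throttling,
        (Some("PermanentRedirect" | "AuthorizationHeaderMalformed"), _) => ErrorClass::Configuration,
        // No response at all: treat as a connectivity problem.
        (None, None) => ErrorClass::Connectivity,
        // Fall back to the HTTP status when the code is unrecognized.
        (_, Some(401 | 403)) => ErrorClass::Auth,
        (_, Some(404)) => ErrorClass::NotFound,
        (_, Some(429)) => ErrorClass::Throttling,
        _ => ErrorClass::ServiceError,
    }
}

/// Flag a message as likely poison when its ApproximateReceiveCount
/// attribute exceeds `threshold`. Missing or unparsable values are not flagged.
pub fn is_poison(attributes: &HashMap<String, String>, threshold: u32) -> bool {
    attributes
        .get("ApproximateReceiveCount")
        .and_then(|v| v.parse::<u32>().ok())
        .map_or(false, |count| count > threshold)
}
```

In this sketch an unrecognized code such as a new 4xx error falls through to `ServiceError` rather than to a terminal class, so a caller can decide separately whether that class is retryable.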

Vector configuration

Standard AWS S3 source configuration with SQS-based notifications:

[sources.s3]
type = "aws_s3"
region = "us-east-1"
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789/my-queue"
sqs.delete_failed_message = true
compression = "gzip"
acknowledgements.enabled = true

How did you test this PR?

  • make check-clippy
  • make check-fmt
  • make test

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

@github-actions github-actions bot added the "domain: sources" label (Anything related to the Vector's sources) Mar 18, 2026
@github-actions
Contributor


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@pront
Member

pront commented Mar 18, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25b6c6236a

error_kind,
actionable_message: &actionable_message,
});
self.state.delete_failed_message && !class.is_retryable()

P1: Avoid deleting S3 messages on unclassified client errors

This branch deletes failed S3 notifications whenever delete_failed_message is enabled and class.is_retryable() is false, but the new classifier maps all non-whitelisted 4xx responses to RequestError (and other unmatched cases to Unknown), both treated as non-retryable. That means transient or newly introduced AWS 4xx error codes can now cause immediate SQS message deletion instead of retry, leading to permanent data loss for affected objects. Restrict auto-delete to explicitly terminal errors (for example AccessDenied/NoSuchKey/NoSuchBucket) and keep RequestError/Unknown retryable.
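
As a rough sketch of the suggested restriction, deletion could be gated on an explicit allowlist of terminal errors rather than on "not retryable". The `GetObjectFailure` enum and the `is_terminal` and `should_delete` names are hypothetical and do not come from the PR. Only the variant names mirror the error kinds mentioned in this comment.

```rust
/// Hypothetical failure kinds for an S3 GetObject call.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GetObjectFailure {
    AccessDenied,
    NoSuchKey,
    NoSuchBucket,
    /// Any other 4xx response that is not explicitly recognized.
    RequestError,
    /// Anything that could not be classified at all.
    Unknown,
}

impl GetObjectFailure {
    /// Only errors on the explicit allowlist are considered terminal.
    /// `RequestError` and `Unknown` stay retryable by default.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            Self::AccessDenied | Self::NoSuchKey | Self::NoSuchBucket
        )
    }
}

/// Decide whether the SQS message for a failed object should be deleted.
pub fn should_delete(delete_failed_message: bool, failure: GetObjectFailure) -> bool {
    delete_failed_message && failure.is_terminal()
}
```

With this shape, a newly introduced or transient 4xx code lands in `RequestError` and is left on the queue for retry instead of being deleted.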

Author

So a message should only be deleted in two scenarios: when it is successfully processed, and when the sink rejects the batch and delete_failed_message is enabled. I will revert this.

@github-actions github-actions bot added the "domain: external docs" label (Anything related to Vector's external, public documentation) Mar 18, 2026