RFC: Add AWS Observability plugin #67

@theagenticguy

Is this related to an existing feature request or issue?

This proposal is based on the existing AWS Observability Kiro Power, adapted to the agent-plugins marketplace format.

Summary

This RFC proposes a new aws-observability plugin that provides a comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, and automated codebase observability gap analysis. The plugin integrates four MCP servers from AWS Labs and provides eight reference files covering incident response, log analysis, alerting setup, performance monitoring, security auditing, observability gap analysis, Application Signals enablement, and CloudTrail data source selection.

Use case

AI coding agents today lack integrated access to AWS observability tooling. When developers need to troubleshoot production incidents, analyze logs, monitor performance, audit security events, or assess codebase observability gaps, they must manually switch between multiple AWS consoles and tools.

Key use cases:

  • Incident response: Quickly triage production incidents by correlating alarms, logs, traces, metrics, and recent changes across CloudWatch, Application Signals, and CloudTrail
  • Log analysis: Query CloudWatch Logs using Logs Insights syntax with pattern detection, anomaly analysis, and multi-log-group support
  • Performance monitoring: Monitor microservices health via Application Signals APM with SLOs, distributed tracing, and service dependency maps
  • Security auditing: Investigate security incidents and perform compliance audits using CloudTrail with a prioritized data source strategy (Lake > CloudWatch Logs > Lookup Events API)
  • Alerting setup: Configure intelligent CloudWatch alarms using AWS best-practice recommendations with composite alarms and anomaly detection
  • Observability gap analysis: Audit codebases across Python, Java, JavaScript/TypeScript, Go, Ruby, and C#/.NET for missing logging, metrics, tracing, error handling, and health checks
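The prioritized CloudTrail data source strategy from the security auditing use case (Lake > CloudWatch Logs > Lookup Events API) can be sketched as a simple fallback chain. This is a minimal illustration, not the plugin's actual logic: the `CloudTrailSource` enum, the `select_cloudtrail_source` function, and its two availability flags are hypothetical names introduced here.

```python
from enum import Enum


class CloudTrailSource(Enum):
    """Data sources in the priority order stated in the RFC."""

    LAKE = "CloudTrail Lake"
    CLOUDWATCH_LOGS = "CloudWatch Logs"
    LOOKUP_EVENTS = "Lookup Events API"


def select_cloudtrail_source(
    lake_available: bool, cloudwatch_logs_available: bool
) -> CloudTrailSource:
    """Return the highest-priority source that is available.

    Both flags are hypothetical inputs; how availability is detected
    is left to the reference file and is not modeled here.
    """
    if lake_available:
        return CloudTrailSource.LAKE
    if cloudwatch_logs_available:
        return CloudTrailSource.CLOUDWATCH_LOGS
    # The Lookup Events API is treated as the always-available fallback.
    return CloudTrailSource.LOOKUP_EVENTS
```

For example, `select_cloudtrail_source(False, True)` returns `CloudTrailSource.CLOUDWATCH_LOGS`, since Lake is unavailable and CloudWatch Logs is the next choice.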

Proposal

Plugin structure

plugins/aws-observability/
├── .claude-plugin/
│   └── plugin.json            # Plugin manifest
├── .mcp.json                  # 4 MCP server definitions
└── skills/
    └── aws-observability/
        ├── SKILL.md           # Main skill (~155 lines, auto-triggers)
        └── references/
            ├── alerting-setup.md
            ├── application-signals-setup.md
            ├── cloudtrail-data-source-selection.md
            ├── incident-response.md
            ├── log-analysis.md
            ├── observability-gap-analysis.md
            ├── performance-monitoring.md
            └── security-auditing.md

MCP servers

Server                                             Type    Purpose
awslabs.cloudwatch-mcp-server                      stdio   CloudWatch Logs, Metrics, Alarms, log group analysis
awslabs.cloudwatch-applicationsignals-mcp-server   stdio   Application Signals APM, SLOs, distributed tracing
awslabs.cloudtrail-mcp-server                      stdio   CloudTrail security auditing, API activity tracking
awslabs.aws-documentation-mcp-server               stdio   Official AWS documentation search and access
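To make the server list concrete, here is a rough sketch that generates a plausible .mcp.json from the four package names above. Several details are assumptions rather than part of this RFC: the top-level `mcpServers` key, launching each server through `uvx` with an `@latest` suffix, and the `AWS_PROFILE` / `AWS_REGION` environment variable names. The `build_mcp_config` helper is hypothetical, and the env block is applied uniformly for simplicity even though not every server may need it.

```python
import json

# Package names are taken from the MCP servers table in this RFC.
MCP_SERVER_PACKAGES = [
    "awslabs.cloudwatch-mcp-server",
    "awslabs.cloudwatch-applicationsignals-mcp-server",
    "awslabs.cloudtrail-mcp-server",
    "awslabs.aws-documentation-mcp-server",
]


def build_mcp_config(profile: str = "default", region: str = "us-east-1") -> dict:
    """Build an .mcp.json-style dictionary (layout is an assumption)."""
    servers = {}
    for package in MCP_SERVER_PACKAGES:
        servers[package] = {
            "type": "stdio",
            "command": "uvx",
            # "@latest" asks uvx for the newest published release.
            "args": [f"{package}@latest"],
            # Assumed env var names; defaults mirror the RFC's stated defaults.
            "env": {"AWS_PROFILE": profile, "AWS_REGION": region},
        }
    return {"mcpServers": servers}


if __name__ == "__main__":
    print(json.dumps(build_mcp_config(), indent=2))
```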

Skill design

The SKILL.md follows progressive disclosure:

  • Initial load (~155 lines): Prerequisites, configuration, capability overview, reference file index with load conditions, quick start examples, essential log query patterns, and best practices
  • On-demand references (8 files): Loaded only when the agent needs deep domain knowledge for a specific workflow (e.g., incident response, security auditing)
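The load-on-demand idea can be illustrated with a small lookup from request text to reference files. In the plugin itself the agent would make this decision by reading the load conditions in SKILL.md; the sketch below only models that decision. The file names come from the plugin structure, while the trigger keywords and the `select_references` function are hypothetical.

```python
# Reference file names are from the RFC; the keywords are illustrative only.
REFERENCE_INDEX = {
    "incident-response.md": ("incident", "outage", "error rate"),
    "log-analysis.md": ("log", "logs insights"),
    "alerting-setup.md": ("alarm", "alert"),
    "performance-monitoring.md": ("latency", "performance", "slo"),
    "security-auditing.md": ("security", "audit", "iam"),
    "observability-gap-analysis.md": ("gap", "codebase"),
    "application-signals-setup.md": ("application signals", "enable apm"),
    "cloudtrail-data-source-selection.md": ("cloudtrail",),
}


def select_references(request: str) -> list[str]:
    """Return the reference files whose keywords appear in the request.

    Plain substring matching is deliberately naive (for example, "slo"
    would also match "slow"); it stands in for the agent's own judgment.
    """
    text = request.lower()
    return sorted(
        name
        for name, keywords in REFERENCE_INDEX.items()
        if any(keyword in text for keyword in keywords)
    )
```

With these keywords, a request such as "audit my CloudTrail for IAM changes" would select cloudtrail-data-source-selection.md and security-auditing.md, leaving the other six files unloaded.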

User experience

Before: Users must manually navigate the AWS Console, run CLI commands, and context-switch between CloudWatch, X-Ray, CloudTrail, and documentation.

After: Users describe their intent naturally (e.g., "investigate the high error rate on my API", "audit my CloudTrail for IAM changes", "check my codebase for observability gaps") and the agent auto-triggers the aws-observability skill, loads relevant references, and uses the MCP servers to execute the workflow.

Prerequisites

  • AWS CLI configured with credentials
  • Python 3.10+ and uv installed
  • Required IAM permissions: cloudwatch:*, logs:*, xray:*, cloudtrail:*, application-signals:*, synthetics:Get*, s3:GetObject, s3:ListBucket, iam:Get*
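As a starting point for the IAM prerequisite, the listed actions can be assembled into a policy document. This is a sketch under assumptions: a single Allow statement, `Resource` set to `*`, and a hypothetical `build_policy` helper with a made-up `Sid`. Only the action list itself comes from this RFC.

```python
import json

# Actions copied verbatim from the prerequisites list in this RFC.
REQUIRED_ACTIONS = [
    "cloudwatch:*",
    "logs:*",
    "xray:*",
    "cloudtrail:*",
    "application-signals:*",
    "synthetics:Get*",
    "s3:GetObject",
    "s3:ListBucket",
    "iam:Get*",
]


def build_policy(actions: list[str], resource: str = "*") -> dict:
    """Wrap the actions in one Allow statement (Resource "*" is an assumption)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AwsObservabilityPlugin",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(REQUIRED_ACTIONS), indent=2))
```

Because wildcards such as logs:* also cover write actions, teams that want to enforce the read-only scope described under Out of scope could narrow the action list before generating the policy.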

Out of scope

  • AWS resource provisioning or modification: This plugin is read-only for observability data; it does not create, modify, or delete AWS resources
  • Custom dashboard creation: The plugin queries data but does not create CloudWatch Dashboards or other persistent UI artifacts
  • Automated remediation: The plugin identifies issues and provides recommendations but does not automatically fix them
  • Non-AWS observability platforms: Integration with Datadog, Splunk, Grafana, or other third-party monitoring tools
  • Cost Explorer integration: While referenced in some workflows, Cost Explorer MCP server integration is not included in this initial version

Potential challenges

  • IAM permissions breadth: The plugin requires broad permissions across CloudWatch, X-Ray, CloudTrail, Application Signals, and S3. Users with restricted IAM policies may encounter partial functionality. Mitigation: Clear prerequisites documentation and graceful error handling guidance in reference files.
  • Reference file size: Some reference files (security-auditing.md, performance-monitoring.md, incident-response.md) exceed the 100-line guideline from DESIGN_GUIDELINES.md due to the breadth of query patterns and workflows. Mitigation: Content is organized with clear headings for selective loading; the SKILL.md itself stays well under 300 lines.
  • MCP server availability: All four MCP servers are published on PyPI as uvx-installable packages. If any server introduces breaking changes, the plugin may need updates. Mitigation: Referencing each package with the @latest specifier, so new releases are picked up automatically.
  • Region and profile configuration: The default configuration uses the default AWS profile and the us-east-1 region. Users must manually update the env vars in .mcp.json to use a different profile or region. Mitigation: The configuration section in SKILL.md provides clear instructions.
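The manual profile and region update could also be scripted. The sketch below patches an already-loaded configuration dictionary. It assumes the file uses a top-level `mcpServers` key and the `AWS_PROFILE` / `AWS_REGION` variable names, neither of which is confirmed by this RFC, and `with_profile_and_region` is a hypothetical helper.

```python
import copy


def with_profile_and_region(config: dict, profile: str, region: str) -> dict:
    """Return a copy of the config with every server's env overridden.

    Assumes a top-level "mcpServers" mapping; the input is left unmodified.
    """
    updated = copy.deepcopy(config)
    for server in updated.get("mcpServers", {}).values():
        env = server.setdefault("env", {})
        env["AWS_PROFILE"] = profile  # assumed variable name
        env["AWS_REGION"] = region  # assumed variable name
    return updated
```

A caller would typically load .mcp.json with `json.load`, pass the result through this function, and write it back with `json.dump`.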

Dependencies and Integrations

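Dependencies (all from AWS Labs):

  • awslabs.cloudwatch-mcp-server
  • awslabs.cloudwatch-applicationsignals-mcp-server
  • awslabs.cloudtrail-mcp-server
  • awslabs.aws-documentation-mcp-server
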
Integration with existing plugins:

  • Complements deploy-on-aws by providing post-deployment monitoring and troubleshooting capabilities
  • The CloudTrail security auditing capability pairs well with infrastructure changes made via the deploy plugin

Alternative solutions

  1. Individual MCP server setup without a plugin: Users could manually configure each MCP server and write their own prompts. The plugin adds value through curated skill descriptions, progressive-disclosure reference files, and pre-built workflow patterns that guide the agent through complex multi-tool observability tasks.

  2. Separate plugins per capability: Could split into aws-cloudwatch, aws-application-signals, aws-cloudtrail plugins. However, observability workflows frequently span multiple tools (e.g., incident response correlates alarms + logs + traces + CloudTrail changes), making a unified plugin more effective.

