diff --git a/plugins/aws-serverless/skills/api-gateway/SKILL.md b/plugins/aws-serverless/skills/api-gateway/SKILL.md new file mode 100644 index 0000000..8de8b6a --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/SKILL.md @@ -0,0 +1,219 @@ +--- +name: api-gateway +description: > + Build, manage, govern, and operate APIs using Amazon API Gateway. + TRIGGER when: user asks about API Gateway configuration, architecture, + security, authentication, custom domains, deployments, throttling, + caching, CORS, VPC links, private APIs, mTLS, Lambda authorizers, + usage plans, monitoring, logging, or API governance. Also trigger when + troubleshooting API errors (400, 401, 403, 429, 500, 502, 504), + timeout issues, or CORS failures on API Gateway. Also trigger when + working with SAM, CDK, CloudFormation, or Terraform templates that + contain API Gateway resources (AWS::ApiGateway::*, AWS::ApiGatewayV2::*, + AWS::Serverless::Api, AWS::Serverless::HttpApi), or when designing + API architectures on AWS. Even if the user doesn't mention "API Gateway" + by name, trigger if they're clearly building an AWS API (e.g., "expose + my Lambda as a REST endpoint", "add auth to my AWS API", "my API + returns 502"). Do NOT trigger for general REST API design unrelated + to AWS, or for non-API-Gateway AWS services used independently. +metadata: + tags: [api-gateway, serverless, aws, rest-api, http-api, websocket] +--- + +# Amazon API Gateway Development + +Expert guidance for building, managing, governing, and operating APIs with Amazon API Gateway. Covers REST APIs (v1), HTTP APIs (v2), and WebSocket APIs. + +## How to Use This Skill + +When answering API Gateway questions: + +1. Read the relevant reference file(s) before responding; do not rely solely on this summary +2. For tasks spanning multiple concerns (e.g., "private API with mTLS and custom domain"), read all relevant references +3. 
When the user needs IaC templates, consult `references/sam-cloudformation.md` or `references/sam-service-integrations.md` and provide complete, working SAM/CloudFormation YAML +4. Always mention relevant pitfalls and limits that affect the user's design + +## Quick Decision: Which API Type? + +Choose the right API type first. This decision affects every downstream choice. + +**REST API** is the full-featured API management platform for enterprises. It provides the governance, security, monetization, and operational controls that organizations need to build, publish, and manage APIs at scale, including usage plans with per-consumer throttling and quotas, API keys, request validation, WAF integration, resource policies, caching, canary deployments, and private endpoints. + +**HTTP API** is the lightweight, low-cost proxy optimized for simpler API workloads. It offers ~70% lower cost and lower latency but trades away the API management features. Choose HTTP API when you need a fast, lightweight proxy to Lambda or HTTP backends and don't require the enterprise controls above. 
+ +| Factor | REST API (v1) | HTTP API (v2) | WebSocket API | +| ------------------------- | -------------------------------------- | ---------------------------------------------- | ------------------------------ | +| **Positioning** | **Full API management** | **Low-cost proxy** | **Real-time bidirectional** | +| Cost | Higher | ~70% cheaper | Per-message pricing | +| Latency | Higher | Lower | Persistent connection | +| Max timeout | 50ms-29s (up to 300s Regional/Private) | 30s hard limit | 29s | +| Payload | 10 MB | 10 MB | 128 KB message / 32 KB frame | +| **API Management** | | | | +| Usage plans/API keys | Yes | No | No | +| Request validation | Yes (JSON Schema draft 4) | No | No | +| Caching | Yes (0.5-237 GB) | No | No | +| Custom gateway responses | Yes | No | No | +| VTL mapping templates | Yes | No (parameter mapping only) | Yes | +| **Security & Governance** | | | | +| WAF | Yes | No (use CloudFront + WAF) | No | +| Resource policies | Yes | No | No | +| Private endpoints | Yes | No | No | +| mTLS | Yes (Regional custom domain only) | Yes (Regional custom domain only) | Via CloudFront viewer mTLS | +| **Auth** | | | | +| Lambda authorizer | Yes (TOKEN + REQUEST) | Yes (REQUEST only, simple + IAM policy format) | Yes (REQUEST on $connect only) | +| JWT authorizer | No (use Cognito authorizer) | Yes (native) | No | +| Cognito authorizer | Yes (native) | Use JWT authorizer | No | +| **Operations** | | | | +| Canary deployments | Yes | No | No | +| Response streaming | Yes | No | No | +| X-Ray tracing | Yes | No | No | +| Execution logging | Yes | No | Yes | +| Custom domain sharing | Not with WebSocket | Not with WebSocket | Not with REST/HTTP | + +**Use REST API when**: you are building APIs for external consumers, partners, or multi-tenant platforms; need to enforce per-consumer rate limits and quotas; require request validation, caching, or WAF at the API layer; need private endpoints, resource policies, or canary deployments; or are building an API 
product with monetization and governance requirements. + +**Use HTTP API when**: you are building lightweight APIs or simple backend proxies; cost and latency are the primary concerns; you don't need per-consumer throttling, request validation, caching, or WAF at the API layer; and native JWT authorization with OIDC/OAuth 2.0 meets your auth needs. Accept the hard 30s timeout and lack of API management features. For WAF, edge caching, or edge compute, place a CloudFront distribution in front of the HTTP API. + +**Use WebSocket API when you need**: persistent bidirectional connections for real-time use cases (chat, notifications, live dashboards). + +## Instructions + +### Step 1: Design the API + +Before implementation, gather requirements systematically. Consult `references/requirements-gathering.md` for the full requirements workflow covering endpoints, auth, data models, performance, security, and deployment needs. + +Key design decisions: + +1. **API type**: Use the decision table above +2. **Endpoint type**: Edge-optimized (default for global clients; optimizes TCP connections via CloudFront POPs but does not cache at the edge), Regional (same-region clients, or global clients needing their own CloudFront distribution for edge caching, edge compute, granular WAF control, or geo-based routing), Private (VPC-only access, REST API only) +3. **Topology**: Centralized (single domain, path-based routing) vs Distributed (subdomains per service) +4. 
**Authentication**: See `references/authentication.md` for the decision tree + +### Step 2: Implement the API + +Consult these references based on what you're building: + +- **Architecture patterns**: `references/architecture-patterns.md`: topology, multi-tenant SaaS, hybrid workloads, private APIs, multi-region, streaming +- **WebSocket API**: `references/websocket.md`: route selection, @connections management, session management, client resilience, SAM templates, limits, multi-region +- **Service integrations**: `references/service-integrations.md`: direct AWS service integrations (EventBridge, SQS, SNS, DynamoDB, Kinesis, Step Functions, S3), HTTP proxy, mock, VTL mapping templates, binary media types, Lambda sync/async invocation +- **Custom domains and routing**: `references/custom-domains-routing.md`: base path mappings, routing rules, header-based versioning +- **Security**: `references/security.md`: mTLS (API Gateway native + CloudFront viewer mTLS), TLS policies, resource policies, WAF, HttpOnly cookies, CRL checks +- **SAM/CloudFormation**: `references/sam-cloudformation.md`: IaC patterns, OpenAPI extensions, VTL reference, binary data +- **SAM service integration templates**: `references/sam-service-integrations.md`: EventBridge, SQS, DynamoDB CRUD, Kinesis, Step Functions (REST + WebSocket) templates + +### Step 3: Configure Performance and Scaling + +- **Throttling**: Account-level default is 10,000 rps / 5,000 burst (adjustable; request increases via AWS Support). Configure stage-level and method-level throttling via usage plans. See `references/performance-scaling.md` +- **Caching** (REST only): Default TTL 300s, max 3600s. Only GET methods cached by default. Max cached response 1 MB +- **Edge caching** (all API types): For edge caching, place a self-managed CloudFront distribution in front of a Regional API. CloudFront reduces latency, backend load, AND cost (cached responses never reach API Gateway). 
Also enables edge compute (CloudFront Functions, Lambda@Edge) and granular cache behaviors per path. Use a Regional endpoint, not edge-optimized, when pairing with your own CloudFront distribution +- **Scaling**: API Gateway scales automatically but plan the entire stack (Lambda concurrency, DynamoDB capacity) + +### Step 4: Set Up Observability + +Always configure access logging. For REST and WebSocket APIs, also enable execution logging (ERROR level for production, INFO only for debugging). **HTTP API does not support execution logging**; use access logs with enhanced observability variables instead. + +Consult `references/observability.md` for: + +- Recommended access log formats (separate formats for REST, HTTP API, and WebSocket) +- Enhanced observability variables for phase-level troubleshooting (REST API: WAF -> Authenticate -> Authorizer -> Authorize -> Integration) +- CloudWatch alarms to configure for production +- Log retention policies (CloudWatch Logs default to Never Expire) +- Logging setup prerequisites (different for REST/WebSocket vs HTTP API) + +### Step 5: Deploy + +- Use Infrastructure as Code (SAM, CDK, CloudFormation, Terraform) for production +- **Canary deployments** (REST only): Route a percentage of traffic to test new versions +- **Blue/green deployments**: Use custom domain API mappings to switch between environments with zero downtime +- **Routing rules** (preferred for new domains): Declarative header/path-based routing on custom domains for versioning, A/B testing, gradual rollouts, and cell-based routing +- See `references/deployment.md` for detailed patterns + +### Step 6: Apply Governance + +For organization-wide API standards, see `references/governance.md` covering: + +- Preventative controls (SCPs, IAM policies) +- Proactive controls (CloudFormation Hooks, Guard rules) +- Detective controls (AWS Config rules, EventBridge) +- Specific enforcement examples for security, observability, and management + +## Response Format + +When 
responding to API Gateway questions, structure your answer as: + +1. **Recommendation**: Lead with the recommended approach and why +2. **Code**: Include SAM/CloudFormation YAML or code when the user needs implementation (always read the relevant reference file first) +3. **Pitfalls**: Warn about relevant gotchas from the pitfalls below or from `references/pitfalls.md` +4. **Limits**: Mention any service limits that constrain the design + +## Troubleshooting Quick Reference + +When diagnosing API Gateway errors, consult `references/troubleshooting.md` for detailed resolution steps. Here are the most common issues: + +| Error | Most Common Cause | Quick Fix | +| ---------------------- | ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | +| 400 Bad Request | Protocol mismatch (HTTP/HTTPS) with ALB | Match protocol to listener type | +| 401 Unauthorized | Wrong token type (ID vs access) or missing identity sources | Check token type matches scope config; verify all identity sources sent | +| 403 Missing Auth Token | Stage name in URL when using custom domain | Remove stage name from URL path | +| 403 from VPC | Private DNS on VPC endpoint intercepts ALL API calls | Use custom domain names for public APIs | +| 403 Access Denied | Resource policy + auth type mismatch or missing redeployment | Review policy, check auth type, redeploy API | +| 403 mTLS | Certificate issuer not in truststore or weak signature algorithm | Verify CA in truststore, use SHA-256+ | +| 429 Too Many Requests | Account/stage/method throttle limits exceeded | Implement jittered exponential backoff; request limit increase | +| 500 Internal Error | Missing Lambda invoke permission (especially with stage variables) | Add resource-based policy to Lambda function | +| 502 Bad Gateway | Lambda response not in required proxy format | 
Return `{statusCode, headers, body}` from Lambda | +| 504 Timeout | Backend exceeds 29s (REST, increasable) or 30s (HTTP, hard). HTTP API body says "Service Unavailable" but status is 504 | Optimize backend, request timeout increase (REST Regional/Private), or switch to async invocation | +| CORS errors | Missing CORS headers on Gateway Responses (4XX/5XX) | Add CORS headers to DEFAULT_4XX and DEFAULT_5XX gateway responses | +| SSL/PKIX errors | Incomplete certificate chain on backend | Provide full cert chain; use `insecureSkipVerification` only for testing | + +## Critical Pitfalls + +1. **REST API default timeout is 29 seconds** (increasable up to 300s for Regional/Private endpoints via quota request). Lambda continues running but client gets 504. Request a timeout increase, or consider async patterns (SQS, EventBridge) for better user experience on long operations +2. **HTTP API hard timeout is 30 seconds**. Returns `{"message":"Service Unavailable"}` while Lambda continues +3. **`/ping` and `/sping` are reserved paths**. Do not use for API resources +4. **Execution log events truncated at 1,024 bytes**. Use access logs for complete data +5. **413 `REQUEST_TOO_LARGE` is the only gateway response that cannot be customized**. Use DEFAULT_4XX as a catch-all to add CORS headers for all 4xx errors including 413 +6. **`maxItems`/`minItems` not validated** in REST API request validation +7. **Root-level `security` in OpenAPI is ignored**. Must set per-operation +8. **JWT authorizer public keys cached 2 hours**. Account for this in key rotation +9. **Management API rate limit: 10 rps / 40 burst**. Heavy automation can hit this +10. **Always redeploy REST API after configuration changes**. Changes don't take effect until deployed +11. **Edge-optimized endpoints do NOT cache at the edge** — they only optimize TCP connections via CloudFront POPs. 
If you need edge caching, edge compute (CloudFront Functions, Lambda@Edge), or granular CloudFront control, use a Regional API with your own CloudFront distribution instead + +For additional pitfalls (header handling, URL encoding, caching charges, canary deployments, usage plans), see `references/pitfalls.md`. + +## IaC Framework Selection + +Default: CDK TypeScript + +Override syntax: + +- "use SAM" → Generate SAM/CloudFormation YAML templates +- "use CloudFormation" → Generate CloudFormation YAML templates +- "use Terraform" → Generate Terraform HCL + +When not specified, ALWAYS use CDK TypeScript. + +## Error Scenarios + +### MCP Server Unavailable + +- Inform user: "AWS Serverless MCP not responding" +- Ask: "Proceed without MCP support?" +- DO NOT continue without user confirmation + +## Service Limits Quick Reference + +See `references/service-limits.md` for the complete table. **Most numeric quotas below are default values and adjustable**; check with your AWS account team and the [latest quotas page](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html) before using them for architectural decisions. 
Key limits: + +| Resource | REST API | HTTP API | WebSocket | +| ------------------------ | ---------------------------------------- | -------- | ------------------- | +| Payload size | 10 MB | 10 MB | 128 KB | +| Integration timeout | 50ms-29s (up to 300s Regional/Private) | 30s hard | 29s | +| APIs per region | 600 Regional/Private; 120 Edge-optimized | 600 | 600 | +| Stages per API | 10 | 10 | 10 | +| Routes/resources per API | 300 | 300 | 300 | +| Custom domains (public) | 120 | 120 | 120 | +| Account throttle | 10,000 rps / 5,000 burst | Same | Same (shared quota) | +| API keys per region | 10,000 | N/A | N/A | +| Usage plans per region | 300 | N/A | N/A | +| Cache sizes | 0.5 GB - 237 GB | N/A | N/A | diff --git a/plugins/aws-serverless/skills/api-gateway/references/architecture-patterns.md b/plugins/aws-serverless/skills/api-gateway/references/architecture-patterns.md new file mode 100644 index 0000000..77bb5cc --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/architecture-patterns.md @@ -0,0 +1,189 @@ +# Architecture Patterns + +## Topology Patterns + +**Three topology patterns:** + +1. **Single AWS account**: Simplest. All APIs in one account with routing rules or base path mappings +2. **Separate AWS accounts per domain/application**: Better isolation. Each account owns a subdomain (e.g., `orders.example.com`, `shipping.example.com`) and can contain multiple microservices behind it. No cross-account base path mappings, so subdomain-per-account is the routing mechanism +3. **Central API account**: Central account owns the custom domain and routes to backend APIs in other accounts. 
Centralized governance, throttling, metering, and observability + +### API Gateway as Single Entry Point + +- Applies to both the single AWS account and central API account scenarios +- Map different custom domain subdomains to the same API with routing rules or different base path mappings +- Route different paths (`/service1`, `/service2`, `/docs`) to different backends + +**Endpoint type selection**: + +- **Regional** (default): API deployed in a single region. Best when clients are in the same region or when using your own CloudFront distribution for edge caching/WAF control. Supports custom domains with ACM certificates in the same region +- **Edge-optimized**: Routes requests through CloudFront POPs for optimized TCP connections to global clients. Does **NOT** cache at the edge. For actual edge caching, use a self-managed CloudFront distribution with a Regional API. ACM certificate must be in `us-east-1` +- **Private**: Accessible only from within a VPC via `execute-api` VPC endpoint. REST API only. 
See the Private API Endpoints section below + +**Trade-offs with central account**: + +- X-Ray traces can span accounts using CloudWatch cross-account observability (source/monitoring account linking), but require explicit setup +- Usage plans cannot track across accounts without aggregation +- CloudWatch dashboards require aggregation in central account + +## Multi-Tenant SaaS Specific Concerns + +- Tiered usage plans (free/pro/enterprise or bronze/silver/gold) +- Lambda authorizer validates JWT from external IdP, extracts tenant ID, retrieves per-tenant API key from DynamoDB, returns it in `usageIdentifierKey` for transparent per-tenant throttling (see `references/authentication.md` for full flow) +- Set `ApiKeySourceType: AUTHORIZER` so that API Gateway reads the key from the authorizer response and tenants never see or manage API keys +- Onboarding automation: create tenant in IdP + API key in API Gateway + mapping in DynamoDB + associate key with usage plan tier +- Forward tenant identification as custom header to backend for tenant-specific logic +- **Lambda tenant isolation mode**: For compute-level tenant isolation beyond throttling, create Lambda functions with `--tenancy-config '{"TenantIsolationMode": "PER_TENANT"}'`. Lambda isolates execution environments per tenant: each environment is reused only for invocations from the same tenant, preventing cross-tenant data access via in-memory or `/tmp` storage. API Gateway maps the tenant ID to the `X-Amz-Tenant-Id` header on the Lambda integration request (e.g., from a Lambda authorizer context value via `context.authorizer.tenantId`, or from a client request header via `method.request.header.x-tenant-id` mapped to `integration.request.header.X-Amz-Tenant-Id`). Tenant ID is available in the handler context object (`context.tenantId` in Node.js, `context.tenant_id` in Python). Tenant isolation mode must be set at function creation time (cannot be changed later). 
Expect more cold starts since execution environments are not shared across tenants. All tenants share the function's execution role; for fine-grained per-tenant permissions, propagate tenant-scoped credentials from upstream components + +## Integration Patterns + +API Gateway supports five integration types: `AWS`, `AWS_PROXY`, `HTTP`, `HTTP_PROXY`, and `MOCK`. See `references/service-integrations.md` for detailed configuration of each pattern. + +**Lambda integrations** (`AWS_PROXY` / `AWS`): the most common integration type. `AWS_PROXY` (Lambda proxy) is the recommended default: API Gateway passes the full request to Lambda and returns the Lambda response directly, no mapping templates needed. `AWS` (Lambda non-proxy) allows VTL request/response transformation but requires more setup. + +**Direct AWS service integrations**: integrate directly with AWS services without Lambda. Two implementation approaches: + +- **REST API and WebSocket API** use `Type: AWS` with VTL mapping templates for full request/response transformation. Supports most AWS service actions +- **HTTP API** uses first-class integrations (`Type: AWS_PROXY` with `IntegrationSubtype`) with parameter mapping instead of VTL. Supported services: EventBridge (`PutEvents`), SQS (`SendMessage`, `ReceiveMessage`, `DeleteMessage`, `PurgeQueue`), Kinesis (`PutRecord`), Step Functions (`StartExecution`, `StartSyncExecution`, `StopExecution`), and AppConfig (`GetConfiguration`). 
DynamoDB, SNS, and S3 are not available as HTTP API first-class integrations; use Lambda proxy instead + +Most commonly used service integrations (REST API `Type: AWS` can integrate with any AWS service that has an HTTP API; the list below covers the most popular patterns; see `references/service-integrations.md` for details): + +- **EventBridge**: event ingestion +- **SQS**: async message buffering +- **SNS**: fan-out pub/sub to multiple subscribers (REST/WebSocket only) +- **DynamoDB**: full CRUD with optional Streams for async processing (REST/WebSocket only) +- **Kinesis Data Streams**: high-throughput ordered data ingestion +- **Step Functions**: workflow orchestration (sync Express or async Standard). +- **S3**: file upload/download proxy with binary media type support (REST/WebSocket only) + +**HTTP integrations** (`HTTP` / `HTTP_PROXY`): proxy to any HTTP endpoint (ALB, NLB, ECS, EC2, on-premises, external APIs). Use VPC Link for private backends. Available on REST and HTTP APIs. API Gateway is a valid choice for east-west (service-to-service) traffic when API management capabilities are needed beyond what load balancing provides (throttling, usage plans, request validation, authentication, and centralized observability). For internal calls that do not need these controls, prefer direct invocation (ALB, service mesh, or Lambda-to-Lambda) for lower latency and cost. + +**Mock integrations** (`Type: MOCK`): responses without any backend (health checks, CORS preflight, prototyping) + +Common patterns across all integrations: IAM execution roles, request validation, response mapping, Lambda sync/async invocation, backend bypass prevention (zero trust), and binary media type handling. **Security note for direct service integrations**: Use VTL mapping templates and API Gateway request validators. Every field that reaches the AWS service must be explicitly constructed in the mapping template. 
Never pass user input directly into service parameters without validation. Scope the IAM execution role to minimum required actions and specific resource ARNs (e.g., a single SQS queue, a single DynamoDB table). For S3 integrations, hardcode the bucket and validate key patterns to prevent path traversal. + +## Hybrid / On-Premises Workloads + +Connect API Gateway to on-premises or edge applications: + +1. VPC with connectivity to on-prem (VPN/Direct Connect/Transit Gateway) +2. NLB/ALB with target group using IP addresses to register on-prem server IPs +3. VPC Link to the NLB/ALB +4. API Gateway with `VPC_LINK` integration type + +**AWS Outposts**: Workloads running on Outposts (EC2, ECS, ALB) can also serve as integration targets. Outposts extend the VPC into the on-premises environment, so the same VPC Link + NLB/ALB pattern applies. Register Outposts instance IPs in the NLB target group + +**Connectivity considerations**: This pattern assumes stable, low-latency connectivity to the on-premises location. AWS Direct Connect provides the most reliable path. Site-to-Site VPN connections are inherently less stable; tunnel flaps cause NLB/ALB targets to become unreachable. With default NLB/ALB health check settings (30s interval, 3 failures), there is a ~90-second window where API Gateway sends traffic to unreachable targets, resulting in integration timeouts. Tune NLB health check intervals (10s, 2 failures) and set API Gateway integration timeouts to match your SLA. Implement a `/health` endpoint on the on-premises target that validates downstream dependencies. 
Monitor NLB/ALB `UnHealthyHostCount` and alarm on it + +## Private API Endpoints (not accessible from the public Internet) + +- REST API only, accessible via VPC interface endpoint for `execute-api` +- Resource policy must allow access from VPC endpoint or VPC +- Deploy VPC endpoints across multiple AZs for high availability +- `disableExecuteApiEndpoint: true` forces traffic through custom domain only + +### Private API as External API Proxy + +- Private APIs can proxy external/third-party APIs for workloads in isolated VPCs (no NAT gateway or internet access needed) +- API Gateway is a managed service in an AWS-managed VPC; it has internet connectivity even when your VPC does not +- Pattern: Private API (VPC endpoint) -> HTTP_PROXY integration -> external API +- Adds centralized logging, throttling, and access control to external API calls +- **Security warning**: This pattern effectively grants internet egress to an isolated VPC through API Gateway. Lock down the HTTP_PROXY integration to specific allowed external domains. Do not use parameterized URLs that callers can control. Apply a resource policy restricting which VPC endpoints can invoke the API. Enable full access logging. This egress path does not appear in VPC flow logs or network firewall logs, so security teams must be aware of it as a potential data exfiltration vector + +### Private API Cross-Account Access + +- **Pattern 1**: VPC endpoint in consumer account + resource policy in producer account allowing `aws:SourceVpce`. Combine with IAM authorization (SigV4) or Lambda authorizer for defense in depth. 
The resource policy controls network-level access, but without authentication any workload in the consumer VPC that can reach the VPC endpoint can invoke the API +- **Pattern 2**: PrivateLink between accounts with VPC endpoint +- **Pattern 3**: Transit Gateway connecting VPCs across accounts (one of them with VPC endpoint) + +### Custom Domains for Private APIs + +- Private custom domain names (`AWS::ApiGateway::DomainNameV2`), dualstack only +- Share cross-account via AWS RAM using domain name access associations +- Route 53 private hosted zone with alias record pointing to VPC endpoint regional DNS + +### Enforcing CloudFront as Sole Entry Point + +To prevent clients from bypassing CloudFront and hitting API Gateway directly (skipping WAF, caching, geo-restrictions): + +- **Private API approach**: Make the API Gateway endpoint private (VPC endpoint only), place CloudFront in front with VPC Origins: CloudFront → VPC Origin (internal ALB) → execute-api VPC endpoint → private API. All traffic stays within AWS private network and the API is unreachable from the public internet without CloudFront +- **Regional API + restrictions** (defense-in-depth, not a security boundary): Keep a regional endpoint but restrict direct access. Use a custom header from CloudFront (via origin custom headers) and validate it in a Lambda authorizer. Combine with disabling the default `execute-api` endpoint to force traffic through the custom domain fronted by CloudFront. **Caveat**: The header value is a static secret. If leaked through logs, source code, or developer machines, attackers can bypass CloudFront. Rotate the value regularly, store it in AWS Secrets Manager, and treat it as a credential. 
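
The custom-header check described in the Regional API approach can be sketched as a REST API REQUEST-type Lambda authorizer. This is a minimal illustration under stated assumptions, not a reference implementation: the header name `x-origin-verify`, the `authorize` function, and the `cloudfront-origin` principal are hypothetical names, and the expected secret is passed in as a parameter where a deployed function would load it from AWS Secrets Manager.

```typescript
// Sketch of a REQUEST-type Lambda authorizer that checks a shared secret
// header injected by CloudFront. All names here are illustrative.

// Hypothetical header name; must match the CloudFront origin custom header.
const ORIGIN_HEADER = "x-origin-verify";

interface RequestAuthorizerEvent {
  methodArn: string;
  headers?: { [name: string]: string | undefined };
}

interface AuthorizerResult {
  principalId: string;
  policyDocument: {
    Version: string;
    Statement: { Action: string; Effect: "Allow" | "Deny"; Resource: string }[];
  };
}

// Compare without exiting early on the first mismatch, so timing does not
// reveal how many leading characters of the secret were correct.
function constantTimeEquals(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt yields NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Header names may arrive in any case, so look the header up case-insensitively.
function findHeader(event: RequestAuthorizerEvent, name: string): string {
  const headers = event.headers ?? {};
  for (const key of Object.keys(headers)) {
    if (key.toLowerCase() === name) {
      return headers[key] ?? "";
    }
  }
  return "";
}

// expectedSecret is a parameter so the sketch stays self-contained; a deployed
// function would load it from AWS Secrets Manager (not shown here).
function authorize(event: RequestAuthorizerEvent, expectedSecret: string): AuthorizerResult {
  const presented = findHeader(event, ORIGIN_HEADER);
  // An empty or missing secret never authorizes, even if both sides are empty.
  const allowed =
    expectedSecret.length > 0 &&
    presented.length > 0 &&
    constantTimeEquals(presented, expectedSecret);
  return {
    principalId: "cloudfront-origin",
    policyDocument: {
      Version: "2012-10-17",
      Statement: [
        { Action: "execute-api:Invoke", Effect: allowed ? "Allow" : "Deny", Resource: event.methodArn },
      ],
    },
  };
}
```

Returning an IAM policy rather than a boolean keeps the sketch closer to the REST API authorizer contract. As the caveat above notes, the header is a static secret, so this check is defense in depth and not a security boundary.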
+ +### On-Premises Access to Private APIs + +- AWS Direct Connect or Site-to-Site VPN to reach VPC with VPC endpoint +- Route 53 Resolver inbound endpoints for on-premises DNS resolution of VPC endpoint DNS names or private custom domain + +## VPC Links + +VPC Links enable API Gateway to reach **private integration targets** inside a VPC that are not publicly accessible. API Gateway creates a private connection to the VPC without exposing the backend to the internet. + +- **VPC Link v2** (`AWS::ApiGatewayV2::VpcLink`): Supported by REST and HTTP APIs, targets ALB, NLB, and Cloud Map (for HTTP APIs) services. One VPC link per VPC can serve multiple backends. Prefer v2 for new integrations +- **VPC Link v1** (`AWS::ApiGateway::VpcLink`): Used by WebSocket API (and legacy REST API integrations), targets NLB only +- **Not the same as private endpoints**: A _private API endpoint_ restricts who can **call** the API (only from within a VPC via `execute-api` VPC endpoint). A _VPC Link_ controls where the API **forwards requests to** (private backends in a VPC). These are independent: a public API can use VPC Links to reach private backends, and a private API can call public HTTP endpoints without VPC Links + +## Multi-Region + +### Foundational Setup + +- API Gateway custom domain names are **regional resources**. Create the same custom domain name (e.g., `api.example.com`) independently in each region +- Each region requires its own ACM certificate for the domain. ACM certificates are also regional. Request or import in every region where the API is deployed +- Route 53 alias records point to each region's API Gateway regional domain name (the `d-xxxxxx.execute-api.{region}.amazonaws.com` target provided when creating the custom domain) +- Deploy the full stack (API Gateway, Lambda, DynamoDB, etc.) 
independently per region; there is no cross-region replication of API Gateway configuration +- Use IaC (SAM/CDK/Terraform) with parameterized region to ensure consistent deployments across regions + +### Active-Passive Failover + +- Route 53 failover routing policy with health checks on the primary region +- Health checks monitor a `/health` endpoint or a CloudWatch alarm (e.g., on 5XX error rate or backend availability) +- On primary failure, Route 53 automatically routes all traffic to the secondary region +- Route 53 Application Recovery Controller (ARC) for manual failover switches when automated routing is insufficient (e.g., data corruption in one region) +- **RPO/RTO trade-off**: Health check interval (10s or 30s) + failover propagation (~60-120s DNS TTL) determines theoretical failover speed. **In practice, plan for 3-10 minutes**, as many clients cache DNS aggressively beyond TTL (Java caches successful lookups indefinitely by default, mobile SDKs and corporate resolvers vary). Set Route 53 record TTL to 60s, but do not size SLAs around sub-minute failover. For faster failover, use Global Accelerator (anycast IP, no DNS propagation delay) or CloudFront with origin failover (seconds, not minutes) + +### Active-Active + +- Route 53 latency-based or geo-based routing to nearest region +- All regions serve traffic simultaneously; every region must be fully provisioned, not just on standby +- **Data sovereignty**: Latency-based routing may route EU users to US regions (or vice versa) if latency is lower, potentially violating GDPR or other data residency requirements. Use geo-based routing (combined with tenant locality verification) when data sovereignty is a concern. 
Note that DynamoDB Global Tables replicate data to all configured regions regardless of routing, so do not add regions that would violate data residency constraints + +### Resilient Private APIs (Multi-Region) + +- Private API in each region + VPC endpoint + Custom Domain Name +- Route 53 private hosted zone with latency-based or failover routing +- Transit Gateway with inter-Region peering for VPC connectivity +- **Health checks must be CloudWatch alarm-based**: Route 53 health checkers run from the public internet and cannot reach private API endpoints. Create CloudWatch alarms on NLB `UnHealthyHostCount`, API Gateway `5XXError` rate, or custom health metrics, then associate them with Route 53 health checks. Monitor Transit Gateway peering status separately; if inter-region peering fails, failover routing becomes critical + +## Response Streaming + +- Still a **request/response pattern**: client sends a request and receives a streamed response. The connection is one-directional (server to client) and closes when the response completes. 
For bidirectional real-time communication, use WebSocket API (see `references/websocket.md`) +- REST API only; not available for HTTP API or WebSocket API +- Set `responseTransferMode: "STREAM"` on integration +- Supports HTTP_PROXY, Lambda proxy, and private integrations (ALB/NLB/Cloud Map backends via VPC Link) +- **Lambda integrations**: Use `awslambda.streamifyResponse()` and `HttpResponseStream.from()` +- **HTTP integrations**: Backend sends a chunked transfer-encoded response (`Transfer-Encoding: chunked`); API Gateway streams chunks to the client as they arrive, with no Lambda required +- First 10 MB unrestricted; beyond 10 MB bandwidth limited to 2 MB/s +- Max streaming session: 15 minutes, removing the 10 MB buffered response limit +- Idle timeouts: 5 min (Regional/Private), 30 sec (edge-optimized) +- Billing: each 10 MB of response (rounded up) = 1 request +- **Limitations**: No VTL response transformation, no caching, no content encoding with streaming +- **Key use cases**: LLM chatbot implementations that stream sentence-by-sentence for better UX; large payload delivery beyond the 10 MB buffered response limit (up to 15 minutes of streaming); real-time data feeds (logs, metrics, event streams) where partial results are useful before the full response completes; file downloads from backend services where the client can begin processing immediately + +## Designing APIs for AI Agent Consumption + +As AI agents become API consumers, design considerations change: + +- **Rich documentation**: API descriptions must be detailed enough for LLMs to understand intent, not just for humans +- **Descriptive error messages**: AI agents need enough context in error responses to retry with corrective information +- **Minimize round-trips**: Consider how many requests are needed to perform one action. Batch operations and intent-based APIs (e.g., "manage user" vs. 
separate GET/PUT/DELETE) reduce agent complexity +- **Machine-friendly pagination**: Use cursor-based pagination that machine consumers can follow automatically +- **Resource-based vs intent-based**: Consider whether traditional CRUD or intent-based endpoints better serve AI consumers +- **Non-deterministic cost**: AI-backed APIs have variable processing cost per request (LLM token usage varies). Factor this into monetization and usage plan design + +## Reducing Backend Load + +- **Request validation at the front door**: Use API Gateway validators (headers, query strings, JSON schema) to reject bad requests before they reach the backend +- **WAF rules**: Block traffic from regions with no customers +- **Add pagination and filters**: Reduce response data volume +- **Batch operations**: Combine multiple small actions into single requests +- **Async processing**: Acknowledge request immediately, queue for backend processing at its own pace. Better for constrained backend resources +- **Caching strategy**: Use CloudFront caching first (reduces load, latency, AND cost, since the request never reaches API Gateway). Use API Gateway cache as fallback (reduces load and latency but NOT cost, as the request is still counted by API Gateway). See `references/performance-scaling.md` for cache sizing, TTL configuration, and multi-layer caching details diff --git a/plugins/aws-serverless/skills/api-gateway/references/authentication.md b/plugins/aws-serverless/skills/api-gateway/references/authentication.md new file mode 100644 index 0000000..5a1edc0 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/authentication.md @@ -0,0 +1,114 @@ +# Authentication and Authorization + +## Decision Tree + +``` +Is this a WebSocket API? + YES -> Lambda Authorizer (REQUEST type only on $connect; TOKEN type not + supported; cached policy applies for entire connection, must + cover all routes) or IAM (SigV4) + NO -> +Is the consumer an AWS service or resource? 
+ YES -> IAM Authorization (SigV4) + NO -> Is the consumer a browser-based app? + YES -> Do you use Cognito? + YES -> REST API: Cognito User Pool Authorizer + HTTP API: JWT Authorizer (Cognito issuer) + NO -> Do you use another OIDC provider? + YES -> HTTP API: JWT Authorizer + REST API: Lambda Authorizer (validate JWT) + NO -> Lambda Authorizer (custom logic) + NO -> Is this machine-to-machine (M2M)? + YES -> Do you need certificate-based auth? + YES -> mTLS (Regional custom domain + S3 truststore) + NO -> OAuth 2.0 Client Credentials Grant (Cognito + JWT/Cognito authorizer) + NO -> Lambda Authorizer (most flexible) +``` + +## IAM Authorization (SigV4) + +- Works for REST, HTTP, and WebSocket APIs (WebSocket: evaluated on `$connect` only) +- Caller signs requests with AWS Signature Version 4 +- Best for: AWS-to-AWS service calls, Cognito identity pools, resources already integrated with IAM +- **Cross-account REST API**: Requires BOTH IAM policy (caller account) AND resource policy (API account) +- **Cross-account HTTP API**: No resource policies; use `sts:AssumeRole` to assume a role in the API account +- **Multi-region**: SigV4 signatures are region-specific: the signing region must match the region receiving the request. In multi-region deployments with Route 53 failover or latency-based routing, clients signing for one region will get auth failures if routed to another. SigV4a (multi-region signing) is not supported by API Gateway. Workarounds: use a region-agnostic auth mechanism (Lambda authorizer, JWT) for multi-region APIs, or implement client-side retry logic that re-signs for the correct region on auth failure + +## Lambda Authorizers + +### REST API + +- **TOKEN type**: Receives a single header value (typically `Authorization`) as input. Returns IAM policy document. 
If the identity source header is missing, API Gateway returns 401 immediately **without invoking the Lambda**; the authorizer function never gets the chance to handle missing tokens +- **REQUEST type**: Receives headers, query strings, stage variables, and context variables as input. Returns IAM policy document. When caching is enabled and identity sources are specified, a request missing any identity source returns 401 without invoking the Lambda +- Both types must return `principalId` (string identifying the caller) alongside the policy document. Missing `principalId` causes 500 Internal Server Error +- **Response limits**: IAM policy document max ~8 KB. Exceeding this or returning a malformed response causes 500 Internal Server Error (not 401/403), a common debugging pitfall +- **Caching**: TTL default 300s, max 3600s. Cache key is the token value (TOKEN type) or identity sources (REQUEST type). When caching is enabled, the IAM policy returned by the first request is reused for subsequent requests with the same cache key. If that policy only covers specific resources (e.g., the path of the initial request), subsequent requests to other paths will be denied by the cached partial policy, causing hard-to-troubleshoot failures where clients intermittently cannot access parts of the API. Always generate IAM policies that cover the entire API when caching is enabled + +### HTTP API + +- **Simple response format**: Returns `{isAuthorized: true/false, context: {...}}`, much simpler than IAM policy +- **IAM policy format**: Also supported for more complex authorization. When using IAM policy format with caching, the same full-API policy guidance from REST API applies; see REST API caching note above +- **Identity sources**: `$request.header.X`, `$request.querystring.X`, `$context.X`, `$stageVariables.X` +- **Caching**: Disabled by default (TTL=0), unlike REST API (TTL=300s). 
Add `$context.routeKey` to identity sources to cache per-route when enabling caching +- **Timeout**: 10,000ms max + +## JWT Authorizers (HTTP API Only) + +- **Validates**: `iss`, `aud`/`client_id`, `exp`, `nbf` (must be before current time), `iat` (must be before current time), `scope`/`scp` (against route-configured scopes). Uses `kid` for JWKS key lookup. Request is denied if any validation fails +- Only RSA-based algorithms supported (RS256, RS384, RS512). ECDSA (ES256, ES384, ES512) is not supported. If your IdP signs tokens with ECDSA, use a Lambda authorizer instead +- Public key cached for 2 hours; account for this in key rotation +- Token validation runs on every request (no result caching); only the JWKS public keys are cached (2 hours). This differs from REST API Cognito authorizer which caches the validation result +- JWKS endpoint timeout: 1,500ms +- Max audiences per authorizer: 50. Max scopes per route: 10 +- Use access tokens with scopes for authorization. ID tokens also work when no scopes are configured on the route, but access tokens are preferred for API authorization +- Only supports self-contained JWTs; opaque access tokens are not supported. If your IdP issues opaque tokens by default, use a Lambda authorizer instead +- Works natively with Cognito, Auth0, Okta, and any OIDC-compliant provider + +## Cognito User Pools (REST API) + +- Native authorizer type for REST APIs +- When no OAuth scopes configured on the method: use **ID token** +- When scopes configured: use **access token** +- Set up: Create user pool, app client, configure scopes on resource server +- **Token revocation not enforced**: The Cognito authorizer validates tokens locally (signature + claims) and does not check revocation status with Cognito. Revoked tokens (`GlobalSignOut`, `AdminUserGlobalSignOut`) are accepted until the token's `exp` time, as revocation is invisible to local validation regardless of caching. 
Separately, caching (default TTL 300s) means expired tokens may be accepted for up to the TTL duration after `exp`. For immediate revocation, use a Lambda authorizer with token introspection instead +- **M2M auth**: OAuth 2.0 Client Credentials Grant (confidential app client with client ID + secret, custom resource server scopes). Also works with HTTP API JWT authorizer using Cognito as issuer + +## Resource Policies (REST API Only) + +Four key use cases: + +1. **Cross-account access**: Allow specific AWS accounts by specifying the account principal in the `Principal` field +2. **IP filtering**: Allow/deny CIDR ranges via `aws:SourceIp` (public) or `aws:VpcSourceIp` (private/VPC) +3. **VPC restriction**: Restrict to specific VPCs via `aws:SourceVpc` +4. **VPC endpoint restriction**: Restrict to specific VPC endpoints via `aws:SourceVpce` + +### Policy Evaluation + +Evaluation depends on which auth type is combined with the resource policy: + +- **Same account + IAM or Lambda authorizer**: OR logic. If the auth mechanism allows, access is granted even if the resource policy has no matching statement (silent). An explicit Deny in the resource policy still wins +- **Same account + Cognito**: AND logic. Both the Cognito authorizer and the resource policy must allow +- **Resource policy alone** (no other auth): Must explicitly allow, otherwise request is denied +- **Cross-account**: AND logic. BOTH resource policy AND caller auth must explicitly allow. A silent resource policy results in implicit deny. This applies regardless of auth type (IAM, Cognito, Lambda authorizer) +- An explicit Deny always wins regardless of combination +- **Always redeploy the API after changing the resource policy** + +## Mutual TLS (mTLS) + +- Truststore in S3 (PEM-encoded, max 1,000 certs, max 1 MB). 
Certificate chain max 4 levels deep; minimum SHA-256 signature, RSA-2048 or ECDSA-256 key strength +- S3 bucket must be in the same region as API Gateway; enable versioning for rollback +- Works with **Regional** custom domain names for REST and HTTP APIs. Edge-optimized custom domains do not support mTLS +- WebSocket APIs do not support native mTLS; use CloudFront viewer mTLS instead (see `references/security.md`) +- ACM certificate required for the API Gateway domain (ACM-issued or imported) for server-side TLS. Truststore accepts CA certificates from any source (ACM Private CA, commercial CA, self-signed root); just needs PEM format +- **Private APIs do not natively support mTLS**. Use ALB as a reverse proxy in front: Client → ALB (mTLS verify with trust store) → VPC endpoint → Private API Gateway → backend. The ALB terminates the mTLS handshake, validates the client certificate, and forwards the request to the private API via the execute-api VPC endpoint +- **Disable default endpoint**: Always set `disableExecuteApiEndpoint: true` when using mTLS; otherwise clients can bypass mTLS entirely by calling the default `execute-api` URL directly +- **CRL checks**: API Gateway does not check Certificate Revocation Lists. Implement via Lambda authorizer checking against CRL in DynamoDB/S3 +- **Certificate propagation to backend**: Use Lambda authorizer to extract subject, return in context, inject as custom header via `RequestParameters` + +## API Keys + +- **Not a primary authorization mechanism** (easily shared/exposed) +- Use with usage plans for throttling/quota enforcement only +- Max 10,000 API keys per region (adjustable). Imported key values must be 20-128 characters +- Key source: `HEADER` (default, `x-api-key`) or `AUTHORIZER` (Lambda returns key in `usageIdentifierKey`) +- REST API only. 
HTTP API does not support API keys or usage plans diff --git a/plugins/aws-serverless/skills/api-gateway/references/custom-domains-routing.md b/plugins/aws-serverless/skills/api-gateway/references/custom-domains-routing.md new file mode 100644 index 0000000..0633118 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/custom-domains-routing.md @@ -0,0 +1,117 @@ +# Custom Domains and Routing + +## Custom Domain Names + +### Setup by Endpoint Type + +- **Edge-optimized**: ACM certificate must be in `us-east-1`. Creates an internal, AWS-managed CloudFront distribution (not visible in your CloudFront console, not configurable). Does **NOT** cache at the edge. For actual edge caching, use a separate CloudFront distribution with a Regional API. DNS CNAME/alias to CloudFront domain +- **Regional**: ACM certificate must be in same region as API. DNS CNAME/alias to regional domain name (`d-xxx.execute-api.region.amazonaws.com`) +- **Private**: REST API only. Dualstack only (`AWS::ApiGateway::DomainNameV2`). Domain name access associations link the domain to VPC endpoints. Route 53 alias in private hosted zone pointing to VPC endpoint regional DNS. Cross-account sharing via AWS RAM domain name access associations. ACM certificate in the same region + +**Certificate requirements:** + +- **Edge-optimized**: ACM-issued public certificate or certificate imported into ACM. Must be in us-east-1. Imported certificates must be [manually rotated before expiration](https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-edge-optimized-custom-domain-name.html) +- **Regional and Private**: ACM-issued public certificate or certificate imported into ACM. 
Private CA certificates (ACM Private CA) are only for mTLS truststores, not for the domain itself + +### Limits + +- Public custom domains: 120/region +- Private custom domains: 50/region +- API mappings per domain: 200 +- Base path max length: 300 characters + +### Common Issues + +- **CNAMEAlreadyExists** (edge-optimized only): CNAME already associated with another CloudFront distribution. Delete or update existing CNAME first, or use Regional endpoint type to avoid this +- **Wrong certificate returned**: DNS record points to stage URL instead of API Gateway domain name target +- **Deletion quota**: 1 per 30 seconds. Use exponential backoff +- **403 "Missing Authentication Token"**: Stage name included in URL when using custom domain. Remove stage name from path + +## Base Path Mappings + +### Multi-Segment Paths + +- Paths can contain forward slashes: `/sales/reporting`, `/sales/reporting/v2`, `/corp/admin` +- Each routes to a different API endpoint +- Use `AWS::ApiGatewayV2::DomainName` and `AWS::ApiGatewayV2::ApiMapping` with `ApiMappingKey` +- Works with both REST (v1) and HTTP (v2) APIs +- Domain and APIs must be in same account and Region +- Each sub-application deployed independently + +### Multi-Tenant White-Label + +White-label domain support allows SaaS providers to serve multiple external customers through customer-specific subdomains (e.g., `customer1.example.com`, `customer2.example.com`) while routing all traffic through a single API Gateway API. Based on the pattern described in [Using API Gateway as a Single Entry Point for Web Applications and API Microservices](https://aws.amazon.com/blogs/architecture/using-api-gateway-as-a-single-entry-point-for-web-applications-and-api-microservices/) (AWS Architecture Blog). + +**Setup:** + +1. Register a domain (e.g., `example.com`) and create CNAME records for each customer subdomain (`customer1.example.com`, `customer2.example.com`) via Route 53 or your DNS provider +2. 
Create an ACM wildcard certificate (`*.example.com`), which covers one subdomain level only (`tenant1.example.com` matches, `a.tenant1.example.com` does not) +3. Create a custom domain in API Gateway for each customer subdomain using the wildcard certificate. Each subdomain can have its own base path mappings or routing rules, or use a shared mapping with backend routing based on the forwarded Host header +4. Point each subdomain's CNAME record to the API Gateway domain name target +5. Forward the original `Host` header as a custom header to the backend so it can identify the customer: + - REST API: map `method.request.header.host` to `integration.request.header.Customer` via `RequestParameters` + - HTTP API: use parameter mapping: `overwrite` on `integration.request.header.Customer` from `$request.header.host` + +**Key considerations:** + +- The wildcard certificate applied to API Gateway allows multiple subdomains to be served by a single API endpoint +- Each customer subdomain is created as a separate custom domain in API Gateway, enabling per-customer base path mappings or routing rules +- Backend microservices use the forwarded customer header to apply customer-specific business logic +- API Gateway's request/response transformation can insert or modify headers per customer +- The 120 public custom domains per region quota limits the number of customer subdomains (request increase if needed) + +## Routing Rules (Preferred for New Domains) + +**Routing rules are the recommended approach** over base path mappings for new custom domains, offering header-based routing, priority-based evaluation, and simpler management. Supports public and private REST APIs only. HTTP API and WebSocket API do not support routing rules; use base path mappings instead. 
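
As a rough mental model of the priority-based, header-aware evaluation that routing rules provide, the sketch below evaluates a list of rules in Python. It is not API Gateway's implementation: the `Rule` class, the `route` function, and the target labels are hypothetical, and base-path matching is modeled as a simple path-segment prefix match, which is an assumption. The other semantics (lower priority number wins, conditions are ANDed, header names are case-insensitive, header values are case-sensitive, a rule with no conditions is a catch-all) follow the descriptions in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory model of one routing rule."""
    priority: int                       # lower number = higher precedence
    target: str                         # illustrative label, e.g. "api-v2/prod"
    headers: Dict[str, str] = field(default_factory=dict)  # name -> value pattern
    base_path: Optional[str] = None     # None means "no base path condition"


def value_matches(pattern: str, value: str) -> bool:
    """Case-sensitive match supporting '*suffix', 'prefix*', and '*contains*'."""
    starts = pattern.startswith("*")
    ends = pattern.endswith("*")
    core = pattern.strip("*")
    if starts and ends:
        return core in value
    if starts:
        return value.endswith(core)
    if ends:
        return value.startswith(core)
    return value == pattern


def rule_matches(rule: Rule, path: str, headers: Dict[str, str]) -> bool:
    """All conditions on a rule must hold (AND logic)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name, pattern in rule.headers.items():
        value = lowered.get(name.lower())   # header names compare case-insensitively
        if value is None or not value_matches(pattern, value):
            return False
    if rule.base_path is not None:
        # Assumption: a base path matches when its segments prefix the request path.
        want = rule.base_path.strip("/").split("/")
        got = path.strip("/").split("/")
        if got[:len(want)] != want:
            return False
    return True


def route(rules: List[Rule], path: str, headers: Dict[str, str]) -> Optional[str]:
    """Return the target of the first matching rule in priority order, else None."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule_matches(rule, path, headers):
            return rule.target
    return None  # in "routing rules only" mode this corresponds to a 404


rules = [
    Rule(100, "v2/prod", headers={"X-API-Version": "v2"}),
    Rule(200, "beta/prod", headers={"x-test-group": "beta*"}, base_path="orders"),
    Rule(1000, "default/prod"),  # no conditions: catch-all
]
```

With these sample rules, a request carrying `x-api-version: v2` resolves to the first rule, while a request that satisfies no conditional rule falls through to the catch-all. Removing the catch-all makes `route` return `None`, which mirrors the 404 behavior of "Routing rules only" mode.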
+ +### Rule Structure + +- **Conditions**: Up to 2 `MatchHeaders` + 1 `MatchBasePaths` (AND logic) +- **Actions**: Invoke any stage of any REST API in the same account and region +- **Priority**: 1-1,000,000 (lower = higher precedence, no duplicates). Leave gaps between priorities (100, 200, 300) to allow inserting new rules later. Creating a rule with a duplicate priority fails with `ConflictException` +- Header matching supports wildcards: `*latest` (matches values ending with "latest"), `alpha*` (matches values starting with "alpha"), `*v2*` (contains). Header names are case-insensitive; header values are case-sensitive + +### Routing Modes + +1. **API mappings only** (default): Traditional base path mapping behavior. Use if not adopting routing rules +2. **Routing rules then API mappings**: Routing rules take precedence; unmatched requests fall back to base path mappings. Use for zero-downtime migration from base path mappings to routing rules +3. **Routing rules only**: **Recommended mode** for new custom domains or after completing migration from base path mappings. Requests that match no routing rule receive a 404 response + +### Migration from Base Path Mappings + +1. Set routing mode to "Routing rules then API mappings" — existing base path mappings continue as fallback +2. Progressively create routing rules (e.g., start with a test header rule for controlled traffic). Include a catch-all rule (no conditions) at the lowest priority as a safety net; without this, unmatched requests will receive a 404 after switching modes in step 4 +3. Monitor with `$context.customDomain.routingRuleIdMatched` in access logs to verify routing behavior and confirm all expected traffic paths are covered by rules +4. 
Once all traffic is covered by rules, switch to "Routing rules only" mode + +### Implementation + +- CloudFormation: `AWS::ApiGatewayV2::RoutingRule` +- Observability: `$context.customDomain.routingRuleIdMatched` in access logs +- No additional charges for routing rules; standard API Gateway request pricing applies +- A rule with no conditions serves as a catch-all matching all requests + +### Use Cases + +- **API versioning**: Route by `Accept` or `X-API-Version` header to different API implementations +- **Gradual rollouts**: Route a percentage of users to new version by adding a header in application code, then gradually increase +- **A/B testing**: Route specific user cohorts by custom header (e.g., `x-test-group: beta-testers`) +- **Cell-based architecture**: Route by tenant ID or hostname header to different cell backends +- **Dynamic backend selection**: Route by cookie value, media type, or any custom header + +## Header-Based API Versioning + +Route API requests to different backend implementations based on a version header (REST APIs only). + +- Create a routing rule per version with `MatchHeaders` on the version header (e.g., `X-API-Version: v1`, `X-API-Version: v2`) +- Each rule invokes the corresponding API/stage +- Add a catch-all rule at the lowest priority to route unversioned requests to the default (latest stable) version +- Monitor with `$context.customDomain.routingRuleIdMatched` in access logs to track version adoption +- No additional infrastructure, no Lambda@Edge, no DynamoDB. 
Purely declarative + +## Host Header Forwarding + +- API Gateway overwrites Host header with integration endpoint hostname +- Cannot forward original Host header directly +- **REST API workaround**: Create custom header in Method Request, map in Integration Request: `method.request.header.host` -> `integration.request.header.my_host` +- **HTTP API workaround**: Use parameter mapping to forward the host header: `overwrite` on `integration.request.header.X-Original-Host` from `$request.header.host` diff --git a/plugins/aws-serverless/skills/api-gateway/references/deployment.md b/plugins/aws-serverless/skills/api-gateway/references/deployment.md new file mode 100644 index 0000000..d778629 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/deployment.md @@ -0,0 +1,137 @@ +# Deployment Strategies + +## Deployment Basics + +### Understanding Deployments + +A Deployment in API Gateway is an **immutable snapshot of your API configuration**, not an action. Think of it like a git commit: changes to your API (resources, methods, integrations, authorizers) are like commits to a main branch that cannot be invoked externally. To make changes callable, you create a Deployment (snapshot) and point a Stage to it. + +- **Creating a Deployment** = taking a snapshot of the current API state +- **Deploying to a Stage** = updating a stage to point to that snapshot +- Multiple stages can point to the same or different deployments +- **Console "Test Invoke" bypasses deployments**: it always uses the current API state (not a deployed snapshot). Bypasses IAM auth, Lambda authorizers, Cognito authorizers, API key validation, throttling, WAF, resource policies, and mTLS. Use [`TestInvokeAuthorizer`](https://docs.aws.amazon.com/apigateway/latest/api/API_TestInvokeAuthorizer.html) to test authorizer logic separately. This is why "it works in console but not when invoked" is a common complaint + +### REST API + +- Explicit deployment required to make changes live. 
Each deployment is immutable +- A stage cannot be created without a deployment (deploymentId is required in CreateStage) +- A deployment can be created without deploying to a stage (stageName is optional in CreateDeployment) +- **Always redeploy after**: changing resource policy, adding/modifying methods, updating integrations, configuring authorizers, modifying models or request validators +- **No redeployment needed for**: throttling/usage plan changes, logging configuration, caching TTL (capacity changes take effect without redeployment but cause ~4 minutes of cache unavailability during resizing), stage variable values, client certificate changes, WAF association changes (propagation takes minutes). These take effect on the stage without a new deployment +- Max 10 stages per API (adjustable via [Service Quotas](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html)) +- Stage variables: max 100 per stage, referenced as `${stageVariables.variableName}` in integration URIs and `$stageVariables.variableName` in VTL mapping templates + +### HTTP API + +- A stage can be created without a deployment (then updated via UpdateStage) +- Supports automatic deployments (AutoDeploy); changes deploy immediately +- Explicit deployments also supported for manual control +- **AutoDeploy caveat**: AutoDeploy is a **security risk**. It triggers a new deployment after each API management operation completes. When making multiple changes via separate API calls, intermediate states are briefly live. A new route may be deployed before its authorizer is attached, **exposing an unauthenticated endpoint to the internet** for seconds to minutes. Routes may also deploy before their integration or IAM role, causing 500 errors. With explicit deployments (or SAM/CDK), all configuration changes are made first, then a single deployment snapshot is created, avoiding intermediate states. 
**Avoid AutoDeploy in production**; it is a security and availability risk, not just an operational inconvenience + +### WebSocket API + +- A stage can be created without a deployment (then updated via UpdateStage) +- Does **not** support automatic deployments (AutoDeploy); every change requires an explicit redeployment + +### Deployment Propagation + +Changes do not propagate to all API Gateway data plane hosts simultaneously. During propagation: + +- Some hosts serve the new deployment while others still serve the old one +- If you delete a resource (e.g., Lambda function) that the old deployment references, requests that reach hosts still serving the old deployment will get 500 errors +- **Always retain old resources** until propagation completes, then remove them in a subsequent deployment + +## Canary Deployments (REST API Only) + +Route a percentage of traffic to a canary deployment for testing **API configuration changes** (not code changes): + +1. Deploy new version to a canary +2. Configure canary traffic percentage (e.g., 10%) +3. Monitor via CloudWatch Logs (`API-Gateway-Execution-Logs_REST_API_ID/STAGE_NAME`) +4. **Promote**: "Promote Canary" replaces the stage's deployment with the canary's deployment and removes all canary settings in a single operation. All traffic then uses the new configuration. **Note**: If `useStageCache: false` (canary used a separate cache), the canary cache is discarded on promotion, causing a cache miss spike. Consider flushing the stage cache or setting short TTLs during canary testing +5. **Rollback**: **Delete the canary release** to revert all traffic to the base stage deployment. 
Setting the percentage to 0% merely stops canary traffic but does not remove canary settings, so it is not a proper rollback + +- Configure via `canarySettings` on a stage: `percentTraffic` (0.0–100.0), `useStageCache` (whether canary uses the stage cache or a separate one) +- **Canary releases test API Gateway configuration** (new resources, integrations, mapping templates, authorizers), not Lambda code changes. For Lambda code canary, use Lambda aliases with weighted routing +- Monitor `Latency`, `5XXError`, `4XXError` CloudWatch metrics filtered by canary stage to compare against the production baseline before promoting +- SAM: Define canary settings in the template (the `CanarySetting` property of `AWS::Serverless::Api`) and apply them with `sam deploy` +- Stage variable overrides supported during canary period +- For direct service integrations (DynamoDB, SQS, etc.): canary deployments are the only way to do gradual rollouts since there are no Lambda aliases involved + +## Manual Rollback via Deployment History (REST API) + +REST APIs retain a history of all deployments. The fastest rollback mechanism is to point the stage back to a previous deployment ID using `UpdateStage`: + +``` +aws apigateway update-stage --rest-api-id REST_API_ID --stage-name STAGE_NAME \ + --patch-operations op=replace,path=/deploymentId,value=DEPLOYMENT_ID +``` + +- Near-instant: no CloudFormation involved, no new deployment created +- Use `GetDeployments` to list available deployment IDs with creation dates +- **This is the recommended emergency rollback path**: it is faster than CloudFormation rollback and avoids the drifted-snapshot risk of a CloudFormation rollback (Deployment Pitfalls, item 4) +- Does not affect Lambda code; only reverts API Gateway configuration (routes, integrations, authorizers, mapping templates). For Lambda code rollback, update the alias to the previous version +- **CloudFormation drift warning**: After manual rollback, the stage's deploymentId diverges from what CloudFormation tracks. The next `sam deploy` or stack update will overwrite your manual rollback with whatever deployment CloudFormation computes. 
Always follow up with a proper IaC deployment to re-synchronize state + +## Blue/Green Zero-Downtime Deployments + +Based on the blog https://aws.amazon.com/blogs/compute/zero-downtime-blue-green-deployments-with-amazon-api-gateway/ + +Use custom domain API mapping to switch traffic between two environments: + +### Architecture + +1. **Blue stack**: Current production REST API (separate SAM stack) +2. **Green stack**: New version REST API (separate SAM stack) +3. **Custom domain stack**: Route 53 record + ACM certificate + API Gateway custom domain with API mapping + +### Workflow + +1. Deploy blue stack +2. Deploy custom domain stack pointing to blue +3. Deploy green stack +4. Test green via its direct invoke URL +5. Update custom domain stack to activate green stack +6. Monitor and rollback by re-pointing to blue if needed (see notes on propagation below) +7. **Cleanup**: Delete the **inactive** (old) API stack first (it receives no traffic). Keep the custom domain stack and active API stack running. Only delete the custom domain stack when decommissioning the entire service. **Never delete the custom domain stack while APIs are still serving traffic**, as this removes the production endpoint immediately + +### Notes + +- **Propagation**: API mapping changes propagate within minutes (no DNS change involved; the custom domain DNS record stays the same). During propagation, some API Gateway hosts serve the old mapping while others serve the new one +- **In-flight request risk**: During the propagation window, clients may receive responses from either blue or green non-deterministically. API Gateway does not maintain request affinity. **Both blue and green must be fully functional and backward-compatible during transition.** Persist all state in the backend (DynamoDB, SQS, etc.); do not rely on in-memory state in Lambda, as a multi-step workflow may start on blue and complete on green. 
Verify propagation is complete by returning a version identifier from your integration and polling until 100% of responses show the new version +- **Rollback**: Re-pointing to blue has the same propagation delay as the initial switch; it is not instant. Plan for minutes of mixed traffic during rollback +- External custom domain URL never changes +- Each environment is a complete, independent API deployment + +## Routing Rules for A/B Testing (REST API Only) + +- **REST API only**. HTTP API and WebSocket API do not support routing rules (use base path mappings instead) +- Route specific users by header value to different API/stage combinations on a custom domain without Lambda +- Configure via `AWS::ApiGatewayV2::RoutingRule` resources with conditions (headers, base paths) and actions (target API + stage) +- Combine with stage variables for flexible targeting +- Zero-downtime: Start in "Routing rules then API mappings" mode, existing mappings serve as fallback +- See `references/custom-domains-routing.md` for rule structure, priority, and routing modes + +## Infrastructure as Code + +For IaC framework selection (SAM vs CDK), project setup, CI/CD pipelines, and environment management, see the [aws-serverless-deployment skill](../../aws-serverless-deployment/). + +API Gateway IaC best practices: + +- Embed OpenAPI specifications in IaC templates rather than defining APIs with IaC syntax directly +- Export OpenAPI specs from development tools, import into API Gateway +- Use IaC for all production deployments, not console + +## Deployment Pitfalls + +1. **Changes not taking effect**: Must create a new deployment for REST APIs after any change +2. **CloudFormation logical ID must change**: `AWS::ApiGateway::Deployment` is immutable. If the logical ID stays the same, CloudFormation won't create a new deployment on subsequent stack updates. 
SAM and CDK auto-generate unique logical IDs by hashing the API definition, but if changes are made outside the API definition (e.g., only Lambda code changed), the hash stays the same and no new deployment is created. **Fix**: Change the API description or any definition field to force a new hash +3. **Deleting old resources causes 5XX during propagation**: If CloudFormation deletes the old Lambda function (or role, alias, etc.) while API Gateway is still propagating the new deployment, hosts still pointing to the old snapshot will return 500. **Fix**: Use a two-phase deployment: (1) deploy the new resources alongside the old ones with `DeletionPolicy: Retain` on resources being replaced, (2) after propagation completes, deploy again to remove the old resources. For Lambda aliases, point the alias to the new version but keep the old version published until propagation is confirmed +4. **CloudFormation rollback creates new snapshot, not the original**: When CloudFormation rolls back, it creates a new Deployment with the current API state; it does not restore the original deployment ID. If there's stack drift (e.g., manual console changes to the API), the rollback snapshots the _drifted_ state, not the last known-good state — the operator believes they rolled back to safety but are running an untested configuration. **Mitigations**: Avoid manual/console changes to production APIs; run `aws cloudformation detect-stack-drift` before relying on rollback; set up drift detection alarms. For fastest recovery, use manual rollback via deployment history (see below) instead of CloudFormation rollback +5. **Limited deployment inspection**: `GetDeployment` returns only ID, description, and date (not the API snapshot). However, you can use `GetExport` on a stage pointing to a deployment to retrieve the full API definition as OpenAPI. To diff two deployments, export from two stages pointing to different deployments and diff the exports. 
Track changes primarily through IaC version control +6. **`DeploymentStatus: DEPLOYED` is misleading**: HTTP/WebSocket API reports `DEPLOYED` even for deployments never associated with any stage. The status means "snapshot created successfully", not "deployed to a stage" +7. **Stage variable Lambda permissions**: When referencing Lambda via `${stageVariables.functionName}` in an integration URI, you must manually add a resource-based invoke permission — SAM/CDK do NOT auto-generate these for stage-variable references (only for direct ARN references). Without it, every invocation returns a 500 "Internal server error" with no hint about the cause. Add permission with `aws lambda add-permission --function-name --statement-id apigw- --action lambda:InvokeFunction --principal apigateway.amazonaws.com --source-arn "arn:aws:execute-api::://*"`. **This must be repeated for every new stage and every new Lambda function referenced by stage variables** +8. **Circular dependency**: Never reference `ServerlessRestApi` or `ServerlessHttpApi` in Lambda environment variables, `Outputs`, IAM policy resources, or other resource properties — all of these create circular dependencies. For request-handling Lambda functions, derive API URL at runtime from `event["requestContext"]`. For non-request contexts (callback URLs, webhook registrations), use SSM Parameter Store — write the API URL via a post-deploy script, or use an explicit `AWS::Serverless::Api` resource which breaks the circular dependency +9. **YAML duplicate keys**: Automated template patching can silently introduce duplicate keys. Validate: `sam validate` or `python3 -c "import yaml; yaml.safe_load(open('template.yaml'))"` +10. **Management API rate limit**: Management API calls share an aggregate rate limit of 10 rps with 40-burst across all API Gateway operations in the account. Individual operations have stricter limits. `CreateDeployment` is limited to **1 request every 5 seconds** (0.2 rps, fixed, not adjustable). 
CI/CD pipelines deploying many APIs in parallel will be throttled. CloudFormation reports a generic error, not a throttling message. Stagger parallel deployments or use a single pipeline with sequential stages +11. **Cache flush on redeployment**: Creating a new deployment to a stage with caching enabled flushes the entire cache, causing a temporary spike in backend load ("thundering herd"). Mitigations: (a) ensure backend auto-scaling can handle full uncached load before deploying, (b) script synthetic requests to cached endpoints after deployment to pre-warm the cache, (c) use canary deployments to limit the blast radius of cache flush, (d) backends with cold-start issues (Lambda with VPC, containers scaling from zero) compound the thundering herd — use provisioned concurrency on critical paths during deployment windows. See `references/performance-scaling.md` diff --git a/plugins/aws-serverless/skills/api-gateway/references/governance.md b/plugins/aws-serverless/skills/api-gateway/references/governance.md new file mode 100644 index 0000000..4ac24d4 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/governance.md @@ -0,0 +1,226 @@ +# API Governance + +This document focuses on governance of REST APIs (API Gateway V1), as governance and compliance controls are primarily a concern in enterprise environments where REST APIs are the typical choice due to their full API management capabilities (usage plans, WAF, resource policies, request validation). + +## Governance Framework + +Four types of security controls: + +1. **Preventative**: Prevent unauthorized changes before they occur (IAM policies, SCPs) +2. **Proactive**: Prevent noncompliant resources before deployment (CloudFormation Hooks, Guard rules) +3. **Detective**: React to configuration changes after they happen (AWS Config rules, EventBridge) +4. 
**Responsive**: Remediate adverse events (automated remediation via Lambda) + +## Governance Tools + +### Preventative Controls + +- **IAM policies and permission boundaries**: Fine-grained API Gateway control plane access +- **Service Control Policies (SCPs)**: Organization-wide guardrails using API Gateway condition keys. The policy examples below show statement bodies only. A deployable SCP needs a complete policy document with a `Version` and a `Statement` array, and each statement needs an `"Effect"`. SCP statements implicitly apply to all principals, so `"Principal": "*"` is not needed in SCPs (it is required in resource-based policies) +- **Over 20 IAM condition keys** available for API Gateway governance, split into two categories: + - `apigateway:Request/*`: Evaluates the new values being set in the request + - `apigateway:Resource/*`: Evaluates the current values on the existing resource being acted upon + - Both are preventative, evaluated at IAM authorization time before the action occurs +- Key condition keys: `apigateway:Request/EndpointType`, `apigateway:Request/SecurityPolicy`, `apigateway:Resource/ApiKeyRequired`, `apigateway:Request/AuthorizationType`, `apigateway:Request/AccessLoggingDestination`, `apigateway:Request/MtlsTrustStoreUri` + +### Proactive Controls + +- **CloudFormation Hooks**: Evaluate resource configuration before deployment. Noncompliant resources either block the deployment or raise a warning, depending on how the hook is configured +- **AWS Control Tower**: Preconfigured proactive controls for API Gateway +- **CloudFormation Guard**: Open-source policy-as-code evaluation tool with declarative DSL. Managed rule set available for API Gateway +- **Limitation**: CloudFormation-based proactive controls may not work with non-CloudFormation IaC tools (Terraform, Pulumi) + +### Detective Controls + +- **AWS Config**: Managed rules + custom rules (Guard DSL or Lambda). 
Key managed rules: + - `api-gw-xray-enabled` + - `api-gw-associated-with-waf` + - `api-gw-ssl-enabled` + - `api-gw-execution-logging-enabled` + - `api-gw-endpoint-type-check` +- **Amazon EventBridge**: React to API Gateway events in real-time with Lambda +- **AWS Security Hub**: Findings trigger EventBridge events for automated remediation +- **AWS Trusted Advisor**: Service-level checks for optimization, performance, security + +## Enforcing Observability + +### Require X-Ray Tracing + +- Preventative: N/A (no IAM/SCP conditions) +- Proactive: Custom CloudFormation Hook or Guard rule +- Detective: AWS Config rule `api-gw-xray-enabled` + +### Require Access Logging + +- Preventative: SCP using `apigateway:Request/AccessLoggingDestination` and `apigateway:Request/AccessLoggingFormat` + +```json +{ + "Effect": "Deny", + "Action": ["apigateway:PATCH", "apigateway:POST", "apigateway:PUT"], + "Resource": ["arn:aws:apigateway:*::/restapis/*/stages/*"], + "Condition": { "StringLikeIfExists": { "apigateway:Request/AccessLoggingDestination": "" } } +} +``` + +- **Side effect**: `StringLikeIfExists` means any stage update that does not explicitly include `AccessLoggingDestination` (e.g., updating cache settings or stage variables) will also be denied. This is intentionally strict; use detective controls instead if this is too restrictive +- Detective: AWS Config custom Guard rule: `configuration.accessLogSettings.destinationArn is_string` + +### Require Execution Logging + +- Preventative: N/A (no IAM/SCP conditions) +- Proactive: Custom CloudFormation Hook or Guard rule +- Detective: AWS Config rule `api-gw-execution-logging-enabled` + +**Caveat**: Preventative/proactive controls may block first deployment via AWS Console (Console does not allow specifying these settings on new stage creation). IaC/CLI works fine. 
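
The detective checks in this section can be sketched as a small evaluation function. This is a minimal illustration, not an AWS-provided rule: it assumes a stage description shaped like the API Gateway `GetStage` response (`tracingEnabled`, `accessLogSettings.destinationArn`, `methodSettings["*/*"].loggingLevel`), and the function name `observability_findings` and the finding strings are hypothetical. It only inspects the stage-wide `*/*` method setting, so per-method overrides are out of scope here.

```python
from typing import Any


def observability_findings(stage: dict[str, Any]) -> list[str]:
    """Return observability gaps for one REST API stage (hypothetical helper).

    `stage` is assumed to look like a GetStage response; missing keys are
    treated as "not configured".
    """
    findings = []

    # X-Ray tracing: same intent as the `api-gw-xray-enabled` Config rule.
    if not stage.get("tracingEnabled", False):
        findings.append("xray-tracing-disabled")

    # Access logging: the destination ARN must be a non-empty string.
    destination = (stage.get("accessLogSettings") or {}).get("destinationArn")
    if not isinstance(destination, str) or not destination:
        findings.append("access-logging-disabled")

    # Execution logging: the stage-wide "*/*" setting must be ERROR or INFO.
    default_settings = (stage.get("methodSettings") or {}).get("*/*", {})
    if default_settings.get("loggingLevel", "OFF") not in ("ERROR", "INFO"):
        findings.append("execution-logging-disabled")

    return findings
```

A function like this could back a custom AWS Config rule implemented in Lambda or a pre-deployment check in CI. The managed rules listed above remain the simpler option where they cover the requirement.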
+ +## Enforcing Security + +### Require WAF + +- Preventative: N/A +- Proactive: Custom CloudFormation Hook +- Detective: AWS Config rule `api-gw-associated-with-waf` + +### Require TLS 1.2+ + +- Preventative: SCP denying `apigateway:Request/SecurityPolicy` value `TLS_1_0` +- Detective: EventBridge rule or AWS Config custom rule + +### Require mTLS + +- Preventative: SCP requiring `apigateway:Request/MtlsTrustStoreUri` is present +- Detective: EventBridge rule or AWS Config custom rule + +### Require Specific Authorizer Type + +- Preventative: SCP using `apigateway:Request/AuthorizationType` (valid values: `NONE`, `AWS_IAM`, `CUSTOM`, `COGNITO_USER_POOLS`) +- Use `apigateway:Request/AuthorizerUri` to enforce a specific Lambda authorizer. Note: the URI is the full invocation path (`arn:aws:apigateway:REGION:lambda:path/2015-03-31/functions/FUNCTION_ARN/invocations`), not just the Lambda ARN. Use `StringLike` with wildcards for flexibility + +### Require API Key + +- Preventative: SCP with `apigateway:Resource/ApiKeyRequired` or `apigateway:Request/ApiKeyRequired` +- Detective: EventBridge rule or AWS Config custom rule + +### Require Request Validation + +- Preventative: N/A +- Proactive: Custom CloudFormation Hook +- Detective: EventBridge rule or AWS Config custom rule + +### Restrict VPCs in Private API Resource Policy + +- Preventative: N/A +- Proactive: Custom CloudFormation Hook +- Detective: EventBridge rule or AWS Config custom rule + +### Audit Resource Policies for Overly Broad Access + +- Preventative: N/A (resource policy content is not exposed via IAM condition keys) +- Proactive: Custom CloudFormation Hook to reject policies with `"Principal": "*"` without `Condition` constraints +- Detective: AWS Config custom rule to flag resource policies granting unrestricted access (e.g., missing `aws:SourceVpce` or `aws:SourceIp` conditions on private APIs) + +## Enforcing Management Control + +### Freeze API Modifications by Tag + +```json +{ + "Effect": "Deny", 
+ "Action": ["apigateway:DELETE", "apigateway:PATCH", "apigateway:POST", "apigateway:PUT"], + "Resource": "*", + "Condition": { "StringEquals": { "aws:ResourceTag/EnvironmentState": "frozen" } } +} +``` + +- To lift freeze: temporarily disable the IAM policy. Note: the `frozen` tag itself cannot be removed while the policy is active (tag operations use `apigateway:PUT` which is denied) +- Alternative: Freeze deployments by stage name using `apigateway:Request/StageName` + +### Prevent Custom Domains in Child Accounts + +```json +{ + "Effect": "Deny", + "Action": ["apigateway:DELETE", "apigateway:PUT", "apigateway:PATCH", "apigateway:POST"], + "Resource": ["arn:aws:apigateway:*::/domainnames", "arn:aws:apigateway:*::/domainnames/*"] +} +``` + +### Prevent Public APIs in Non-Central Accounts + +- SCP denying `EDGE` or `REGIONAL` endpoint types in child accounts +- Detective: AWS Config rule `api-gw-endpoint-type-check` + +### Require Tags + +```json +{ + "Effect": "Deny", + "Action": ["apigateway:POST"], + "Resource": ["arn:aws:apigateway:*::/restapis"], + "Condition": { "Null": { "aws:RequestTag/owner": "true" } } +} +``` + +- Use `aws:RequestTag` for enforcing tags at creation time; use `aws:ResourceTag` for enforcing tags on updates to existing resources +- REST API child resources of RestApi, DomainName, UsagePlan inherit parent tags + +### Require Documentation + +- Proactive: Custom CloudFormation Hook +- Detective: AWS Config custom Guard rule: `configuration.documentationVersion is_string` + +### Require Compression + +- Proactive: Custom CloudFormation Hook +- Detective: AWS Config custom Guard rule: `configuration.minimumCompressionSize >= 0` + +## Management Access Control + +### API Management (Control Plane) + +- IAM policy scoped to specific API ARN: `arn:aws:apigateway:region::/restapis/apiId/*` +- **Tip**: Use `arn:aws:apigateway:region::/restapis/??????????/*` (10 question marks) to match any API ID. 
REST API IDs are observed to be exactly 10 alphanumeric characters. This scopes permissions to REST APIs without hardcoding specific IDs. Note: this length is not formally documented by AWS, but the risk of it changing is low given the installed base +- This does NOT grant access to: custom domains, client certificates, VPC links, API keys, usage plans +- IAM principals denied by default + +### Observability Access + +- Execution logs: IAM for CloudWatch Logs; use data protection policies to mask sensitive data +- Access logs: IAM for CloudWatch Logs or Firehose; control access at both Firehose AND destination +- Metrics: IAM for CloudWatch +- Traces: IAM for X-Ray +- CloudTrail: IAM for CloudTrail +- Sensitive data protection: CloudWatch Logs data protection policies, Amazon Macie for S3-stored logs + +## API Lifecycle + +**Design** -> **Build** -> **Manage** -> **Adoption** + +1. **Plan**: Protocol selection, endpoint type, topology, standards +2. **Develop/Test**: IaC, OpenAPI specs, integration testing +3. **Secure**: Auth, WAF, mTLS, resource policies +4. **Deploy/Publish**: CI/CD, canary, blue/green +5. **Scale**: Quotas, caching, cell-based architecture +6. **Monitor**: Metrics, logs, traces, dashboards +7. **Insights**: CloudWatch Logs Insights, Contributor Insights, QuickSight +8. **Monetize**: Usage plans, API segmentation, AWS Marketplace, AWS Data Exchange +9. 
**Discover**: API Gateway Portal, Backstage, partner portals (Readme, Apiable, SmartBear) + +## Audit + +- **CloudTrail**: All REST API management calls captured as events +- **AWS Config**: Record configuration changes, detect drift, enforce compliance +- **EventBridge**: React to changes in real-time +- Example EventBridge rule for stage updates: + +```json +{ + "detail": { + "eventSource": ["apigateway.amazonaws.com"], + "requestParameters": { "restApiId": ["abcd123456"], "stageName": ["prod"] }, + "eventName": ["UpdateStage"], + "errorCode": [{ "exists": false }] + } +} +``` diff --git a/plugins/aws-serverless/skills/api-gateway/references/observability.md b/plugins/aws-serverless/skills/api-gateway/references/observability.md new file mode 100644 index 0000000..6d207e3 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/observability.md @@ -0,0 +1,514 @@ +# Observability and Monitoring + +## Table of Contents + +- [Logging](#logging) + - [Execution Logging](#execution-logging-rest-api-and-websocket) + - [Access Logging](#access-logging) + - [Log Retention](#log-retention) + - [REST API Access Log Format](#recommended-access-log-format-rest-api) + - [HTTP API Access Log Format](#http-api-access-log-format) + - [WebSocket API Access Log Format](#websocket-api-access-log-format) + - [Enhanced Observability Variables](#enhanced-observability-variables) +- [Setting Up Logging](#setting-up-logging) +- [CloudWatch Metrics](#cloudwatch-metrics) +- [CloudWatch Alarms](#cloudwatch-alarms) +- [CloudWatch Metric Filters](#cloudwatch-metric-filters) +- [X-Ray Tracing](#x-ray-tracing) +- [CloudWatch Logs Insights Queries](#cloudwatch-logs-insights-queries) +- [Additional Monitoring Tools](#additional-monitoring-tools) +- [API Analytics Pipeline](#api-analytics-pipeline) +- [Cross-Account and Centralized Logging](#cross-account-and-centralized-logging) +- [CloudTrail](#cloudtrail) + +--- + +## Logging + +### Execution Logging (REST API and WebSocket) + 
+- Full request/response logs including mapping template output, integration request/response, authorizer output +- Levels: OFF, ERROR, INFO +- **Log events truncated at 1,024 bytes**; use access logs for complete data +- Log group: `API-Gateway-Execution-Logs_/` (both REST and WebSocket) +- HTTP API does NOT support execution logging +- **Cost warning**: INFO-level execution logging generates many log events per request (10-60+ depending on API complexity: authorizers, mapping templates, and integration details all add entries). At scale, CloudWatch Logs costs can exceed Lambda + API Gateway costs combined. Use ERROR level in production and enable INFO only for targeted debugging + +### What API Gateway Does NOT Log + +- 413 Request Entity Too Large +- Excessive 429 throttling responses +- 400 errors to unmapped custom domains +- Internal 500 errors from API Gateway itself + +### Access Logging + +- Customizable log format using `$context` variables +- Formats: CLF, JSON, XML, CSV +- Access log template max: 3 KB +- Destinations: CloudWatch Logs or Kinesis Data Firehose (REST only for Firehose) +- HTTP API: Only access logging supported (no execution logging) +- **Delivery latency**: Access logs can be delayed by several minutes. Use CloudWatch metrics (near-real-time) for dashboards and alarms; use access logs for investigation and deep analysis + +### Log Retention + +CloudWatch Logs default to **Never Expire**, which causes unbounded storage costs. 
Always set retention policies: + +- **Execution logs (INFO)**: 3-7 days (debugging only, high volume) +- **Execution logs (ERROR)**: 14-30 days +- **Access logs**: 30-90 days (or longer for compliance) +- **Compliance/audit logs**: 1-3 years per organizational policy + +Define log groups explicitly in SAM/CloudFormation to control retention: + +```yaml +ApiAccessLogGroup: + Type: AWS::Logs::LogGroup + Properties: + LogGroupName: !Sub "/aws/apigateway/${MyApi}/access-logs" + RetentionInDays: 90 +``` + +### Recommended Access Log Format (REST API) + +Use this JSON format for maximum troubleshooting capability with enhanced observability variables: + +**Note**: This format is for REST APIs. HTTP API and WebSocket API use different `$context` variables; see the API-specific formats below. + +```json +{ + "requestId": "$context.requestId", + "extendedRequestId": "$context.extendedRequestId", + "ip": "$context.identity.sourceIp", + "caller": "$context.identity.caller", + "user": "$context.identity.user", + "accountId": "$context.identity.accountId", + "userAgent": "$context.identity.userAgent", + "requestTime": "$context.requestTime", + "requestTimeEpoch": "$context.requestTimeEpoch", + "httpMethod": "$context.httpMethod", + "resourcePath": "$context.resourcePath", + "path": "$context.path", + "status": "$context.status", + "protocol": "$context.protocol", + "responseLength": "$context.responseLength", + "responseLatency": "$context.responseLatency", + "integrationLatency": "$context.integrationLatency", + "domainName": "$context.domainName", + "apiId": "$context.apiId", + "stage": "$context.stage", + "error-message": "$context.error.message", + "error-responseType": "$context.error.responseType", + "waf-error": "$context.waf.error", + "waf-status": "$context.waf.status", + "waf-latency": "$context.waf.latency", + "waf-response": "$context.wafResponseCode", + "authenticate-error": "$context.authenticate.error", + "authenticate-status": "$context.authenticate.status", + 
"authenticate-latency": "$context.authenticate.latency", + "authorizer-error": "$context.authorizer.error", + "authorizer-status": "$context.authorizer.status", + "authorizer-latency": "$context.authorizer.latency", + "authorizer-integrationLatency": "$context.authorizer.integrationLatency", + "authorize-error": "$context.authorize.error", + "authorize-status": "$context.authorize.status", + "authorize-latency": "$context.authorize.latency", + "integration-error": "$context.integration.error", + "integration-status": "$context.integration.status", + "integration-latency": "$context.integration.latency", + "integration-requestId": "$context.integration.requestId", + "integration-integrationStatus": "$context.integration.integrationStatus" +} +``` + +Key variables explained: + +- `requestTimeEpoch`: Epoch-millisecond timestamp. Use for programmatic analysis and Athena queries +- `extendedRequestId`: Maps to `x-amz-apigw-id` header. Needed for AWS Support escalations +- `accountId`: AWS account of the caller. Critical for IAM-authenticated and cross-account APIs +- `error-message`: API Gateway's own error message (e.g., "Authorizer error", "Endpoint request timed out") +- `error-responseType`: Gateway Response type triggered (e.g., `AUTHORIZER_FAILURE`, `INTEGRATION_TIMEOUT`, `THROTTLED`). Categorizes errors without execution logs +- `integration-integrationStatus`: Status code from the Lambda service itself (usually 200 even when the function errors) +- `integration-status`: Status code from your Lambda function code (for proxy integrations) + +### HTTP API Access Log Format + +HTTP API uses different `$context` variables. 
Key differences from REST API: + +- Uses `$context.routeKey` instead of `$context.resourcePath` +- No WAF, authenticate, or authorize phase variables (HTTP API does not have these phases) +- Authorizer variables are available (HTTP API supports JWT and Lambda authorizers) +- No execution logging; access logs are the only log source + +```json +{ + "requestId": "$context.requestId", + "ip": "$context.identity.sourceIp", + "userAgent": "$context.identity.userAgent", + "requestTime": "$context.requestTime", + "requestTimeEpoch": "$context.requestTimeEpoch", + "routeKey": "$context.routeKey", + "path": "$context.path", + "status": "$context.status", + "protocol": "$context.protocol", + "responseLength": "$context.responseLength", + "responseLatency": "$context.responseLatency", + "integrationLatency": "$context.integrationLatency", + "domainName": "$context.domainName", + "apiId": "$context.apiId", + "stage": "$context.stage", + "error-message": "$context.error.message", + "authorizer-error": "$context.authorizer.error", + "integration-error": "$context.integration.error", + "integration-status": "$context.integration.status", + "integration-latency": "$context.integration.latency", + "integration-integrationStatus": "$context.integration.integrationStatus" +} +``` + +### WebSocket API Access Log Format + +WebSocket APIs use connection-oriented variables instead of HTTP method/path: + +```json +{ + "requestId": "$context.requestId", + "extendedRequestId": "$context.extendedRequestId", + "connectionId": "$context.connectionId", + "eventType": "$context.eventType", + "routeKey": "$context.routeKey", + "connectedAt": "$context.connectedAt", + "requestTime": "$context.requestTime", + "requestTimeEpoch": "$context.requestTimeEpoch", + "ip": "$context.identity.sourceIp", + "userAgent": "$context.identity.userAgent", + "accountId": "$context.identity.accountId", + "status": "$context.status", + "domainName": "$context.domainName", + "apiId": "$context.apiId", + "stage": 
"$context.stage", + "error-message": "$context.error.message", + "error-responseType": "$context.error.responseType", + "authorizer-error": "$context.authorizer.error", + "authorizer-status": "$context.authorizer.status", + "authorizer-latency": "$context.authorizer.latency", + "authorizer-integrationLatency": "$context.authorizer.integrationLatency", + "integration-error": "$context.integration.error", + "integration-status": "$context.integration.status", + "integration-latency": "$context.integration.latency", + "integration-requestId": "$context.integration.requestId" +} +``` + +Key WebSocket-specific variables: + +- `connectionId`: Unique ID for the persistent WebSocket connection +- `eventType`: `CONNECT`, `MESSAGE`, or `DISCONNECT` +- `routeKey`: The matched route (`$connect`, `$disconnect`, `$default`, or custom route keys) +- `connectedAt`: Epoch timestamp when the connection was established + +### Enhanced Observability Variables + +API Gateway divides REST API requests into phases: **WAF -> Authenticate -> Authorizer -> Authorize -> Integration** + +Each phase exposes `$context.{phase}.status`, `$context.{phase}.latency`, and `$context.{phase}.error`. + +**Note on authorizer phase**: The authorizer has both `$context.authorizer.latency` (total authorizer latency) and `$context.authorizer.integrationLatency` (time spent in the authorizer Lambda/Cognito call). The difference is API Gateway overhead for the authorizer phase. 
+ +**Diagnosing 403 errors by phase:** + +- `$context.waf.status: 403` = WAF blocked the request +- `$context.authenticate.status: 403` = Invalid credentials (e.g., malformed SigV4) +- `$context.authorizer.status: 403` = Lambda authorizer returned Deny policy +- `$context.authorize.status: 403` with `$context.authorize.error: "The client is not authorized"` = Valid credentials but insufficient permissions (resource policy or IAM policy denied) + +**Key distinction (Lambda proxy integration):** + +- `$context.integration.integrationStatus`: Status code from the Lambda **service** (usually 200 even when the function throws an error) +- `$context.integration.status`: Status code from your Lambda **function code** (the `statusCode` field in your function's response) + +### Additional Access Log Variables + +- `$context.identity.apiKey`: Track which API keys are making requests +- `$context.identity.accountId`: Identify which AWS account is calling (IAM auth, cross-account) +- `$context.domainName`: Differentiate traffic across custom domains +- `$context.customDomain.routingRuleIdMatched`: Track routing rule matches +- `$context.tlsVersion`, `$context.cipherSuite`: Monitor TLS migration +- `$context.authorizer.principalId`: Principal from Lambda authorizer (for user-level tracing) +- `$context.authorizer.claims.sub`: Cognito user pool subject claim (for Cognito-authenticated APIs) +- Response streaming (REST only): `$context.integration.responseTransferMode`, `$context.integration.timeToAllHeaders`, `$context.integration.timeToFirstContent` + +## Setting Up Logging + +### Prerequisites for REST API and WebSocket + +1. Create IAM role with `AmazonAPIGatewayPushToCloudWatchLogs` managed policy +2. Set CloudWatch log role ARN in API Gateway Settings (region-level, one-time configuration) +3. Enable logging per stage + +### Prerequisites for HTTP API + +HTTP APIs do **not** use the account-level CloudWatch log role. Instead: + +1. Create the CloudWatch Logs log group +2. 
Specify the log group ARN when configuring the stage's access logging +3. API Gateway uses a service-linked role to write logs. Ensure the log group's resource-based policy allows `logs:CreateLogStream` and `logs:PutLogEvents` from the API Gateway service principal + +### Missing Logs Troubleshooting + +- IAM permissions incorrect (most common for REST/WebSocket) +- Log group resource policy missing (most common for HTTP API) +- Logging not enabled at stage level +- Method-level override disabling logging +- Log group does not exist (create it first or let API Gateway create it) + +## CloudWatch Metrics + +| Metric | Description | +| -------------------- | ----------------------------------------------------------------------------------------------------------------------- | +| `Count` | Total API requests | +| `Latency` | Time from API Gateway receiving the request to returning the response (does not include client-to-gateway network time) | +| `IntegrationLatency` | Time spent in backend integration | +| `4XXError` / `4xx` | Client error count. REST API: `4XXError`; HTTP API: `4xx` | +| `5XXError` / `5xx` | Server error count. 
REST API: `5XXError`; HTTP API: `5xx` | +| `CacheHitCount` | Cache hits (REST only) | +| `CacheMissCount` | Cache misses (REST only) | +| `DataProcessed` | Amount of data processed in bytes (HTTP API only) | + +- Default: metrics per API stage +- Detailed metrics: per method (enable on stage) +- Use CloudWatch Embedded Metric Format for business-specific metrics + +## CloudWatch Alarms + +### Recommended Alarms + +Always configure these alarms for production APIs: + +**Error rate alarms:** + +- `5XXError` rate > 1% of total requests: server errors indicate backend or configuration problems +- `4XXError` rate anomaly detection: spikes indicate breaking changes, auth failures, or abuse +- `IntegrationLatency` p99 > SLA threshold: detect backend degradation before timeouts + +**Throttling alarm:** + +- `Count` approaching account throttle limit (10,000 rps default). Alert at 80% utilization to request limit increases proactively + +**Cache alarms (REST API):** + +- Cache hit ratio (`CacheHitCount / (CacheHitCount + CacheMissCount)`) drop below threshold: indicates cache invalidation issues or misconfiguration + +### Alarm Examples (CloudFormation) + +```yaml +# REST API alarms: use ApiName dimension and 5XXError/4XXError metric names +# HTTP API alarms: use ApiId dimension and 5xx/4xx metric names instead +Api5xxAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${AWS::StackName}-api-5xx-errors" + MetricName: 5XXError + Namespace: AWS/ApiGateway + Dimensions: + - Name: ApiName + Value: !Ref MyApi + Statistic: Sum + Period: 60 + EvaluationPeriods: 3 + Threshold: 5 + ComparisonOperator: GreaterThanThreshold + TreatMissingData: notBreaching + AlarmActions: + - !Ref AlertSnsTopic + +ApiLatencyAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${AWS::StackName}-api-p99-latency" + MetricName: Latency + Namespace: AWS/ApiGateway + Dimensions: + - Name: ApiName + Value: !Ref MyApi + ExtendedStatistic: p99 + Period: 300 + 
EvaluationPeriods: 3 + Threshold: 5000 + ComparisonOperator: GreaterThanThreshold + TreatMissingData: notBreaching + AlarmActions: + - !Ref AlertSnsTopic +``` + +### Composite Alarms + +Combine signals to reduce noise: + +- High 5xx AND high latency = likely backend failure (page on-call) +- High 4xx only = likely client-side issue (lower priority) + +## CloudWatch Metric Filters + +Create custom CloudWatch metrics from access log patterns. Metric filters run on the log group and extract numeric values or count pattern matches. + +### Error Count by Response Type + +``` +{ $.["error-responseType"] = "THROTTLED" } +``` + +Publishes a metric counting throttled requests. Useful since excessive 429s may not be logged by API Gateway itself. + +### Slow Requests + +``` +{ $.responseLatency > 5000 } +``` + +Counts requests exceeding 5 seconds. Can alarm on this custom metric for tighter latency SLOs than the built-in p99. + +### Requests by API Key + +Add `"apiKey": "$context.identity.apiKey"` to your log format first, then use: + +``` +{ $.apiKey != "-" } +``` + +Use with metric dimensions to track per-consumer request volumes. + +## X-Ray Tracing + +- **REST API**: Active tracing supported; enable per stage. API Gateway creates the trace segment and adds trace headers to integration requests +- **HTTP API**: X-Ray tracing is [not supported](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-vs-rest.html). 
For distributed tracing, enable X-Ray active tracing on downstream Lambda functions and correlate using the `$context.integration.requestId` access log variable +- Configure sampling rules to control costs and recording criteria +- Service map for latency visualization +- Cross-account tracing requires CloudWatch Observability Access Manager (OAM) configuration between monitoring and source accounts + +### Enabling X-Ray in SAM/CloudFormation + +```yaml +# REST API with SAM implicit API +Globals: + Api: + TracingEnabled: true + +# Explicit REST API stage +MyApiStage: + Type: AWS::ApiGateway::Stage + Properties: + TracingEnabled: true + StageName: prod + RestApiId: !Ref MyApi +``` + +For end-to-end distributed tracing, enable X-Ray in both API Gateway and downstream Lambda functions (`Tracing: Active` in SAM function properties). Use X-Ray Groups to filter traces by error, fault, or latency thresholds. + +## CloudWatch Logs Insights Queries + +### Find 5xx Errors + +``` +fields @timestamp, status, requestId, ip, resourcePath, integrationLatency +| filter status >= 500 +| sort @timestamp desc +| limit 100 +``` + +### Latency Analysis + +``` +fields @timestamp, responseLatency, integrationLatency, resourcePath +| stats avg(responseLatency) as avgLatency, max(responseLatency) as maxLatency, + avg(integrationLatency) as avgIntegration by resourcePath +| sort avgLatency desc +``` + +### Top Talkers + +``` +fields ip +| stats count(*) as requestCount by ip +| sort requestCount desc +| limit 20 +``` + +### Per-Domain Analytics + +``` +filter domainName like /(?i)(api.example.com)/ +| stats count(*) as requests, avg(responseLatency) as avgLatency by resourcePath +| sort requests desc +``` + +### Diagnose 403 Errors by Phase + +``` +fields @timestamp, requestId, ip, resourcePath +| filter status = 403 +| stats count(*) as cnt + by coalesce(`waf-status`, "-") as waf, + coalesce(`authenticate-status`, "-") as authn, + coalesce(`authorizer-status`, "-") as authzr, + 
coalesce(`authorize-status`, "-") as authz +| sort cnt desc +``` + +### Find Specific Gateway Response Types + +``` +fields @timestamp, requestId, `error-responseType`, `error-message`, status +| filter ispresent(`error-responseType`) +| stats count(*) as cnt by `error-responseType` +| sort cnt desc +``` + +## Additional Monitoring Tools + +- **CloudWatch Synthetics**: Canaries for synthetic monitoring of endpoints on schedule +- **CloudWatch Application Insights**: Automated dashboards for problem detection +- **CloudWatch Contributor Insights**: Find top talkers and contributors; pre-built sample rules for API Gateway +- **CloudWatch Dashboards**: Include dashboard definitions in IaC templates +- **CloudWatch ServiceLens**: Integrates traces, metrics, logs, alarms, resource health + +## CloudWatch Embedded Metrics Format + +- Include metric data in structured logs sent to CloudWatch Logs +- CloudWatch extracts metrics automatically (**cheaper than PutMetricData API**) +- Use for custom business metrics (e.g., orders per minute, revenue per endpoint) +- Include dashboard definitions in IaC templates with both operational and business metrics + +## AI-Assisted Operations + +- **CloudWatch AI Operations**: Specify a time window, it correlates logs across services and generates root cause hypothesis +- **Amazon Q CLI**: Natural language troubleshooting ("Why do I see increased 500 errors from API Gateway in this stack?") +- **CloudWatch Logs Insights**: Supports natural language to query translation and auto-generated pattern summaries + +## API Analytics Pipeline + +For deep analytics beyond CloudWatch dashboards: + +1. Stream access logs via Amazon Data Firehose +2. Enrich with Lambda transformation (add business context, geo-IP lookup) +3. Store in S3 (partitioned by date/API/stage) +4. Query with Amazon Athena +5. 
Visualize with Amazon QuickSight + +**Cost tip**: Firehose-to-S3 ingestion (~$0.029/GB) is significantly cheaper than CloudWatch Logs ingestion (~$0.50/GB). For high-volume APIs, stream access logs to Firehose instead of CloudWatch Logs and query with Athena. Use CloudWatch Logs for execution logs (lower volume) and real-time Logs Insights queries. + +## Cross-Account and Centralized Logging + +For multi-account AWS Organizations setups: + +1. **CloudWatch cross-account observability**: Use Observability Access Manager (OAM) to share metrics, logs, and traces from source accounts to a central monitoring account. Enables unified dashboards and alarms across all API Gateways +2. **Subscription filters**: Stream access logs from each account to a central Kinesis Data Stream or Firehose in the monitoring account for aggregated analysis +3. **Consistent log group naming**: Use a standard naming convention across accounts (e.g., `/aws/apigateway///access-logs`) to simplify cross-account queries and cost attribution + +## CloudTrail + +- Captures all API Gateway management calls as control plane events +- Does NOT log data plane events (actual API requests); use access logs for that +- Determines: request made, IP address, who made it, when +- Use for audit and compliance, not operational monitoring +- **Do not forget CloudTrail for control plane audit**: who changed API configuration and when diff --git a/plugins/aws-serverless/skills/api-gateway/references/performance-scaling.md b/plugins/aws-serverless/skills/api-gateway/references/performance-scaling.md new file mode 100644 index 0000000..647f45b --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/performance-scaling.md @@ -0,0 +1,162 @@ +# Performance and Scaling + +## Throttling + +### Account-Level Defaults + +**Note**: Throttling values below are **default quotas**; most are adjustable via AWS Support or Service Quotas console. 
Do not use defaults for capacity planning without checking your account's current limits. See [latest quotas](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html). + +- **10,000 requests per second** steady-state across all REST APIs, HTTP APIs, WebSocket APIs, and WebSocket callback APIs in a region (shared quota) +- **5,000 burst** capacity (token bucket algorithm) +- These are defaults; request increases via AWS Support or Service Quotas + +### Stage-Level and Method-Level + +- Stage/method-level default throttle: configure via MethodSettings on the stage (REST API) +- Per-consumer throttle: configure via usage plans (REST API only) +- Method-level throttling overrides stage-level +- Max method-level throttle settings per stage: 20 +- Format: `resourcePath/httpMethod` (e.g., `/pets/GET`) + +### HTTP API Throttling + +- Route-level throttling only (no usage plans or API keys) +- Configure via stage settings +- Limits apply globally across all callers, with no per-consumer throttling. For per-consumer rate limiting on HTTP API, implement in a Lambda authorizer or backend + +### Usage Plans (REST API Only) + +- **Quota**: Requests per day, week, or month +- **Throttle**: Rate (requests/second) + burst +- **RPS limits are per API key**, not split across keys. If a usage plan allows 100 rps and has 10 keys, each key gets 100 rps (not 10 rps each) +- Combine with API keys to track and limit per-consumer usage +- Max 300 usage plans per region (adjustable), 10,000 API keys per region +- **API key source**: `HEADER` (default, `x-api-key`) or `AUTHORIZER` (Lambda returns key in `usageIdentifierKey`) +- **Do not associate one API key with multiple usage plans** that cover the same API stage; API Gateway picks one plan non-deterministically. One key per plan per stage is safe; a usage plan can have many keys + +### Token Bucket Algorithm + +- Bucket size = burst capacity (5,000 tokens). 
Refill rate = steady-state rate (10,000 tokens/second) +- Each request consumes one token. If the bucket is empty, the request is throttled and receives `429 Too Many Requests` +- The bucket holds at most 5,000 tokens, so at most 5,000 requests can be admitted at the same instant. Tokens are added back at 10,000 per second, so sustained throughput can reach 10,000 rps as long as requests are spread across the second. The burst value is smaller than the per-second rate because the two measure different things: bucket depth versus refill speed + +## Caching (REST API Only) + +### Configuration + +- Cache sizes: 0.5 GB, 1.6 GB, 6.1 GB, 13.5 GB, 28.4 GB, 58.2 GB, 118 GB, 237 GB +- **Default TTL**: 300 seconds +- **Max TTL**: 3,600 seconds +- **TTL=0**: Disables caching +- **Max cached response size**: 1,048,576 bytes (1 MB) +- Only **GET methods** are cached by default +- Cache is **best-effort**, not guaranteed to cache every response +- Cache charges apply per hour regardless of usage; only provision when you have a clear caching use case + +### Cache Keys + +- Default: resource path +- Add headers, query strings, and path parameters as additional cache keys +- More cache keys = more granular caching but lower hit rate +- Include client identity in the cache key when responses are client-specific, to avoid leaking data across clients + +### Cache Invalidation + +- A client can invalidate a cached entry by sending the `Cache-Control: max-age=0` header +- Invalidation requests can be restricted to authorized callers +- Entire stage cache can be flushed via console or API +- **Automatic flush on redeployment**: Creating a new deployment to a stage flushes the entire cache, causing a temporary backend load spike ("thundering herd"). 
See `references/deployment.md` for mitigations + +### Cache Encryption + +- Encryption at rest is available as an option when provisioning the cache + +### Metrics + +- `CacheHitCount`, `CacheMissCount` in CloudWatch +- Monitor miss rate to determine if cache size is adequate + +### Capacity Selection + +1. Run a load test against the API +2. Monitor `CacheHitCount`, `CacheMissCount`, `Latency` +3. Start with a smaller cache and scale up based on miss rates +4. Cache resizing takes time; plan ahead of peak traffic + +## Scaling Considerations + +### API Gateway Scales Automatically + +- Managed service; no capacity provisioning needed +- Know your service quotas and request increases before you need them + +### Scale the Entire Stack + +- High API Gateway quotas do not help if the backend cannot handle the load +- Consider: Lambda concurrency limits, DynamoDB provisioned capacity, RDS connection limits, ECS/EKS scaling policies +- Use **automatic quota management** in AWS Service Quotas for proactive adjustment + +### Strategies for Global Scale + +- **Edge-optimized endpoints**: Route to nearest CloudFront POP. **Note**: Edge-optimized endpoints do NOT cache at the edge; they only route through CloudFront POPs to optimize TCP connections. For edge caching, use a separate CloudFront distribution in front of a Regional API +- **Self-managed CloudFront distribution**: More control over caching, WAF, and custom behaviors. This is the only way to get actual edge caching +- **Multi-region deployment**: Active-active with Route 53 latency-based routing + +### Multi-Layer Caching Strategy + +For maximum performance, layer caches: + +1. **CloudFront**: Edge caching (reduces latency, load, AND cost, since cache hits never reach API Gateway) +2. **API Gateway cache** (REST only): Regional caching (reduces latency and backend load but NOT API Gateway cost, as each request is still counted) +3. 
**Application-level cache**: ElastiCache or DAX for database query caching + +- CloudFront caching should be the first choice, as it provides the most benefit + +### API Gateway Billing Notes + +- Lambda authorizer invocations are billed by Lambda even if the request is ultimately rejected by the authorizer or by throttling. This is the "Distributed Denial of Wallet" vector below + +### Load Shedding + +- Use API Gateway request validation to reject invalid requests early (before hitting backend) +- Configure appropriate throttle limits per consumer tier +- Use WAF rate-based rules for DDoS protection +- **"Distributed Denial of Wallet" risk**: Without WAF, DDoS traffic invokes Lambda authorizers for every request, driving up Lambda costs (even though API Gateway itself doesn't charge for the rejected requests). WAF blocks malicious traffic before it reaches the authorizer + +### Cell-Based Architecture + +- Use multi-account approaches for blast radius control +- Each cell has its own API Gateway, Lambda, and database +- Route traffic to cells via custom domain and routing rules + +## Payload Compression + +### API Gateway Native Compression + +- `minimumCompressionSize`: Set the smallest payload size (in bytes) to compress automatically. Range: 0 bytes to 10 MB +- **Test with real payloads**: compressing very small payloads can actually increase the final size. Find the optimal threshold for your data +- Works bidirectionally: API Gateway decompresses incoming requests (client sends `Content-Encoding` header) before applying mapping templates, and compresses outgoing responses (client sends `Accept-Encoding: gzip`) after applying response mapping templates +- Most effective for text-based formats (JSON, XML, HTML, YAML). 
Binary data (PDF, JPEG) compresses poorly +- Set at the API level on REST APIs +- **Benchmark**: 1 MB JSON payload compressed to 220 KB (78% reduction), response latency improved by 110 ms + +### Compressed Passthrough to Lambda + +- With native compression, API Gateway decompresses payloads before delivering to Lambda, so the decompressed payload is still subject to Lambda's 6 MB synchronous invoke limit +- To bypass this limit, configure `binaryMediaTypes: ["application/gzip"]` so API Gateway passes compressed payloads directly to Lambda without decompressing +- Lambda then handles decompression in function code, enabling transport of payloads several times larger than the 6 MB limit +- Lambda returns compressed responses with `isBase64Encoded: true` and `Content-Encoding: gzip` headers + +### Compression Trade-offs + +- Compression is CPU-intensive in Lambda, adding ~124 ms for 1 MB JSON on a 1 GB ARM function +- Always benchmark with payloads representative of your workload before enabling + +## Handling Large Payloads + +- **10 MB API Gateway limit** (REST and HTTP): For payloads exceeding this, use S3 presigned URLs. Client uploads/downloads directly to S3, API returns the presigned URL +- **6 MB Lambda synchronous invoke limit**: Use compressed passthrough (binary media types) to transport larger payloads, or use S3 presigned URLs +- **Response streaming** (REST API only): Supports up to 15-minute sessions, first 10 MB unrestricted then bandwidth-limited to 2 MB/s. 
Useful for LLM responses and large datasets +- **Lambda Function URLs**: Response streaming removes the 6 MB buffered response limit; streamed responses can be much larger (subject to function timeout and bandwidth) +- For SQS/EventBridge/Lambda async invocations (1 MB limit): Use compression or store payload in S3 and pass a reference in the message diff --git a/plugins/aws-serverless/skills/api-gateway/references/pitfalls.md b/plugins/aws-serverless/skills/api-gateway/references/pitfalls.md new file mode 100644 index 0000000..7876bd7 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/pitfalls.md @@ -0,0 +1,37 @@ +# Additional API Gateway Pitfalls + +These supplement the critical pitfalls listed in the main skill file. Consult when designing or debugging API Gateway configurations. + +## Header Handling + +- **API Gateway drops/remaps certain headers**: `Authorization` conditionally dropped on requests (when containing SigV4 signature or using IAM auth), `Host` overwritten on requests with integration endpoint hostname, `Content-MD5` dropped on requests. Plan accordingly for header passthrough + +## URL Encoding + +- **Pipe `|` and curly braces `{}` must be URL-encoded** in REST API query strings. **Semicolons `;` must be URL-encoded** in HTTP and WebSocket API query strings (they cause data splitting) + +## Throttling + +- **Throttle limits are best-effort, not hard guarantees**. Brief spikes above limits may occur + +## Caching + +- **Cache charges apply even when cache is empty**. Only enable caching when you have a clear use case +- **Edge-optimized endpoints do NOT cache at the edge**. They only route through CloudFront POPs for optimized TCP connections. 
For actual edge caching, use a separate CloudFront distribution with a Regional API + +## Usage Plans and API Keys + +- **Do not associate one API key with multiple usage plans** covering the same API stage; API Gateway picks one plan non-deterministically +- **Usage plan RPS limits are per API key**: 100 rps with 10 keys means each key gets 100 rps, not 10 rps each + +## Logging Costs + +- **Execution logging at INFO level generates many log events per request** (10-60+ depending on API complexity). CloudWatch Logs costs can exceed Lambda + API Gateway combined at scale. Use ERROR level in production + +## Canary Deployments + +- **Canary deployments test API Gateway deployment snapshots** (resources, integrations, mapping templates, authorizers), not Lambda code directly. Stage variable overrides can route canary traffic to different Lambda aliases. For Lambda code canary without API changes, use Lambda aliases with weighted routing + +## Management API + +- **Management API rate limit: 10 rps / 40 burst**. Heavy automation can hit this diff --git a/plugins/aws-serverless/skills/api-gateway/references/requirements-gathering.md b/plugins/aws-serverless/skills/api-gateway/references/requirements-gathering.md new file mode 100644 index 0000000..ac8c6d4 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/requirements-gathering.md @@ -0,0 +1,275 @@ +# API Requirements Gathering Guide + +When helping users define requirements for APIs built on Amazon API Gateway, guide them through a structured process. Ask one question at a time; do not overwhelm with lists of questions. + +## Workflow + +1. Start with API purpose and overview (including multi-tenancy, topology, cost) +2. Determine the right API type (REST, HTTP, WebSocket) based on requirements +3. Progress through each category systematically (skip WebSocket section if not applicable) +4. Ask clarifying questions for gaps +5. 
Generate a final requirements summary for confirmation + +## Requirements Categories + +### 1. API Purpose and Overview + +- What problem does the API solve? +- Who are the primary consumers (browsers, mobile apps, other services, IoT devices, AI agents)? +- Expected usage volume (requests per day/hour/second)? +- Is this a public API, internal API, or partner API? +- Will AI agents or LLMs consume this API? (Affects documentation depth, error message design, pagination style, and monetization; see architecture-patterns.md "Designing APIs for AI Agent Consumption") +- Is this a multi-tenant API? (Affects throttling, isolation, and architecture; see architecture-patterns.md "Multi-Tenant SaaS") + - Per-tenant throttling tiers needed (bronze/silver/gold)? (Requires REST API usage plans) + - Tenant isolation level? (Throttling only, or compute isolation via Lambda `TenantIsolationMode: PER_TENANT`?) + - Noisy-neighbor prevention requirements? +- Account topology? (Single account for all APIs, separate accounts per domain/application, or central API account with centralized governance? Affects cross-account access, observability aggregation, and custom domain management) +- Cost sensitivity or budget constraints? (HTTP API is significantly cheaper than REST API but has fewer features: no caching, no WAF, no usage plans, no request validation, no VTL, hard 30s timeout. Choose based on feature needs vs cost) + +### 2. Endpoints and Operations + +- Resources to expose (users, orders, products, etc.)? +- Operations per resource (GET, POST, PUT, DELETE, PATCH)? +- URL paths and naming conventions? +- Nested resources or relationships? +- Query parameters for filtering, sorting, pagination? +- API versioning strategy? (URL path `/v1/` is simplest; header-based `X-API-Version` via routing rules is cleanest for REST API; query parameter `?version=1` is least recommended) +- How many concurrent API versions to support? +- Deprecation and sunsetting plan for old versions? 
+ +### 3. Request/Response Specifications + +- Data sent in request bodies? +- Required or optional headers? +- Query parameters needed? +- Response format (JSON, XML, binary)? +- Binary content types? (Images, PDFs, files require `binaryMediaTypes` configuration; avoid `*/*` wildcard as it breaks Lambda proxy JSON responses) +- Expected response codes for success and error scenarios? +- Need for multi-value query parameters or headers? + +### 4. Data Models and Schemas + +- Domain entities and their attributes? +- Data types for each field? +- Required vs optional fields? +- Validation rules and constraints? +- Enumerations or fixed value sets? +- Need for request body validation? (REST API only supports JSON Schema draft 4) + +### 5. Authentication and Authorization + +- Authentication method? + - **IAM (SigV4)**: Best for AWS-to-AWS service calls + - **Lambda authorizer**: Custom logic, third-party IdPs, bearer tokens + - **JWT authorizer**: HTTP API only, automatic OIDC/OAuth 2.0 validation + - **Cognito user pools**: REST API native, or JWT authorizer on HTTP API + - **API keys**: Not for primary auth; use for throttling/metering with usage plans + - **mTLS**: Certificate-based, good for B2B and IoT +- Authorization model (RBAC, ABAC, resource-based)? +- Different permission levels or user roles? +- IP whitelisting or VPC restrictions needed? +- Cross-account access requirements? + +### 6. Integration Requirements + +- Backend services (Lambda, DynamoDB, RDS, ECS/EKS, on-premises)? +- Direct AWS service integrations without Lambda? REST API supports any AWS service via VTL mapping templates (SQS, EventBridge, Step Functions, S3, DynamoDB, etc.). HTTP API supports a subset via [first-class integrations](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-develop-integrations-aws-services.html) with parameter mapping (SQS, EventBridge, Step Functions, Kinesis, AppConfig) +- VPC integrations for private resources? 
(VPC Link required; REST uses NLB, HTTP uses ALB/NLB/Cloud Map) +- Does the API need to be completely private (VPC-only access)? (REST API only; requires `execute-api` VPC endpoint + resource policy) +- Private API as proxy to external APIs? (Provides centralized logging, throttling, and access control for outbound calls from isolated VPCs without NAT gateway; security teams must be aware of this egress path) +- On-premises integrations? (Requires VPN/Direct Connect + NLB + VPC Link. Consider connectivity stability and health check tuning) +- Data transformations needed between request/response and backend? +- Need for response streaming (large payloads, LLM responses)? (REST API only; max 15-minute sessions, first 10 MB unrestricted then 2 MB/s bandwidth limit. No VTL response transformation, no caching with streaming) +- Binary data handling needed (images, PDFs, files)? (Configure `binaryMediaTypes`; avoid `*/*` wildcard as it breaks Lambda proxy JSON responses) +- File upload/download requirements? (Direct through API Gateway up to 10 MB, or presigned S3 URLs for larger files) +- Synchronous or asynchronous processing? (REST API default 29s timeout, increasable up to 300s for Regional/Private via quota request. HTTP API has hard 30s limit. For longer operations or better UX, consider async patterns: SQS, EventBridge, Step Functions) + +### 7. WebSocket Requirements (if WebSocket API selected) + +- Route selection expression? (e.g., `$request.body.action` for JSON messages) +- Custom routes beyond `$connect`/`$disconnect`/`$default`? +- Session management strategy? (Store connectionId with user ID in DynamoDB on `$connect`; GSI on user ID for targeted messaging) +- Message patterns? (Request-response, server push/broadcast, targeted push to specific users) +- Expected concurrent connections and message throughput? +- Client resilience requirements? 
(Automatic reconnect with exponential backoff is mandatory; 2-hour max connection duration, 10-minute idle timeout require client-side handling) +- Heartbeat/keep-alive strategy? (Send periodic messages every 5-9 minutes to prevent idle timeout) +- Connection state recovery on reconnect? (Re-authenticate, re-subscribe to topics, restore application state) +- Backend message delivery? (Lambda via `@connections` Management API; handle `GoneException` for stale connections) +- Multi-region WebSocket? (ConnectionId is region-specific; cross-region message propagation via EventBridge or DynamoDB Streams) + +### 8. Performance and Scalability + +- Expected peak request rates (per second)? (Account default: 10,000 rps / 5,000 burst across all APIs in a region, adjustable via Service Quotas) +- Latency requirements (target response time)? +- Need for response caching? (REST API only, TTL 0-3600s, 0.5-237 GB) +- Multi-layer caching strategy? (CloudFront edge → API Gateway regional → application-level) +- Throttling requirements (rate limit, burst limit)? +- Different throttling tiers for different consumers? (Requires REST API usage plans for per-API-key rate limits) +- Expected payload sizes? (Max 10 MB for REST/HTTP, consider presigned S3 URLs for larger files, or compressed passthrough via binary media types to exceed Lambda's 6 MB limit) +- Large payload strategy? (Presigned S3 URLs for >10 MB; response streaming for REST API up to 15-minute sessions; compressed passthrough for >6 MB Lambda payloads) +- Need for payload compression? (`minimumCompressionSize` on REST API; reduces bandwidth, latency, and data transfer costs through NAT Gateway/VPC Endpoints) +- Per-tenant or per-consumer throttling tiers? (Requires REST API usage plans; HTTP API has no per-consumer throttling natively) + +### 9. Error Handling + +- Custom error response format? +- CORS headers needed on error responses? +- Specific gateway response customizations? +- How to communicate validation errors? 
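
Once the error-handling questions above are answered, the result usually lands in a gateway response definition. The fragment below is a minimal sketch of one possible outcome, assuming a REST API defined with SAM. The resource name `MyApi`, the origin `https://example.com`, and the shape of the JSON error body are illustrative assumptions rather than a recommended standard.

```yaml
# Hypothetical sketch: MyApi, the origin, and the error body fields are assumptions.
MyApi:
  Type: AWS::Serverless::Api
  Properties:
    StageName: prod
    GatewayResponses:
      # Returned when request body validation fails
      BAD_REQUEST_BODY:
        StatusCode: "400"
        ResponseParameters:
          Headers:
            # CORS header on the error response so browsers can read the body
            Access-Control-Allow-Origin: "'https://example.com'"
        ResponseTemplates:
          # $context.error.messageString is already quoted, so it is not wrapped in quotes here
          application/json: '{"error": "validation_failed", "message": $context.error.messageString, "requestId": "$context.requestId"}'
```

The same pattern can be repeated for other gateway response types such as `DEFAULT_4XX` and `DEFAULT_5XX`. Confirm the exact error body with API consumers before standardizing on it.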
+ +### 10. Observability + +- Execution logging level (ERROR recommended for production, INFO for debugging)? (REST/WebSocket only; HTTP API does not support execution logging) +- Custom access log format requirements? (Use enhanced observability variables for phase-level troubleshooting) +- AWS X-Ray tracing needed? (REST API only) +- Custom CloudWatch metrics? (Consider CloudWatch Embedded Metrics Format for business metrics) +- Alerting requirements (latency thresholds, error rate, throttle count)? +- API analytics pipeline needed? (Firehose → S3 → Athena → QuickSight for deep analytics beyond CloudWatch) + +### 11. Security Requirements + +- AWS WAF needed? (REST API direct; HTTP API via CloudFront + WAF) +- Which WAF managed rules? (Core Rule Set, SQL injection, Known Bad Inputs, IP Reputation at minimum) +- CORS configuration (which origins, methods, headers)? Don't forget CORS headers on gateway error responses (DEFAULT_4XX, DEFAULT_5XX) +- TLS version requirements (TLS 1.2 minimum recommended)? +- mTLS for client certificate authentication? (REST/HTTP native on custom domain, or CloudFront viewer mTLS for any API type) +- Certificate revocation checking needed? (Lambda authorizer + DynamoDB, or CloudFront Connection Functions + KeyValueStore) +- Data encryption requirements? (Cache encryption at rest is off by default) +- Compliance requirements (HIPAA, PCI-DSS, SOC2)? +- Need to disable default execute-api endpoint? (Force traffic through custom domain) +- DDoS protection considerations? (WAF rate-based rules, Shield) + +### 12. Deployment, Environment, and Testing + +- How many environments (dev, staging, production)? +- Separate stacks per environment (full isolation) or stages within one API? +- Canary deployment needed? (REST API only; tests API configuration, not Lambda code) +- Blue/green deployment strategy? (Custom domain API mappings for zero-downtime switching) +- Stage variables for environment-specific config? 
+- CI/CD pipeline (CodePipeline, GitHub Actions, GitLab CI, etc.)? +- IaC tool (SAM, CDK, CloudFormation, Terraform)? +- Local testing approach? (`sam local start-api` supports Lambda proxy/non-proxy only; direct service integrations require a deployed dev stage or API Gateway console Test Invoke) +- Integration testing strategy? (Deployed dev stage, contract testing, synthetic canaries) +- Load/performance testing plans? (Identify bottlenecks before production; test full stack, not just API Gateway) + +### 13. Custom Domain and Routing + +- Custom domain name needed? +- Endpoint type? Edge-optimized (default for global clients; optimizes TCP connections but does not cache at edge), Regional (same-region clients, or pair with own CloudFront distribution when edge caching, edge compute, granular WAF control, or geo-based routing is needed), Private (VPC-only access, REST API only) +- Multiple APIs behind one domain? REST API: use routing rules (preferred, supports header-based routing) or base path mappings. HTTP API/WebSocket: use base path mappings (routing rules are REST API only) +- Header-based routing needed? (API versioning, A/B testing, tenant routing; requires routing rules, REST API only) +- Multi-region deployment? (Active-passive failover or active-active with Route 53 latency-based routing) +- Data consistency requirements for multi-region? (DynamoDB Global Tables uses last-writer-wins; conditional writes do NOT prevent cross-region conflicts. For conflict-sensitive data, use single-region write routing or commutative operations. Data sovereignty concerns may require geo-based routing instead of latency-based) +- Cross-account topology? (Central API account with shared domain, or per-team subdomains) + +### 14. Governance and Compliance + +- Organization-wide API standards to enforce? (SCPs for preventative, CloudFormation Hooks for proactive, AWS Config for detective controls) +- Required tags on API resources? +- Control plane access restrictions? 
(Who can create, modify, deploy APIs) +- Audit requirements? (CloudTrail for control plane, access logs for data plane) +- API documentation and developer portal needed? +- API lifecycle management (versioning, deprecation, sunsetting)? See also "Endpoints and Operations" for versioning strategy details + +## Output Format + +```markdown +# API Requirements Summary + +## Overview + +- API Name: [name] +- API Type: [REST API / HTTP API / WebSocket API] +- Endpoint Type: [Edge-optimized / Regional / Private] +- Purpose: [description] +- Target Consumers: [who] +- Expected Volume: [requests/day, peak rps] +- Multi-tenant: [yes/no, isolation level] +- Account Topology: [single account / per-domain accounts / central API account] +- Cost Sensitivity: [budget constraints, API type preference] + +## Endpoints + +| Resource | Method | Path | Description | Auth | +| -------- | ------ | ---- | ----------- | ---- | +| ... | ... | ... | ... | ... | + +## API Versioning + +- Strategy: [URL path / header-based / query parameter] +- Concurrent Versions: [number] +- Deprecation Policy: [timeline, communication plan] + +## Authentication and Authorization + +- Method: [auth method] +- Authorization Model: [model] +- Roles/Permissions: [details] + +## Data Models + +[Entity definitions with attributes and types] + +## Binary Data and Large Payloads + +- Binary Media Types: [list or none] +- File Upload/Download: [direct API / presigned S3 URLs] +- Max Expected Payload Size: [size] + +## WebSocket Requirements (if applicable) + +- Route Selection Expression: [field] +- Custom Routes: [list] +- Session Management: [connection tracking strategy] +- Message Patterns: [request-response / broadcast / targeted push] +- Expected Concurrent Connections: [number] +- Client Resilience: [reconnect, heartbeat strategy] + +## Performance Requirements + +- Target Latency: [ms] +- Rate Limits: [requests/second] +- Burst Limit: [requests] +- Caching: [yes/no, TTL] +- Per-Tenant Throttling: [yes/no, 
tier structure] + +## Security Requirements + +- WAF: [yes/no] +- CORS: [origins, methods, headers] +- TLS: [minimum version] +- mTLS: [yes/no] +- Compliance: [standards] + +## Observability + +- Execution Logging: [level] +- Access Logging: [format] +- X-Ray Tracing: [yes/no] +- Key Metrics: [list] +- Alarms: [list] + +## Deployment and Testing + +- Environments: [list] +- Strategy: [canary/blue-green/direct] +- IaC Tool: [SAM/CDK/CloudFormation/Terraform] +- CI/CD: [pipeline tool] +- Local Testing: [sam local / deployed dev stage] +- Integration Testing: [approach] + +## Custom Domain and Routing + +- Domain: [domain name] +- Endpoint Type: [edge-optimized/regional/private] +- Routing: [routing rules (recommended) / base path mappings] +- Header-Based Routing: [yes/no, use case] +- Multi-region: [yes/no, strategy] +- Data Consistency: [conflict resolution strategy] + +## Governance + +- Tag Requirements: [required tags] +- Audit: [CloudTrail/Config requirements] +- Standards Enforcement: [SCPs/Hooks/Config rules] +``` diff --git a/plugins/aws-serverless/skills/api-gateway/references/sam-cloudformation.md b/plugins/aws-serverless/skills/api-gateway/references/sam-cloudformation.md new file mode 100644 index 0000000..459203b --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/sam-cloudformation.md @@ -0,0 +1,229 @@ +# SAM and CloudFormation Patterns + +## Table of Contents + +- [OpenAPI Extensions](#openapi-extensions) +- [Common SAM Patterns](#common-sam-patterns) + - [Custom Domain with Base Path Mapping](#custom-domain-with-base-path-mapping) +- [Infrastructure Patterns](#infrastructure-patterns) + - [Private API with VPC Endpoint](#private-api-with-vpc-endpoint) + - [Gateway Responses with CORS Headers](#gateway-responses-with-cors-headers) + - [Response Streaming](#response-streaming) + - [Routing Rules](#routing-rules) +- [Key Pitfalls](#key-pitfalls) +- [VTL Mapping Templates (REST API)](#vtl-mapping-templates-rest-api) +- [HTTP API 
Parameter Mapping](#http-api-parameter-mapping) +- [Binary Data Handling](#binary-data-handling) + +For direct AWS service integration templates (EventBridge, SQS, DynamoDB, Kinesis, Step Functions), see `references/sam-service-integrations.md`. + +--- + +## OpenAPI Extensions + +Define API Gateway configuration inline in OpenAPI specs: + +| Extension | Purpose | +| ---------------------------------------------- | ------------------------------------------------------------ | +| `x-amazon-apigateway-integration` | Integration configuration (Lambda, HTTP, AWS service, mock) | +| `x-amazon-apigateway-request-validators` | Request validation rules | +| `x-amazon-apigateway-binary-media-types` | Binary content type registration | +| `x-amazon-apigateway-gateway-responses` | Custom error responses | +| `x-amazon-apigateway-cors` | HTTP API CORS configuration | +| `x-amazon-apigateway-endpoint-configuration` | Endpoint type, VPC endpoint IDs, `disableExecuteApiEndpoint` | +| `x-amazon-apigateway-authorizer` | Authorizer definitions | +| `x-amazon-apigateway-policy` | Embedded resource policy | +| `x-amazon-apigateway-minimum-compression-size` | Payload compression threshold | +| `x-amazon-apigateway-integrations` | Reusable integration components (HTTP API only) | +| `x-amazon-apigateway-importexport-version` | OpenAPI 3.0 export format version | + +## Common SAM Patterns + +For basic Lambda proxy and auth SAM templates (JWT authorizer, Cognito authorizer, Lambda authorizer, API keys), see the [aws-lambda web-app-deployment reference](../../aws-lambda/references/web-app-deployment.md). 
+ +### Custom Domain with Base Path Mapping + +```yaml +MyDomain: + Type: AWS::ApiGatewayV2::DomainName + Properties: + DomainName: api.example.com + DomainNameConfigurations: + - EndpointType: REGIONAL + CertificateArn: !Ref MyCertificate + +MyMapping: + Type: AWS::ApiGatewayV2::ApiMapping + Properties: + DomainName: !Ref MyDomain + ApiId: !Ref MyApi + Stage: !Ref MyStage + ApiMappingKey: v1/orders +``` + +## Infrastructure Patterns + +### Private API with VPC Endpoint + +```yaml +MyApi: + Type: AWS::Serverless::Api + Properties: + StageName: prod + EndpointConfiguration: + Type: PRIVATE + VPCEndpointIds: + - !Ref VpcEndpoint + Auth: + ResourcePolicy: + CustomStatements: + - Effect: Allow + Principal: "*" + Action: execute-api:Invoke + Resource: execute-api:/* + Condition: + StringEquals: + aws:SourceVpce: !Ref VpcEndpoint +``` + +### Gateway Responses with CORS Headers + +```yaml +MyApi: + Type: AWS::Serverless::Api + Properties: + StageName: Prod + Cors: + AllowMethods: "'GET,POST,OPTIONS'" + AllowHeaders: "'Content-Type,Authorization'" + AllowOrigin: "'https://example.com'" + GatewayResponses: + DEFAULT_4XX: + ResponseParameters: + Headers: + Access-Control-Allow-Origin: "'https://example.com'" + Access-Control-Allow-Headers: "'Content-Type,Authorization'" + DEFAULT_5XX: + ResponseParameters: + Headers: + Access-Control-Allow-Origin: "'https://example.com'" + Access-Control-Allow-Headers: "'Content-Type,Authorization'" +``` + +### Response Streaming + +```yaml +MyFunction: + Type: AWS::Serverless::Function + Properties: + Handler: app.handler + Runtime: nodejs20.x + Events: + Stream: + Type: Api + Properties: + Path: /stream + Method: get + RestApiId: !Ref MyApi + +MyApi: + Type: AWS::Serverless::Api + Properties: + StageName: Prod + DefinitionBody: + openapi: "3.0" + paths: + /stream: + get: + x-amazon-apigateway-integration: + type: aws_proxy + httpMethod: POST + uri: !Sub 
"arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${MyFunction.Arn}/response-streaming-invocations" + responseTransferMode: STREAM +``` + +### Routing Rules + +```yaml +# Based on https://github.com/aws-samples/serverless-samples/blob/main/apigw-header-routing/template-header-based-routing.yaml +MyRoutingRule: + Type: AWS::ApiGatewayV2::RoutingRule + Properties: + DomainNameArn: !Sub "arn:aws:apigateway:${AWS::Region}::/domainnames/${MyDomain}" + Priority: 100 + Conditions: + - MatchHeaders: + AnyOf: + - Header: X-API-Version + ValueGlob: "v2*" + Actions: + - InvokeApi: + ApiId: !Ref MyV2Api + Stage: prod + StripBasePath: false +``` + +## Key Pitfalls + +For general SAM/CloudFormation pitfalls (circular dependencies, `!Sub` defaults, YAML duplicate keys, build issues, layer nesting, `confirm_changeset`), see the [aws-serverless-deployment skill](../../aws-serverless-deployment/). + +API-Gateway-specific pitfalls: + +1. **Root-level `security` in OpenAPI is ignored**. Set `security` on each operation instead +2. **`$ref` cannot reference external files** in OpenAPI; only internal references (`#/definitions/` for Swagger 2.0, `#/components/schemas/` for OpenAPI 3.0) +3. 
**JSON Schema draft 4 only**: no `discriminator`, `nullable`, `exclusiveMinimum`, no `oneOf`/`anyOf`/`allOf` with `$ref` in same schema + +## VTL Mapping Templates (REST API) + +### Key Variables + +- `$input.body`: Raw request body +- `$input.json('$.jsonpath')`: Extract JSON +- `$input.path('$.jsonpath')`: Extract as object +- `$input.params('name')`: Get parameter by name +- `$input.params().header`, `.querystring`, `.path`: Parameter maps +- `$context.*`: Request context +- `$stageVariables.*`: Stage variables +- `$util.escapeJavaScript()`, `$util.parseJson()`, `$util.urlEncode()`, `$util.base64Encode()`, `$util.base64Decode()` + +### Limits + +- Template size: 300 KB +- `#foreach` iterations: 1,000 + +### Passthrough Behavior + +- `WHEN_NO_MATCH`: Pass through when no template matches Content-Type +- `WHEN_NO_TEMPLATES`: Pass through when no templates defined +- `NEVER`: Reject with 415 Unsupported Media Type + +### Response Override + +```velocity +#set($context.responseOverride.status = 400) +#set($context.responseOverride.header.X-Custom = "value") +``` + +**Gotcha**: Applying override to same parameter twice causes 5XX + +## HTTP API Parameter Mapping + +No VTL. Simple expressions: + +- `$request.header.name`, `$request.querystring.name`, `$request.body.jsonpath`, `$request.path.name` +- `$context.*`, `$stageVariables.*` +- Actions: `overwrite`, `append`, `remove` + +## Binary Data Handling + +### REST API + +- Register binary media types (e.g., `image/png`, `*/*`) +- `contentHandling` on Integration/IntegrationResponse: `CONVERT_TO_BINARY` or `CONVERT_TO_TEXT` +- Lambda proxy: `isBase64Encoded: true` in response; request body arrives as base64 when binary +- Only the first `Accept` media type is honored + +### HTTP API + +- Payload format 2.0: `isBase64Encoded` in request event automatically. 
Lambda returns `isBase64Encoded: true` +- No need to register binary media types explicitly diff --git a/plugins/aws-serverless/skills/api-gateway/references/sam-service-integrations.md b/plugins/aws-serverless/skills/api-gateway/references/sam-service-integrations.md new file mode 100644 index 0000000..3aede58 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/sam-service-integrations.md @@ -0,0 +1,730 @@ +# SAM Service Integration Templates + +Direct AWS service integrations without Lambda. These templates connect API Gateway directly to AWS services using VTL mapping templates (REST API) or WebSocket request templates. + +## Table of Contents + +- [EventBridge](#direct-aws-service-integration-eventbridge) +- [SQS](#direct-aws-service-integration-sqs) +- [DynamoDB Full CRUD](#direct-aws-service-integration-dynamodb-full-crud) + - [Option A: OpenAPI-based definition](#option-a-openapi-based-definition-recommended-for-complex-apis) + - [Option B: Inline CloudFormation methods](#option-b-inline-cloudformation-methods-simpler-for-small-apis) +- [Kinesis Data Streams](#direct-aws-service-integration-kinesis-data-streams) +- [Step Functions (REST API)](#direct-aws-service-integration-step-functions) +- [Step Functions (WebSocket API)](#websocket-api-express-workflow-sync-and-standard-workflow-async-with-callback) + +--- + +## Direct AWS Service Integration (EventBridge) + +Based on [aws-samples/serverless-patterns/apigw-rest-api-eventbridge-sam](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-rest-api-eventbridge-sam). Supports batching multiple events via `#foreach`. 
+ +```yaml +MyCustomEventBus: + Type: AWS::Events::EventBus + Properties: + Name: !Sub "${AWS::StackName}-EventBus" + +ApiGatewayEventBridgeRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: apigateway.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: EBPutEvents + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: events:PutEvents + Resource: !GetAtt MyCustomEventBus.Arn + +EventBridgeIntegration: + Type: AWS::ApiGateway::Method + Properties: + HttpMethod: POST + ResourceId: !GetAtt Api.RootResourceId + RestApiId: !Ref Api + AuthorizationType: NONE + Integration: + Type: AWS + IntegrationHttpMethod: POST + Credentials: !GetAtt ApiGatewayEventBridgeRole.Arn + Uri: !Sub "arn:aws:apigateway:${AWS::Region}:events:action/PutEvents" + PassthroughBehavior: WHEN_NO_TEMPLATES + RequestTemplates: + application/json: !Sub + - |- + #set($context.requestOverride.header.X-Amz-Target = "AWSEvents.PutEvents") + #set($context.requestOverride.header.Content-Type = "application/x-amz-json-1.1") + #set($inputRoot = $input.path('$')) + { + "Entries": [ + #foreach($elem in $inputRoot.items) + { + "Detail": "$util.escapeJavaScript($elem.Detail).replaceAll("\\'","'")", + "DetailType": "$elem.DetailType", + "EventBusName": "${EventBusName}", + "Source": "$elem.Source" + }#if($foreach.hasNext),#end + #end + ] + } + - EventBusName: !Ref MyCustomEventBus + IntegrationResponses: + - StatusCode: "200" + ResponseTemplates: + application/json: '{}' + MethodResponses: + - StatusCode: "200" + ResponseModels: + application/json: Empty +``` + +## Direct AWS Service Integration (SQS) + +Based on [aws-samples/serverless-patterns/apigw-sqs-lambda-iot](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-sqs-lambda-iot). Uses query protocol with `PassthroughBehavior: NEVER` to reject unmatched content types. 
+ +```yaml +SqsIntegration: + Type: AWS::ApiGateway::Method + Properties: + HttpMethod: POST + ResourceId: !Ref MessagesResource + RestApiId: !Ref MyRestApi + AuthorizationType: CUSTOM + AuthorizerId: !Ref MyAuthorizer + RequestValidatorId: !Ref BodyValidator + RequestModels: + application/json: !Ref MessageModel + Integration: + Type: AWS + IntegrationHttpMethod: POST + Uri: !Sub "arn:aws:apigateway:${AWS::Region}:sqs:path/${AWS::AccountId}/${MyQueue.QueueName}" + Credentials: !GetAtt SqsRole.Arn + PassthroughBehavior: NEVER + RequestParameters: + integration.request.header.Content-Type: "'application/x-www-form-urlencoded'" + RequestTemplates: + application/json: "Action=SendMessage&MessageBody=$util.urlEncode($input.body)" + IntegrationResponses: + - StatusCode: "200" + ResponseTemplates: + application/json: '{"messageId": "$input.path(''$.SendMessageResponse.SendMessageResult.MessageId'')"}' + MethodResponses: + - StatusCode: "200" + +SqsRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: apigateway.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: SqsSendMessage + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: sqs:SendMessage + Resource: !GetAtt MyQueue.Arn +``` + +## Direct AWS Service Integration (DynamoDB Full CRUD) + +Full CRUD pattern based on [aws-samples/serverless-patterns](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-dynamodb-lambda-scheduler-ses-auto-deletion-sam). Uses OpenAPI definition with `AWS::Include` for clean separation of API spec and infrastructure. 
+ +### Option A: OpenAPI-based definition (recommended for complex APIs) + +SAM template references an external OpenAPI file: + +```yaml +MyApi: + Type: AWS::Serverless::Api + Properties: + StageName: v1 + EndpointConfiguration: + Type: REGIONAL + DefinitionBody: + 'Fn::Transform': + Name: 'AWS::Include' + Parameters: + Location: './restapi/api.yaml' + MethodSettings: + - ResourcePath: "/*" + HttpMethod: "*" + LoggingLevel: ERROR + +MyTable: + Type: AWS::DynamoDB::Table + Properties: + AttributeDefinitions: + - AttributeName: id + AttributeType: S + KeySchema: + - AttributeName: id + KeyType: HASH + BillingMode: PAY_PER_REQUEST + StreamSpecification: + StreamViewType: NEW_IMAGE + +APIGatewayDynamoDBRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: apigateway.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: DynamoDbCrud + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - dynamodb:GetItem + - dynamodb:UpdateItem + - dynamodb:DeleteItem + - dynamodb:Scan + - dynamodb:Query + Resource: !GetAtt MyTable.Arn +``` + +OpenAPI file (`restapi/api.yaml`), key methods shown: + +```yaml +openapi: "3.0.1" +info: + title: "my-api" +paths: + /items: + ## Create: uses UpdateItem with $context.requestId as auto-generated ID + post: + requestBody: + content: + application/json: + schema: + $ref: "#/components/schemas/item" + required: true + x-amazon-apigateway-request-validator: "Validate body" + x-amazon-apigateway-integration: + credentials: + Fn::GetAtt: [APIGatewayDynamoDBRole, Arn] + httpMethod: "POST" + uri: + Fn::Sub: "arn:aws:apigateway:${AWS::Region}:dynamodb:action/UpdateItem" + requestTemplates: + application/json: + Fn::Sub: | + { + "TableName": "${MyTable}", + "Key": { + "id": {"S": "$context.requestId"} + }, + "UpdateExpression": "set description = :description, #dt = :dt, #email = :email", + 
"ExpressionAttributeValues": { + ":description": {"S": "$input.path('$.description')"}, + ":dt": {"S": "$input.path('$.datetime')"}, + ":email": {"S": "$input.path('$.email')"} + }, + "ExpressionAttributeNames": { + "#dt": "datetime", + "#email": "email" + }, + "ReturnValues": "ALL_NEW" + } + responses: + default: + statusCode: "200" + responseTemplates: + application/json: | + #set($inputRoot = $input.path('$')) + { + "message": "Item created successfully", + "data": { + "id": "$inputRoot.Attributes.id.S", + "description": "$inputRoot.Attributes.description.S", + "datetime": "$inputRoot.Attributes.datetime.S" + } + } + type: "aws" + ## List: Scan with #foreach to build JSON array + get: + x-amazon-apigateway-integration: + credentials: + Fn::GetAtt: [APIGatewayDynamoDBRole, Arn] + httpMethod: "POST" + uri: + Fn::Sub: "arn:aws:apigateway:${AWS::Region}:dynamodb:action/Scan" + requestTemplates: + application/json: + Fn::Sub: | + #set($startKey = $input.params('startKey')) + { + "TableName": "${MyTable}", + "Limit": 25 + #if($startKey != "") + ,"ExclusiveStartKey": {"id": {"S": "$startKey"}} + #end + } + responses: + default: + statusCode: "200" + responseTemplates: + application/json: | + #set($inputRoot = $input.path('$')) + { + "items": [ + #foreach($elem in $inputRoot.Items) + { + "id": "$elem.id.S", + "description": "$elem.description.S", + "datetime": "$elem.datetime.S" + }#if($foreach.hasNext),#end + #end + ] + #if($inputRoot.LastEvaluatedKey.id.S != "") + ,"nextKey": "$inputRoot.LastEvaluatedKey.id.S" + #end + } + type: "aws" + /items/{id}: + ## Read + get: + parameters: + - name: "id" + in: "path" + required: true + schema: + type: "string" + x-amazon-apigateway-integration: + credentials: + Fn::GetAtt: [APIGatewayDynamoDBRole, Arn] + httpMethod: "POST" + uri: + Fn::Sub: "arn:aws:apigateway:${AWS::Region}:dynamodb:action/GetItem" + requestTemplates: + application/json: + Fn::Sub: | + { + "TableName": "${MyTable}", + "Key": { + "id": {"S": 
"$input.params().path.id"} + } + } + responses: + default: + statusCode: "200" + responseTemplates: + application/json: | + #set($inputRoot = $input.path('$')) + { + "id": "$inputRoot.Item.id.S", + "description": "$inputRoot.Item.description.S", + "datetime": "$inputRoot.Item.datetime.S" + } + type: "aws" + ## Delete: returns deleted item via ReturnValues: ALL_OLD + delete: + parameters: + - name: "id" + in: "path" + required: true + schema: + type: "string" + x-amazon-apigateway-integration: + credentials: + Fn::GetAtt: [APIGatewayDynamoDBRole, Arn] + httpMethod: "POST" + uri: + Fn::Sub: "arn:aws:apigateway:${AWS::Region}:dynamodb:action/DeleteItem" + requestTemplates: + application/json: + Fn::Sub: | + { + "TableName": "${MyTable}", + "Key": { + "id": {"S": "$input.params().path.id"} + }, + "ReturnValues": "ALL_OLD" + } + responses: + default: + statusCode: "200" + responseTemplates: + application/json: | + #set($inputRoot = $input.path('$')) + { + "message": "Item deleted successfully", + "data": { + "id": "$inputRoot.Attributes.id.S", + "description": "$inputRoot.Attributes.description.S" + } + } + type: "aws" +components: + schemas: + item: + required: [description, datetime, email] + type: object + properties: + description: + type: string + datetime: + type: string + format: date-time + email: + type: string + format: email +x-amazon-apigateway-gateway-responses: + BAD_REQUEST_BODY: + statusCode: 400 + responseTemplates: + application/json: '{"error": "$context.error.validationErrorString"}' +x-amazon-apigateway-request-validators: + Validate body: + validateRequestParameters: false + validateRequestBody: true +``` + +### Option B: Inline CloudFormation methods (simpler for small APIs) + +```yaml +## Single-method example: use Option A for full CRUD APIs +DynamoDbGetIntegration: + Type: AWS::ApiGateway::Method + Properties: + HttpMethod: GET + ResourceId: !Ref ItemResource + RestApiId: !Ref MyRestApi + AuthorizationType: CUSTOM + AuthorizerId: !Ref 
MyAuthorizer + RequestParameters: + method.request.path.id: true + Integration: + Type: AWS + IntegrationHttpMethod: POST + Uri: !Sub "arn:aws:apigateway:${AWS::Region}:dynamodb:action/GetItem" + Credentials: !GetAtt DynamoDbRole.Arn + RequestTemplates: + application/json: !Sub | + { + "TableName": "${MyTable}", + "Key": { + "id": {"S": "$input.params('id')"} + } + } + IntegrationResponses: + - StatusCode: "200" + ResponseTemplates: + application/json: | + #set($item = $input.path('$.Item')) + { + "id": "$item.id.S", + "data": "$item.data.S", + "createdAt": "$item.createdAt.S" + } + MethodResponses: + - StatusCode: "200" +``` + +## Direct AWS Service Integration (Kinesis Data Streams) + +Based on [aws-samples/serverless-patterns/apigw-kinesis-lambda](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-kinesis-lambda). Uses REST API resources with path parameter for stream name and Lambda as stream consumer. + +```yaml +KinesisStream: + Type: AWS::Kinesis::Stream + Properties: + ShardCount: 1 + +APIGatewayRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: apigateway.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: APIGWKinesisPolicy + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - kinesis:PutRecord + - kinesis:PutRecords + Resource: !Sub "${KinesisStream.Arn}*" + +Api: + Type: AWS::ApiGateway::RestApi + Properties: + Name: apigw-kinesis-api + +## Resources: /streams/{stream-name}/record and /streams/{stream-name}/records +## (resource definitions omitted for brevity) + +## PutRecord: single record ingestion +recordMethodPut: + Type: AWS::ApiGateway::Method + Properties: + RestApiId: !Ref Api + ResourceId: !Ref record + HttpMethod: PUT + AuthorizationType: NONE + Integration: + Type: AWS + Credentials: !GetAtt APIGatewayRole.Arn + IntegrationHttpMethod: POST + Uri: !Sub 
"arn:aws:apigateway:${AWS::Region}:kinesis:action/PutRecord" + PassthroughBehavior: WHEN_NO_TEMPLATES + RequestTemplates: + application/json: | + { + "StreamName": "$input.params('stream-name')", + "Data": "$util.base64Encode($input.json('$.Data'))", + "PartitionKey": "$input.path('$.PartitionKey')" + } + IntegrationResponses: + - StatusCode: "200" + MethodResponses: + - StatusCode: "200" + +## PutRecords: batch ingestion with #foreach +recordsMethodPut: + Type: AWS::ApiGateway::Method + Properties: + RestApiId: !Ref Api + ResourceId: !Ref records + HttpMethod: PUT + AuthorizationType: NONE + Integration: + Type: AWS + Credentials: !GetAtt APIGatewayRole.Arn + IntegrationHttpMethod: POST + Uri: !Sub "arn:aws:apigateway:${AWS::Region}:kinesis:action/PutRecords" + PassthroughBehavior: WHEN_NO_TEMPLATES + RequestTemplates: + application/json: | + { + "StreamName": "$input.params('stream-name')", + "Records": [ + #foreach($elem in $input.path('$.records')) + { + "Data": "$util.base64Encode($elem.data)", + "PartitionKey": "$elem.partition-key" + }#if($foreach.hasNext),#end + #end + ] + } + IntegrationResponses: + - StatusCode: "200" + MethodResponses: + - StatusCode: "200" + +## Lambda consumer triggered by Kinesis stream +LambdaConsumer: + Type: AWS::Serverless::Function + Properties: + Handler: lambda_function.lambda_handler + Runtime: python3.13 + CodeUri: src/ + Policies: + - KinesisStreamReadPolicy: + StreamName: !Ref KinesisStream + Events: + Stream: + Type: Kinesis + Properties: + Stream: !GetAtt KinesisStream.Arn + StartingPosition: LATEST + BatchSize: 100 +``` + +## Direct AWS Service Integration (Step Functions) + +REST API pattern based on [aws-samples/serverless-patterns/apigw-rest-stepfunction](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-rest-stepfunction). WebSocket pattern based on [aws-samples/serverless-samples/apigw-ws-integrations](https://github.com/aws-samples/serverless-samples/tree/main/apigw-ws-integrations). 
+ +### REST API: Standard workflow (async, fire-and-forget with polling) + +```yaml +WaitableStateMachine: + Type: AWS::Serverless::StateMachine + Properties: + DefinitionUri: statemachine/workflow.asl.json + DefinitionSubstitutions: + DDBTable: !Ref StatusTable + Policies: + - DynamoDBWritePolicy: + TableName: !Ref StatusTable + +StatusTable: + Type: AWS::Serverless::SimpleTable + Properties: + PrimaryKey: + Name: Id + Type: String + +ApiGatewayStepFunctionsRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: apigateway.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: CallStepFunctions + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: states:StartExecution + Resource: !Ref WaitableStateMachine + +StartExecutionMethod: + Type: AWS::ApiGateway::Method + Properties: + RestApiId: !Ref Api + ResourceId: !GetAtt Api.RootResourceId + HttpMethod: POST + AuthorizationType: NONE + Integration: + Type: AWS + IntegrationHttpMethod: POST + Credentials: !GetAtt ApiGatewayStepFunctionsRole.Arn + Uri: !Sub "arn:aws:apigateway:${AWS::Region}:states:action/StartExecution" + PassthroughBehavior: WHEN_NO_TEMPLATES + RequestTemplates: + application/json: !Sub + - |- + #set($data = $util.escapeJavaScript($input.json('$'))) + { + "input": "$data", + "stateMachineArn": "${StateMachineArn}" + } + - StateMachineArn: !Ref WaitableStateMachine + IntegrationResponses: + - StatusCode: "200" + ResponseTemplates: + application/json: '' + MethodResponses: + - StatusCode: "200" + ResponseModels: + application/json: Empty +``` + +### WebSocket API: Express workflow (sync) and Standard workflow (async with callback) + +```yaml +## Express workflow: synchronous, returns result to WebSocket client +SyncSFn: + Type: AWS::Serverless::StateMachine + Properties: + Type: EXPRESS + Definition: + StartAt: ProcessRequest + States: + ProcessRequest: + Type: Wait + 
Seconds: 5 + End: true + +SFnSyncRouteIntegration: + Type: AWS::ApiGatewayV2::Integration + Properties: + ApiId: !Ref WebSocketApi + IntegrationType: AWS + IntegrationMethod: POST + IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:states:action/StartSyncExecution" + CredentialsArn: !GetAtt StepFunctionsSyncExecutionRole.Arn + TemplateSelectionExpression: \$default + RequestTemplates: + "$default": + Fn::Sub: > + #set($sfn_input=$util.escapeJavaScript($input.json("$.data")).replaceAll("\\'","'")) + { + "input": "$sfn_input", + "stateMachineArn": "${SyncSFn}" + } + +## Standard workflow: async, pushes result back via @connections API +AsyncSFn: + Type: AWS::Serverless::StateMachine + Properties: + Type: STANDARD + Definition: + StartAt: ProcessRequest + States: + ProcessRequest: + Type: Wait + Seconds: 5 + Next: NotifyClient + NotifyClient: + Type: Task + Resource: arn:aws:states:::apigateway:invoke + Parameters: + ApiEndpoint: !Sub "${WebSocketApi}.execute-api.${AWS::Region}.amazonaws.com" + Method: POST + Stage: !Ref ApiStageName + Path.$: "States.Format('/@connections/{}', $.ConnectionID)" + RequestBody: + Message: Processing complete! 
+ AuthType: IAM_ROLE + End: true + +SFnAsyncRouteIntegration: + Type: AWS::ApiGatewayV2::Integration + Properties: + ApiId: !Ref WebSocketApi + IntegrationType: AWS + IntegrationMethod: POST + IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:states:action/StartExecution" + CredentialsArn: !GetAtt StepFunctionsAsyncExecutionRole.Arn + TemplateSelectionExpression: \$default + RequestTemplates: + "$default": + Fn::Sub: > + #set($sfn_input=$util.escapeJavaScript($input.json("$.data")).replaceAll("\\'","'")) + { + "input": "{\"data\":$sfn_input, \"ConnectionID\":\"$context.connectionId\"}", + "stateMachineArn": "${AsyncSFn}" + } + +## Async workflow role needs @connections permission to push results back +AsyncSFnRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: states.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: APIGWConnectionsAccess + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: execute-api:ManageConnections + Resource: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/${ApiStageName}/POST/@connections/*" +``` diff --git a/plugins/aws-serverless/skills/api-gateway/references/security.md b/plugins/aws-serverless/skills/api-gateway/references/security.md new file mode 100644 index 0000000..7a84ef7 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/security.md @@ -0,0 +1,260 @@ +# Security Best Practices + +## TLS Configuration + +### TLS Security Policies + +- TLS 1.2 is the recommended minimum +- Two naming conventions: legacy policies use `TLS_` prefix (e.g., `TLS_1_0`, `TLS_1_2`), enhanced policies use `SecurityPolicy_` prefix (e.g., `SecurityPolicy_TLS13_1_3_2025_09`) and support TLS 1.3 and post-quantum cryptography +- Edge-optimized endpoints: use CloudFront TLS stack with `_EDGE` suffix policies (e.g., `SecurityPolicy_TLS13_2025_EDGE`). 
Supports TLS 1.3 +- Regional/Private endpoints: use API Gateway TLS stack with date-based suffix policies (e.g., `SecurityPolicy_TLS13_1_2_PQ_2025_09`). Supports TLS 1.3 and post-quantum +- **Endpoint access mode**: `BASIC` (standard) vs `STRICT` (validates SNI matches Host header) +- Migration: Apply enhanced policy with BASIC first, validate with access logs (`$context.tlsVersion`, `$context.cipherSuite`), then switch to STRICT +- STRICT mode takes up to 15 minutes to propagate + +### Disable Default Endpoint + +- Set `disableExecuteApiEndpoint: true` to force all traffic through custom domain +- REST APIs return 403 when disabled; HTTP APIs return 404 (observed behavior; not explicitly documented in the developer guide) +- **Must redeploy after changing this setting** + +## Mutual TLS (mTLS) + +### Setup + +1. Create CA hierarchy (ACM Private CA or self-managed) +2. Export root CA public key to PEM truststore file +3. Upload to versioned S3 bucket (same region as API Gateway) +4. Configure custom domain with truststore URI (`s3://bucket/truststore.pem`) +5. Use public ACM certificate for the API Gateway domain itself +6. Disable default endpoint + +### Automation with ACM Private CA + +- Lambda-backed CloudFormation custom resource concatenates certificate chain and uploads to S3 +- Certificates issued by root CA or any subordinate CA are automatically trusted +- SAM resources: `AWS::ACMPCA::CertificateAuthority`, `AWS::ACMPCA::Certificate`, `AWS::CertificateManager::Certificate` + +### Certificate Revocation Lists (CRL) + +- API Gateway does NOT check CRLs natively +- Implement via Lambda authorizer: + 1. Extract client cert serial number + 2. Check against CRL stored in DynamoDB (fast lookups at scale) + 3. 
Deny access if revoked +- S3 PutObject event triggers preprocessing Lambda: validates CRL signature, decodes ASN.1, stores simplified JSON +- Use function-level caching with S3 ETag for cache invalidation +- **Lambda authorizer caching and CRL checks**: Disable authorizer caching if near-real-time revocation is required. If some latency is acceptable, use a short TTL (e.g., 60s) matching your revocation freshness requirements + +### Propagating Client Identity + +- Lambda authorizer extracts client cert subject: `from cryptography import x509; cert = x509.load_pem_x509_certificate(event['requestContext']['identity']['clientCert']['clientCertPem'].encode())` +- Returns `clientCertSub` (from `cert.subject.rfc4514_string()`) in authorizer `context` +- API Gateway injects via `RequestParameters: 'integration.request.header.X-Client-Cert-Sub': 'context.authorizer.clientCertSub'` +- Backend receives client identity without performing mTLS validation itself + +### CloudFront Viewer mTLS + +CloudFront now supports mTLS authentication between viewers (clients) and CloudFront edge locations. This enables mTLS for any origin, including API Gateway HTTP APIs and WebSocket APIs that don't natively support mTLS. + +**Architecture**: Client <-> mTLS <-> CloudFront <-> API Gateway (any type) + +**Setup**: + +1. Upload root CA and intermediate CA certificates (PEM bundle) to S3 +2. Create CloudFront Trust Store referencing the S3 path +3. Enable "Viewer mutual authentication (mTLS)" on distribution settings +4. 
Associate the trust store with the distribution + +**Modes**: + +- **Required**: All clients must present valid certificates +- **Optional**: Accepts both mTLS and non-mTLS clients on the same distribution; still rejects invalid certificates + +**Certificate headers forwarded to origin** (enable in origin request policy): + +- `CloudFront-Viewer-Cert-Serial-Number`, `CloudFront-Viewer-Cert-Issuer`, `CloudFront-Viewer-Cert-Subject` +- `CloudFront-Viewer-Cert-Validity`, `CloudFront-Viewer-Cert-PEM`, `CloudFront-Viewer-Cert-Present`, `CloudFront-Viewer-Cert-SHA256` +- Use these headers in Lambda authorizers or backend logic for identity-based access control + +**Certificate revocation**: Use CloudFront Connection Functions + KeyValueStore for real-time CRL checks during TLS handshake: + +- Store revoked serial numbers in KeyValueStore +- Connection function checks `connection.clientCertificate.certificates.leaf.serialNumber` against KVS +- Call `connection.deny()` for revoked certificates; rejection happens at the edge before any request reaches the origin + +**Connection functions**: Execute custom logic during TLS handshake (before viewer request). Can allow, deny, or log connection details. Use `connection.logCustomData()` for custom connection log entries. + +**Monitoring**: CloudFront connection logs capture TLS handshake details. Each connection gets a unique `connectionId` that correlates across connection logs, standard logs, and real-time logs. 
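
The certificate headers listed above can be consumed by a Lambda authorizer behind CloudFront. The following is a minimal sketch, assuming an HTTP API Lambda authorizer that uses the simple response format (`isAuthorized` plus optional `context`) and payload format 2.0, where header names arrive lowercased. The `ALLOWED_SUBJECTS` allowlist and the sample subject string are hypothetical, and the exact subject string CloudFront forwards should be confirmed against your own requests. The header is trustworthy only if the API cannot be reached except through the CloudFront distribution.

```python
from typing import Any, Dict

# Hypothetical allowlist of client certificate subjects (illustration only).
ALLOWED_SUBJECTS = {"CN=client-a.example.com"}

# Header name from the list above, lowercased for lookup.
SUBJECT_HEADER = "cloudfront-viewer-cert-subject"


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Authorize a request based on the certificate subject forwarded by CloudFront."""
    # Normalize header names so the lookup works regardless of casing.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    subject = headers.get(SUBJECT_HEADER, "")

    authorized = subject in ALLOWED_SUBJECTS
    response: Dict[str, Any] = {"isAuthorized": authorized}
    if authorized:
        # Pass the identity downstream, mirroring the clientCertSub pattern
        # described under "Propagating Client Identity".
        response["context"] = {"clientCertSub": subject}
    return response
```

The exact-match allowlist is deliberately strict; a real deployment would more likely look subjects up in a datastore and cache the result.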
+ +**Options**: + +- `Ignore certificate expiration date`: Accept expired certificates (useful for gradual migration) +- `Advertise trust store CA names`: Send list of accepted CA distinguished names to clients +- Certificate chain depth supported: up to 4 (root + intermediates) + +**When to use CloudFront viewer mTLS instead of API Gateway mTLS**: + +- HTTP APIs or WebSocket APIs that don't support native mTLS +- Need edge-based certificate validation (lower latency for global clients) +- Need optional mode (mixed mTLS and non-mTLS on same endpoint) +- Need Connection Functions for custom TLS handshake logic +- Need real-time CRL checks via KeyValueStore (faster than Lambda authorizer approach) + +### Private APIs + mTLS + +- Private APIs do not natively support mTLS +- Pattern: ALB with mTLS "Verify with trust store" mode -> VPC endpoint -> private API +- ALB supports CRL checks on trust store +- Cross-account: client -> PrivateLink -> NLB -> ALB (mTLS) -> VPC endpoint -> private API + +## Resource Policies (REST API) + +### IP-Based Access Control + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": "*", + "Action": "execute-api:Invoke", + "Resource": "execute-api:/*" + }, + { + "Effect": "Deny", + "Principal": "*", + "Action": "execute-api:Invoke", + "Resource": "execute-api:/*", + "Condition": { "NotIpAddress": { "aws:SourceIp": ["203.0.113.0/24", "198.51.100.0/24"] } } + } + ] +} +``` + +- `execute-api:/*` is a shorthand accepted only in API Gateway resource policies; API Gateway expands it to the full ARN (`arn:aws:execute-api:region:account-id:api-id/*`). 
Do not use this shorthand in IAM identity-based policies +- For traffic through VPC endpoint: use `aws:VpcSourceIp` instead of `aws:SourceIp` + +### HTTP API IP Control + +- No resource policies for HTTP APIs +- Use Lambda authorizer checking `event.requestContext.http.sourceIp` against allowlist +- Support both exact IPs and CIDR ranges + +### VPC Endpoint Restriction + +```json +{ "Condition": { "StringEquals": { "aws:SourceVpce": "vpce-0123456789abcdef0" } } } +``` + +## AWS WAF (REST API Direct; HTTP API via CloudFront) + +- **REST API**: Associate Web ACL directly with API stage (no CloudFront required) +- **HTTP API**: No direct WAF association. Workaround: Place CloudFront distribution in front of HTTP API and attach WAF to CloudFront. This is a common production pattern +- Gateway response type `WAF_FILTERED` (403) when WAF blocks request +- Access log variables: `$context.waf.error`, `$context.waf.latency`, `$context.waf.status` +- **Body inspection limit**: Default 16 KB for regional API Gateway (configurable up to 64 KB in web ACL for additional cost) +- **Header inspection limit**: First 8 KB or 200 headers (whichever comes first) +- **"Distributed Denial of Wallet" protection**: Without WAF, DDoS traffic invokes Lambda authorizers for every request. WAF blocks malicious traffic before it reaches the authorizer + +### Recommended Managed Rule Groups for APIs + +**Tier 1: Always enable for API protection.** + +| Rule Group | Name | WCU | Purpose | +| ---------------- | --------------------------------------- | --- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Core Rule Set | `AWSManagedRulesCommonRuleSet` | 700 | XSS, LFI, RFI, SSRF, path traversal, size restrictions. 
**Note**: `SizeRestrictions_BODY` rule blocks bodies >8 KB; override to Count if your API accepts larger payloads | +| Known Bad Inputs | `AWSManagedRulesKnownBadInputsRuleSet` | 200 | Log4j RCE, Java deserialization, PROPFIND, exploitable paths, localhost Host header | +| SQL Database | `AWSManagedRulesSQLiRuleSet` | 200 | SQL injection in query params, body, cookies, URI path | +| IP Reputation | `AWSManagedRulesAmazonIpReputationList` | 25 | Blocks IPs from AWS threat intelligence (MadPot): malicious actors, reconnaissance, DDoS sources | + +**Tier 2: Enable based on use case.** + +| Rule Group | Name | WCU | When to use | +| --------------------------- | --------------------------------------- | --- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Anonymous IP List | `AWSManagedRulesAnonymousIpList` | 50 | Block TOR nodes, anonymous proxies, VPN services, hosting provider IPs. Use when API should not be accessed anonymously | +| Bot Control | `AWSManagedRulesBotControlRuleSet` | 50 | Detect/block scrapers, automated browsers, bad bots. Common ($1/M requests) for basic detection; Targeted ($10/M requests) adds ML-based detection (same WCU, higher per-request cost). **Note**: `CategoryAI` rule blocks all AI bot traffic (both verified and unverified) unlike other category rules; override to Count and use labels (`bot:category:ai:verified` vs `unverified`) for fine-grained control | +| Admin Protection | `AWSManagedRulesAdminProtectionRuleSet` | 100 | Block access to admin URI paths. 
Use if API has admin endpoints | +| Account Takeover Prevention | `AWSManagedRulesATPRuleSet` | 50 | Credential stuffing protection on login endpoints. Checks against stolen credential databases. Additional fees apply | + +**Tier 3: Custom rules for API-specific protection.** + +| Rule Type | Purpose | +| --------------------- | ----------------------------------------------------------------- | +| Rate-based rules | Per-IP request rate limiting (complements API Gateway throttling) | +| Geo-match rules | Block traffic from regions where you have no customers | +| IP set rules | Allow/deny specific IP ranges | +| Size constraint rules | Custom payload size limits per endpoint | +| Regex pattern rules | Block specific patterns in headers/body (e.g., malformed JWTs) | + +### WAF Best Practices for APIs + +- **Start in Count mode**: Deploy managed rules in Count mode first, analyze CloudWatch metrics and WAF logs, then switch to Block +- **Use labels for custom logic**: Managed rules add labels to requests. Write custom rules that match on labels for fine-grained control (e.g., allow verified bots but block unverified ones) +- **Override specific rules**: If a managed rule causes false positives, override that single rule to Count rather than disabling the entire rule group +- **Web ACL capacity**: Default allocation is 1,500 WCU per web ACL (hard limit 5,000 via support request). WCU above 1,500 incurs additional cost ($0.20 per million requests per 500 WCU block). Plan rule group selection within this budget. Tier 1 rules alone total 1,125 WCU +- **Scope down statements**: Apply expensive rule groups (Bot Control, ATP) only to specific URI paths to reduce cost and false positives +- **WAF + API Gateway throttling**: WAF rate-based rules operate at the edge; API Gateway throttling operates at the service level. 
Use both for defense in depth +- **WAF token domain**: When CloudFront fronts a WAF-protected API Gateway, CloudFront rewrites the `Host` header to the origin domain. WAF challenge/CAPTCHA tokens are tied to the client-facing domain, causing a domain mismatch at the origin. Fix: add the CloudFront distribution domain to the token domain list in the origin's web ACL, and forward the `aws-waf-token` cookie to the origin + +## CORS Configuration + +### REST API + +- Configure `OPTIONS` method with Mock integration returning CORS headers +- Lambda must ALSO return CORS headers in actual response (proxy integration) +- **Critical**: Add CORS headers to `DEFAULT_4XX` and `DEFAULT_5XX` gateway responses; otherwise errors are blocked by browser +- SAM `Cors` property helps but does NOT cover gateway responses +- Only one origin allowed in MOCK response; use `*` or Lambda integration for dynamic origin. **Security warning**: dynamically reflecting the `Origin` header without validating against an allowlist is functionally equivalent to `*` but also works with credentials. Always validate against an explicit allowlist +- Add `AddDefaultAuthorizerToCorsPreflight: false` to exclude authorizer from OPTIONS + +### HTTP API + +- First-class `CorsConfiguration` property: `allowOrigins`, `allowMethods`, `allowHeaders`, `exposeHeaders`, `maxAge`, `allowCredentials` +- API Gateway automatically handles OPTIONS preflight; no MOCK integration needed +- `*` for AllowOrigins does not work when `AllowCredentials` is true + +### Common Gotchas + +- Forgetting CORS headers on gateway responses means 403/500 errors are blocked by browser +- For private APIs: avoid `x-apigw-api-id` header (triggers preflight that fails); use Host header instead +- `Access-Control-Allow-Origin` must match requesting origin exactly when using credentials + +## HttpOnly Cookie Authentication + +Pattern for preventing XSS token theft: + +1. 
**OAuth2 callback Lambda**: Exchanges auth code for access token, returns via `Set-Cookie` header with `HttpOnly`, `Secure`, `SameSite=Lax` +2. **Lambda authorizer**: Extracts cookie from request, validates JWT, allows/denies +3. Identity source: `$request.header.cookie` for caching +4. Use `aws-jwt-verify` library; create verifier outside handler for JWKS caching across warm starts + +## API Gateway Architecture + +- API Gateway is a **multi-tenant managed service** running in an AWS-managed VPC; nothing is deployed into customer VPCs +- This is why VPC Links exist: they bridge the gap between the AWS-managed VPC (where API Gateway runs) and your VPC (where private resources live) +- API Gateway has internet connectivity by default, which is why it can reach external endpoints even when your VPC cannot + +## Security at All Layers + +Apply security at every component, not just the data plane (who can call APIs). Also secure the **control plane** (who can modify/deploy APIs). + +## Cache Encryption (REST API) + +- API Gateway cache is **not encrypted at rest by default**; encryption must be explicitly enabled when provisioning the cache +- **Cache is not isolated per API key by default**: one client can receive cached responses generated for another client if the cache key parameters (path, query strings, configured headers) match. To prevent this, add a client-identifying parameter (e.g., API key header) as a cache key +- For APIs handling sensitive data, always enable cache encryption +- Cache encryption can only be set at cache creation time; changing it requires re-provisioning + +## Logging Data Sensitivity + +- Standard execution logging auto-redacts authorization headers, API key values, and similar sensitive parameters +- **Data tracing** (separate option from execution logging) logs full request/response bodies including PII and credentials. 
AWS recommends against enabling data tracing in production +- Treat CloudWatch log groups containing execution logs as sensitive data: apply appropriate IAM access controls, retention policies, and CloudWatch Logs data protection policies + +## Binary Data Security + +- Register binary media types explicitly (e.g., `image/png`, `application/octet-stream`) +- Avoid `*/*` in binary media types unless intentional, as it treats ALL content as binary +- For file uploads via S3: use presigned URLs for large files instead of proxying through API Gateway (10 MB payload limit) diff --git a/plugins/aws-serverless/skills/api-gateway/references/service-integrations.md b/plugins/aws-serverless/skills/api-gateway/references/service-integrations.md new file mode 100644 index 0000000..19a88a6 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/service-integrations.md @@ -0,0 +1,191 @@ +# AWS Service Integrations + +API Gateway integrates directly with AWS services without Lambda. Two implementation approaches: + +- **REST API and WebSocket API**: Use `Type: AWS` with VTL mapping templates for full request/response transformation. Supports any AWS service action. Both synchronous (wait for response) and asynchronous (fire-and-forget) patterns +- **HTTP API**: Uses first-class integrations (`Type: AWS_PROXY` with `IntegrationSubtype`) and parameter mapping instead of VTL. Supported subtypes: `EventBridge-PutEvents`, `SQS-SendMessage`, `SQS-ReceiveMessage`, `SQS-DeleteMessage`, `SQS-PurgeQueue`, `Kinesis-PutRecord`, `StepFunctions-StartExecution`, `StepFunctions-StartSyncExecution`, `StepFunctions-StopExecution`, `AppConfig-GetConfiguration`. For services not in this list (DynamoDB, SNS, S3), use Lambda proxy instead. See [Integration subtype reference](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-develop-integrations-aws-services-reference.html) + +The patterns below cover the most commonly used service integrations. 
REST API `Type: AWS` can integrate with any AWS service that exposes an HTTP API; the same approach (URI, IAM role, VTL mapping) applies to services not listed here. For HTTP API first-class integrations, the same services apply but use parameter mapping (`$request.body`, `$request.header`, `$request.path`, `$context`) instead of VTL. + +## EventBridge Integration + +Integrates directly with EventBridge PutEvents API (see [aws-samples/serverless-patterns/apigw-rest-api-eventbridge-sam](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-rest-api-eventbridge-sam)). For a complete SAM template, see [SAM Service Integration Templates — EventBridge](sam-service-integrations.md#direct-aws-service-integration-eventbridge). + +- Use `Type: AWS` integration with URI `arn:aws:apigateway:{region}:events:action/PutEvents` +- Set required headers via `RequestParameters` (e.g., `integration.request.header.X-Amz-Target: "'AWSEvents.PutEvents'"`, `integration.request.header.Content-Type: "'application/x-amz-json-1.1'"`). Alternative: set via VTL `$context.requestOverride.header` in the mapping template, but avoid applying the same header in both places ([double-application causes 5XX](https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-override-request-response-parameters.html)) +- VTL mapping template transforms HTTP requests into EventBridge events. Use `#foreach` to batch multiple events from a single API call +- **Input escaping**: Use `$util.escapeJavaScript($elem.Detail).replaceAll("\\'","'")`, since `escapeJavaScript` over-escapes single quotes which breaks JSON; the `replaceAll` corrects this +- Supports custom event buses (not just default); pass bus name via `!Sub` in the mapping template +- Lambda authorizer can return `custom:clientId` to enrich events with caller identity +- Request validation via API Gateway models +- Use `PassthroughBehavior: NEVER` or `WHEN_NO_TEMPLATES` to reject unmatched content types. 
Avoid the default `WHEN_NO_MATCH` which passes malformed payloads through to the backend service +- EventBridge rules route events to Kinesis Data Firehose, Lambda, or API destinations +- **Gotcha**: EventBridge does not add newlines between records when forwarding to Firehose; use `\n` in InputTemplate +- **Gotcha**: `PutEvents` can succeed (200) with `FailedEntryCount > 0` (partial failures are silent). Check `$input.path('$.FailedEntryCount')` in the response template and return an error status if non-zero + +## SQS Integration (Async Buffer) + +Integrates directly with SQS SendMessage API to decouple producers from consumers (see [aws-samples/serverless-patterns/apigw-sqs-lambda-iot](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-sqs-lambda-iot)). For a complete SAM template, see [SAM Service Integration Templates — SQS](sam-service-integrations.md#direct-aws-service-integration-sqs). + +- Use `Type: AWS` integration with URI `arn:aws:apigateway:{region}:sqs:path/{account-id}/{queue-name}` +- Two protocol options: + - **AWS query protocol**: Set `Content-Type: 'application/x-www-form-urlencoded'` header in integration request. Mapping template: `Action=SendMessage&MessageBody=$util.urlEncode($input.body)` + - **AWS JSON protocol**: Set `Content-Type: 'application/x-amz-json-1.0'` and `X-Amz-Target: 'AmazonSQS.SendMessage'` headers. Pass JSON body with `QueueUrl` and `MessageBody` +- Set `PassthroughBehavior: NEVER` to reject requests that don't match any mapping template (returns 415 Unsupported Media Type) +- **Always use `$util.urlEncode()`** with the query protocol; special characters in the message body cause `AccessDenied` errors without encoding +- **FIFO queues**: Append `MessageGroupId` to the mapping template: `Action=SendMessage&MessageGroupId=$context.extendedRequestId&MessageBody=$util.urlEncode($input.body)`. 
Enable content-based deduplication on the queue, or add `MessageDeduplicationId` explicitly +- **KMS-encrypted queues**: IAM execution role needs `kms:GenerateDataKey` and `kms:Decrypt` on the queue's KMS key, otherwise `KMS.AccessDeniedException` +- Lambda consumer processes messages from SQS at its own pace, with built-in retry and DLQ support + +## SNS Integration (Fan-Out / Pub-Sub) + +Integrates directly with SNS Publish API for fan-out to multiple subscribers (see [aws-samples/serverless-patterns/apigw-websocket-api-sns](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-websocket-api-sns)): + +- Use `Type: AWS` integration with URI `arn:aws:apigateway:{region}:sns:action/Publish` +- Uses AWS query protocol: Set `Content-Type: 'application/x-www-form-urlencoded'` header in integration request +- VTL mapping template: `Action=Publish&TopicArn=$util.urlEncode("${TopicArn}")&Message=$util.urlEncode(...)`; always URL-encode both the TopicArn and Message +- Enrich messages with API Gateway context: `$context.connectionId` (WebSocket), `$context.requestTimeEpoch`, `$context.identity.sourceIp` +- **Request validation**: Use API Gateway models to validate the message body before publishing, ensuring only well-formed messages reach SNS +- Subscribers receive messages independently: Lambda, SQS, HTTP/S endpoints, email, SMS. One API call fans out to all +- **KMS-encrypted topics**: IAM execution role needs `kms:GenerateDataKey` and `kms:Decrypt` on the topic's KMS key +- **Message attributes**: Use indexed query parameters for subscription filter policies: `MessageAttributes.entry.1.Name=eventType&MessageAttributes.entry.1.Value.DataType=String&MessageAttributes.entry.1.Value.StringValue=...`. 
Required for SNS subscription filtering in fan-out architectures +- **FIFO topics**: Append `MessageGroupId` (required) and optionally `MessageDeduplicationId` to the mapping template, same pattern as SQS FIFO queues +- IAM execution role needs `sns:Publish` scoped to the specific topic ARN + +## DynamoDB Integration (Write-Through with Streams) + +Integrates directly with DynamoDB APIs for full CRUD without Lambda. For complete SAM templates (OpenAPI-based and inline), see [SAM Service Integration Templates — DynamoDB Full CRUD](sam-service-integrations.md#direct-aws-service-integration-dynamodb-full-crud). + +- Use `Type: AWS` integration with URI `arn:aws:apigateway:{region}:dynamodb:action/{action}` (supports `GetItem`, `PutItem`, `UpdateItem`, `DeleteItem`, `Query`, and `Scan`) +- VTL mapping template transforms HTTP request into DynamoDB JSON format: + - **Request template**: Maps request body/parameters to DynamoDB item attributes with type descriptors (`S`, `N`, `M`, `L`, etc.) + - **Response template**: Extracts DynamoDB response attributes into clean JSON for the client using `$input.path('$.Item.attribute.S')` +- **Full CRUD pattern** (see [aws-samples/serverless-patterns/apigw-dynamodb-lambda-scheduler-ses-auto-deletion-sam](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-dynamodb-lambda-scheduler-ses-auto-deletion-sam)): + - **Create**: Use `UpdateItem` (not `PutItem`) with `$context.requestId` as auto-generated ID, eliminating client-side ID generation. Set `ReturnValues: ALL_NEW` to return the created item in the response + - **Read**: `GetItem` with key from path parameter `$input.params().path.id` + - **Update**: `UpdateItem` with `UpdateExpression` and `ExpressionAttributeValues` mapped from request body. 
Use `ReturnValues: ALL_NEW` to return the updated item + - **Delete**: `DeleteItem` with `ReturnValues: ALL_OLD` to return the deleted item for confirmation + - **List**: `Scan` with `#foreach` loop in response template to build JSON array from `$inputRoot.Items`. **Always include a `Limit` parameter** in the Scan template (e.g., `"Limit": 25`) to cap items per request; without it, a single API call can scan the entire table, consuming all provisioned capacity. Support pagination via `ExclusiveStartKey` mapped from a query parameter +- **Request validation**: Use `x-amazon-apigateway-request-validator` with OpenAPI schemas to validate request bodies at the gateway before reaching DynamoDB +- **OpenAPI-based definition**: Define the full API in a separate OpenAPI file and include it via the `AWS::Include` transform in the SAM template, keeping the API definition clean and separating the API spec from infrastructure +- For async event processing, enable **DynamoDB Streams** on the table: + - Stream triggers Lambda function on every insert/update/delete + - Lambda processes changes asynchronously (enrichment, notifications, scheduling, cross-service sync) + - Example chain: API Gateway → DynamoDB (Streams) → Lambda → EventBridge Scheduler → SES for scheduled email reminders + - Stream view type options: `KEYS_ONLY`, `NEW_IMAGE`, `OLD_IMAGE`, `NEW_AND_OLD_IMAGES`; choose based on what the processor needs +- **Gotcha**: DynamoDB reserved keywords (`datetime`, `email`, `status`, `name`, `type`, `data`, etc.) require `ExpressionAttributeNames` with `#placeholder` syntax. This applies in VTL templates too, not just SDK calls +- **Security**: Never interpolate request input into DynamoDB expression strings (`UpdateExpression`, `FilterExpression`, `ProjectionExpression`). Hardcode expression structures in VTL and only map user input into `ExpressionAttributeValues` (`:placeholder` values).
Interpolating into expressions allows callers to inject additional clauses that expose unintended attributes or modify other items +- Canary deployments for DynamoDB service integrations are managed at the API Gateway stage level (no Lambda alias to canary) + +## Kinesis Data Streams Integration (High-Throughput Ingestion) + +Integrates directly with Kinesis Data Streams PutRecord/PutRecords APIs for high-volume data ingestion (see [aws-samples/serverless-patterns/apigw-kinesis-lambda](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-kinesis-lambda)). For a complete SAM template, see [SAM Service Integration Templates — Kinesis Data Streams](sam-service-integrations.md#direct-aws-service-integration-kinesis-data-streams). + +- Use `Type: AWS` integration with URI `arn:aws:apigateway:{region}:kinesis:action/{action}` (e.g., `action/PutRecord`, `action/PutRecords`) +- Use `PassthroughBehavior: WHEN_NO_TEMPLATES` to ensure requests are only accepted when a matching mapping template exists +- VTL mapping template constructs the Kinesis payload: + - `StreamName`: Target stream; can be hardcoded via `!Sub` or dynamic via path parameter (`$input.params('stream-name')`) + - `Data`: Base64-encoded record data. Use `$util.base64Encode()` in VTL. Encode the full body (`$input.body`) or a specific JSON field (`$input.json('$.Data')`) + - `PartitionKey`: Determines shard placement. 
Use a high-cardinality value from the request body (e.g., client ID) for even distribution across shards, or `$context.requestId` as a simple default for uniform distribution +- **PutRecords** for batching: VTL `#foreach` loop transforms an array of items in the request body into multiple Kinesis records in a single API call, reducing round-trips +- **KMS-encrypted streams**: IAM execution role needs `kms:GenerateDataKey` and `kms:Decrypt` on the stream's KMS key +- Downstream consumers: Lambda (event source mapping), Kinesis Data Firehose (delivery to S3/Redshift/OpenSearch), Kinesis Data Analytics, or custom KCL applications +- **Shard limits**: Each shard supports 1,000 records/s or 1 MB/s for writes and 2 MB/s for reads. Plan shard count based on expected ingestion rate. Use on-demand capacity mode to auto-scale shards +- **Gotcha**: `PutRecord` response includes `ShardId` and `SequenceNumber`; map these in the response template for client-side tracking if needed +- **Gotcha**: `PutRecords` can succeed (200) with `FailedRecordCount > 0`. Map `$input.path('$.FailedRecordCount')` in the response template so clients know to retry failed records + +## Step Functions Integration (Workflow Orchestration) + +Integrates directly with Step Functions to orchestrate multi-step workflows without Lambda glue code. For complete SAM templates (REST and WebSocket), see [SAM Service Integration Templates — Step Functions](sam-service-integrations.md#direct-aws-service-integration-step-functions). + +**REST API → Step Functions** (see [aws-samples/serverless-patterns/apigw-rest-stepfunction](https://github.com/aws-samples/serverless-patterns/tree/main/apigw-rest-stepfunction)): + +- Two execution modes available: + - **Asynchronous** (Standard workflow): `action/StartExecution`, which returns execution ARN immediately. Client does not wait for workflow completion. 
IAM role needs `states:StartExecution` + - **Synchronous** (Express workflow): `action/StartSyncExecution`, which waits for workflow to complete and returns the result in the response. IAM role needs `states:StartSyncExecution`. Must complete within the API Gateway integration timeout (29s default, up to 300s for Regional/Private) +- VTL mapping template passes the request body as workflow input and the state machine ARN: + + ```velocity + #set($data = $util.escapeJavaScript($input.json('$')).replaceAll("\\'","'")) + { + "input": "$data", + "stateMachineArn": "${StateMachineArn}" + } + ``` + +**WebSocket API → Step Functions** (see [aws-samples/serverless-samples/apigw-ws-integrations](https://github.com/aws-samples/serverless-samples/tree/main/apigw-ws-integrations)): + +- Two execution modes via custom routes matched by `routeSelectionExpression`: + - **Synchronous** (Express workflow): `action/StartSyncExecution`, which waits for workflow to complete and returns the result directly to the WebSocket client. **Constrained by the 29-second WebSocket API integration timeout**, not the 5-minute Express workflow maximum. Workflows exceeding 29 seconds will time out at the API Gateway level. Use for short-lived workflows where the client needs the result immediately + - **Asynchronous** (Standard workflow): `action/StartExecution`, which returns the execution ARN immediately. Workflow pushes results back to the WebSocket client via the `@connections` Management API (`POST https://{api-id}.execute-api.{region}.amazonaws.com/{stage}/@connections/{connectionId}`) using an HTTP task state or Lambda task. 
Pass `$context.connectionId` in the input so the workflow knows which client to notify +- VTL escaping for WebSocket: `$util.escapeJavaScript($input.json("$.data")).replaceAll("\\'","'")` (the `replaceAll` handles single quotes that `escapeJavaScript` over-escapes) +- WebSocket `$connect`/`$disconnect` routes can use direct DynamoDB integration (PutItem/DeleteItem) for connection tracking without Lambda +- IAM roles: `states:StartSyncExecution` for Express, `states:StartExecution` for Standard. Async workflow role also needs `execute-api:ManageConnections` to call back the WebSocket client + +**Express vs Standard workflows**: + +- **Express** (sync): Max 5 minutes, at-least-once execution, lower cost for high-volume short tasks. Good for synchronous REST/WebSocket responses within the API Gateway timeout +- **Standard** (async): Max 1 year, exactly-once execution, full execution history. Good for long-running orchestrations that push results via callback, webhook, or polling + +**Lambda durable functions as alternative**: Durable functions are invoked as regular Lambda integrations (proxy or custom) — no `Type: AWS` service integration or VTL needed. See the [aws-lambda-durable-functions skill](../../aws-lambda-durable-functions/) for details. + +## S3 Integration (File Storage Proxy) + +Acts as an S3 proxy for file upload, download, and listing without Lambda (see [Developer Guide tutorial](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-s3.html)): + +- Use `Type: AWS` integration with `Action type: Use path override`. API Gateway forwards requests to S3 REST API path-style (`s3-host-name/{bucket}/{key}`) +- **Resource structure**: `/{folder}` maps to S3 bucket, `/{folder}/{item}` maps to S3 object. 
Map path parameters in integration request: `method.request.path.folder` → `{bucket}`, `method.request.path.item` → `{object}` +- **Operations**: GET on `/` lists buckets, GET on `/{folder}` lists objects in a bucket, GET on `/{folder}/{item}` downloads an object, PUT on `/{folder}/{item}` uploads an object +- **Binary files** (images, PDFs, etc.): Register media types in `binaryMediaTypes` (e.g., `image/png`), add `Accept` (download) and `Content-Type` (upload) headers to the method request, leave `contentHandling` unset (passthrough behavior); no mapping template for binary content types +- **Payload limit**: 10 MB max through API Gateway. For larger files, generate S3 presigned URLs via Lambda and have the client upload/download directly to S3 (presigned URLs cannot be generated from a direct service integration; Lambda is required) +- Response header mapping: Map `integration.response.header.Content-Type`, `integration.response.header.Content-Length`, and `integration.response.header.Date` to method response headers for proper content delivery +- S3 objects with `/` or special characters in the key must be URL-encoded in the request path (e.g., `test/test.txt` → `test%2Ftest.txt`) +- IAM execution role needs S3 permissions (`s3:GetObject`, `s3:PutObject`, `s3:ListBucket`) scoped to the specific bucket(s) + +## HTTP Integration (Proxy to HTTP Endpoints) + +Forwards requests to any HTTP-accessible endpoint: ALB, NLB, ECS, EC2, on-premises servers, or external third-party APIs: + +- **Two modes**: + - `HTTP_PROXY`: Passes request through to the backend as-is and returns the backend response directly to the client. Minimal configuration, no VTL templates. Available on both REST and HTTP APIs + - `HTTP` (non-proxy): Allows VTL mapping templates to transform request and response. REST API only +- **VPC Link** for private backends: Use `connectionType: VPC_LINK` to reach ALB, NLB, or Cloud Map services inside a VPC without exposing them to the internet. 
VPC Link v2 supports REST and HTTP APIs (targets ALB and NLB); WebSocket API uses VPC Link v1 (NLB only) +- **Path and parameter passthrough**: Map URL path parameters, query strings, and headers from the method request to the integration request. Use `{proxy+}` greedy path parameter for catch-all routing +- **TLS to backend**: API Gateway validates the backend's TLS certificate by default. If the backend uses a self-signed or private CA certificate, set `insecureSkipVerification: true` on the integration (testing/development only; not recommended for production). Provide the full certificate chain on the backend for proper validation +- **Timeouts**: Configure `timeoutInMillis` on the integration (50ms–29s for REST, 30s hard limit for HTTP API). For backends that may exceed this, consider async patterns +- **Connection reuse**: API Gateway reuses connections to HTTP backends by default for lower latency on subsequent requests +- **No automatic retries**: API Gateway does not retry failed HTTP integration requests. If the backend returns 5xx or the connection times out, the error is returned directly to the client. Implement retry logic on the client side or use SQS as a buffer + +## Mock Integration (No Backend) + +Returns responses directly from API Gateway without calling any backend: + +- Use `Type: MOCK` (no integration URI, no IAM role, no backend needed) +- **Health check endpoints**: Return 200 on `/health` for load balancer or monitoring checks +- **CORS preflight**: Handle `OPTIONS` requests with appropriate CORS headers without invoking Lambda +- **API prototyping**: Define request/response contracts before backends are built. 
Consumers can develop against the mock +- **Static responses**: Return fixed JSON/XML based on request parameters using VTL mapping templates +- VTL request template must return `{"statusCode": 200}` (or appropriate code) to set the integration response status +- Map different status codes using `IntegrationResponses` with selection patterns + +## Common Patterns + +- **IAM execution role**: Every direct service integration requires an IAM role with the specific action permission (e.g., `sqs:SendMessage`, `dynamodb:PutItem`, `events:PutEvents`, `kinesis:PutRecord`, `states:StartExecution`). Pass the role ARN in the integration `Credentials` field +- **Request validation at the gateway**: Use API Gateway request validators (models) to reject invalid requests before they reach the backend service, reducing cost and protecting downstream services +- **Response mapping**: Transform raw AWS service responses into clean API responses using VTL response templates. Map HTTP status codes for error cases in `IntegrationResponses` +- **Lambda invocations support sync and async**: API Gateway Lambda integrations default to synchronous invocation (wait for response). For asynchronous invocation (fire-and-forget, returns 200 immediately while Lambda processes in background), set `X-Amz-Invocation-Type: 'Event'` in the integration request HTTP headers (see [re:Post guide](https://repost.aws/knowledge-center/api-gateway-invoke-lambda)). Async invocation supports Lambda's built-in retry (up to 2 retries) and dead-letter queues. REST API supports this natively via non-proxy integration; HTTP API only supports proxy integrations for Lambda, so `X-Amz-Invocation-Type` cannot be set there; use a proxy Lambda that invokes the target Lambda asynchronously via the SDK +- **Prevent backend bypass (zero trust)**: Ensure backends can only be reached through API Gateway, not invoked or accessed directly.
Apply defense in depth per integration type: + - **Lambda**: Restrict Lambda resource policies to allow invocations only from the API Gateway source ARN + - **VPC Link targets (ALB/NLB)**: Use security groups on the load balancer to accept traffic only from the VPC Link's ENIs, not from arbitrary sources + - **HTTP integrations**: Use mutual TLS, API keys, or signed requests between API Gateway and the backend to authenticate the caller + - **Direct service integrations** (SQS, DynamoDB, etc.): The IAM execution role scopes access — ensure the role is only assumable by API Gateway (`apigateway.amazonaws.com` principal) and follows least-privilege for the specific resources +- **Parameter overriding (REST API)**: Override request/response parameters and status codes at method level: + - `RequestParameters` in Integration: Map method request values to integration request values + - `ResponseParameters` in IntegrationResponses: Map integration response values to method response values + - Override response status: `#set($context.responseOverride.status = 400)` in VTL + - Override request headers: `$context.requestOverride.header.` + - **Gotcha**: Applying override to same parameter twice causes 5XX. Build in a variable first, apply at end +- **Binary media types**: API Gateway request and response payloads can be text or binary (JPEG, GZip, XML, PDF, etc.). Configure `binaryMediaTypes` on the API, specifying content types treated as binary (e.g., `image/png`, `application/octet-stream`). **Avoid `*/*` wildcard**: it treats ALL responses as binary, breaking Lambda proxy integrations that return JSON: + - **Lambda proxy integrations**: Lambda must return the response body as base64-encoded and set `isBase64Encoded: true`. The API must have matching `binaryMediaTypes` configured. 
Client sends `Accept` header matching a binary media type + - **Non-proxy integrations**: Set `binaryMediaTypes` on the API, or use `contentHandling` on the `Integration` and `IntegrationResponse` resources: `CONVERT_TO_BINARY` (base64-decode text to binary), `CONVERT_TO_TEXT` (base64-encode binary to text), or undefined (passthrough) +- **Local testing limitation**: `sam local start-api` does not support `Type: AWS` service integrations; it only supports Lambda proxy/non-proxy integrations. Test direct service integrations by deploying to a dev stage and using the API Gateway test console (AWS Console → method → Test) to validate VTL mapping templates against sample requests without a full deployment cycle diff --git a/plugins/aws-serverless/skills/api-gateway/references/service-limits.md b/plugins/aws-serverless/skills/api-gateway/references/service-limits.md new file mode 100644 index 0000000..d4064ae --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/service-limits.md @@ -0,0 +1,156 @@ +# Service Limits and Quotas + +**Important**: Values listed here are **default quotas**; many are adjustable via AWS Support or Service Quotas console. Do not use default values for architectural decisions without first checking with your AWS account team for current limits and increase possibilities. Consult the [latest API Gateway quotas page](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html) for up-to-date values as they change over time. 
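Because the values in this file are defaults, it can help to read the quotas actually applied to an account. The following is a minimal sketch, assuming the Service Quotas `ListServiceQuotas` operation with the `apigateway` service code; the helper names `summarize_quotas` and `fetch_api_gateway_quotas` are hypothetical and not part of any AWS SDK.

```python
from typing import Any, Dict, Iterable, List


def summarize_quotas(pages: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten paginated ListServiceQuotas responses into simple rows."""
    rows = []
    for page in pages:
        for quota in page.get("Quotas", []):
            rows.append(
                {
                    "name": quota["QuotaName"],
                    "value": quota["Value"],
                    "adjustable": quota["Adjustable"],
                }
            )
    # Sort by name so the output is stable between runs.
    return sorted(rows, key=lambda row: row["name"])


def fetch_api_gateway_quotas(client: Any) -> List[Dict[str, Any]]:
    """Read quotas for the `apigateway` service code.

    `client` is assumed to be a boto3 Service Quotas client
    (or any object exposing a compatible `get_paginator`).
    """
    paginator = client.get_paginator("list_service_quotas")
    return summarize_quotas(paginator.paginate(ServiceCode="apigateway"))
```

With credentials configured, `fetch_api_gateway_quotas(boto3.client("service-quotas"))` would return one row per quota in the current region; compare the rows against the tables below before relying on a default.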
+ +## REST API Limits + +| Resource | Default Limit | Adjustable | +| ---------------------------------------- | ------------------------------------------------ | -------------------------- | +| Regional APIs per region | 600 | Yes | +| Edge-optimized APIs per region | 120 | Yes | +| Private APIs per region | 600 | Yes | +| Resources per API | 300 | Yes | +| Stages per API | 10 | No | +| Stage variables per stage | 100 | No | +| Authorizers per API | 10 | Yes | +| API keys per region | 10,000 | Yes | +| Usage plans per region | 300 | Yes | +| VPC links per region | 20 | Yes | +| Custom domains (public) per region | 120 | Yes | +| Custom domains (private) per region | 50 | Yes | +| API mappings per domain | 200 | No | +| Base path max length | 300 chars | No | +| **Payload size** | **10 MB** | **No** | +| **Integration timeout** | **50ms - 29s (up to 300s for Regional/Private)** | **Yes (Regional/Private)** | +| Mapping template size | 300 KB | No | +| `#foreach` iterations | 1,000 | No | +| Access log template | 3 KB | No | +| Header values total | 10,240 bytes | No | +| Header values total (private) | 8,000 bytes | No | +| Cache TTL | 0 - 3,600s | No | +| Cached response max | 1,048,576 bytes (1 MB) | No | +| Method ARN length | 1,600 bytes | No | +| Method-level throttle settings per stage | 20 | No | +| Model size | 400 KB | No | +| mTLS truststore | 1,000 certs / 1 MB | No | +| Idle connection timeout | 310s | No | +| API definition import size | 6 MB | No | +| Lambda authorizer result TTL | 0 - 3,600s (default 300s) | No | + +## HTTP API Limits + +| Resource | Default Limit | Adjustable | +| ---------------------------- | -------------------- | ---------- | +| APIs per region | 600 | Yes | +| Routes per API | 300 | No | +| Integrations per API | 300 | No | +| **Integration timeout** | **30s (hard limit)** | **No** | +| **Payload size** | **10 MB** | **No** | +| Stages per API | 10 | Yes | +| Custom domains per region | 120 | Yes | +| Access log entry max | 1 
MB | No | +| Authorizers per API | 10 | Yes | +| Audiences per JWT authorizer | 50 | No | +| Scopes per route | 10 | No | +| JWKS endpoint timeout | 1,500ms | No | +| Lambda authorizer timeout | 10,000ms | No | +| VPC links per region | 10 | Yes | + +## WebSocket API Limits + +| Resource | Default Limit | Adjustable | +| --------------------- | ------------- | ---------- | +| Frame size | 32 KB | No | +| Message payload | 128 KB | No | +| Connection duration | 2 hours | No | +| Idle timeout | 10 minutes | No | +| New connections rate | 500/s | Yes | +| Routes per API | 300 | No | +| Authorizer result max | 8 KB | No | +| Integration timeout | 29s | No | + +## Account-Level Throttling + +| Resource | Default Limit | Adjustable | +| ----------------- | ------------- | ---------- | +| Steady-state rate | 10,000 rps | Yes | +| Burst capacity | 5,000 | Yes | + +These apply across all REST APIs, HTTP APIs, WebSocket APIs, and WebSocket callback APIs in a region ([shared quota](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html)). Lower defaults apply in opt-in and newer regions (2,500 rps / 1,250 burst): af-south-1, eu-south-1, ap-southeast-3, me-south-1, ap-south-2, ap-southeast-4, eu-south-2, eu-central-2, il-central-1, ca-west-1, ap-southeast-5, ap-southeast-7, mx-central-1. + +## Management API Rate Limits + +| Resource | Limit | +| -------------------------- | ----------------- | +| Total management API calls | 10 rps / 40 burst | +| Custom domain deletion | 1 per 30 seconds | + +## Cache Sizes (REST API) + +| Size | Monthly Cost (approximate) | +| ------- | -------------------------- | +| 0.5 GB | Low | +| 1.6 GB | | +| 6.1 GB | | +| 13.5 GB | | +| 28.4 GB | | +| 58.2 GB | | +| 118 GB | | +| 237 GB | High | + +## Reserved Paths + +- `/ping` and `/sping` are reserved by API Gateway. 
Do not use for API resources + +## Gateway Response Types + +| Response Type | Default Status | Customizable | +| ------------------------------ | -------------- | ------------ | +| ACCESS_DENIED | 403 | Yes | +| API_CONFIGURATION_ERROR | 500 | Yes | +| AUTHORIZER_CONFIGURATION_ERROR | 500 | Yes | +| AUTHORIZER_FAILURE | 500 | Yes | +| BAD_REQUEST_PARAMETERS | 400 | Yes | +| BAD_REQUEST_BODY | 400 | Yes | +| DEFAULT_4XX | Varies | Yes | +| DEFAULT_5XX | Varies | Yes | +| EXPIRED_TOKEN | 403 | Yes | +| INTEGRATION_FAILURE | 504 | Yes | +| INTEGRATION_TIMEOUT | 504 | Yes | +| INVALID_API_KEY | 403 | Yes | +| INVALID_SIGNATURE | 403 | Yes | +| MISSING_AUTHENTICATION_TOKEN | 403 | Yes | +| QUOTA_EXCEEDED | 429 | Yes | +| **REQUEST_TOO_LARGE** | **413** | **No** | +| RESOURCE_NOT_FOUND | 404 | Yes | +| THROTTLED | 429 | Yes | +| UNAUTHORIZED | 401 | Yes | +| WAF_FILTERED | 403 | Yes | + +**Note**: REQUEST_TOO_LARGE (413) is the only gateway response that CANNOT be customized. Use DEFAULT_4XX for CORS headers on this response. 
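As an illustration of the `DEFAULT_4XX` approach, the sketch below builds the arguments for the REST API `put_gateway_response` call. It is only a sketch under assumptions: the helper name `cors_gateway_response_params` is hypothetical, and the `Access-Control-Allow-Headers` value is a placeholder to adapt to the API's real header list.

```python
from typing import Any, Dict


def cors_gateway_response_params(
    rest_api_id: str, response_type: str, allowed_origin: str
) -> Dict[str, Any]:
    """Build keyword arguments for the API Gateway `put_gateway_response` call."""
    return {
        "restApiId": rest_api_id,
        "responseType": response_type,
        "responseParameters": {
            # Static header values must be wrapped in single quotes.
            "gatewayresponse.header.Access-Control-Allow-Origin": f"'{allowed_origin}'",
            # Placeholder header list (assumption); narrow it for production use.
            "gatewayresponse.header.Access-Control-Allow-Headers": "'Content-Type,Authorization'",
        },
    }
```

The result could be passed as `boto3.client("apigateway").put_gateway_response(**params)`, once for `DEFAULT_4XX` and once for `DEFAULT_5XX`. A REST API needs a new deployment before the change takes effect on a stage.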
+ +## Feature Availability Matrix + +| Feature | REST API | HTTP API | WebSocket | +| ------------------------ | --------------------- | --------------------- | -------------------------- | +| Caching | Yes | No | No | +| Usage plans / API keys | Yes | No | No | +| AWS WAF | Yes | No | No | +| Request validation | Yes | No | No | +| VTL mapping templates | Yes | No | Yes | +| Resource policies | Yes | No | No | +| Private endpoints | Yes | No | No | +| mTLS | Yes (custom domain) | Yes (custom domain) | Via CloudFront viewer mTLS | +| JWT authorizer | No | Yes | No | +| Cognito authorizer | Yes | Use JWT | No | +| Lambda authorizer | Yes (TOKEN + REQUEST) | Yes (REQUEST, simple) | Yes ($connect) | +| Canary deployments | Yes | No | No | +| Response streaming | Yes | No | No | +| Automatic deployments | No | Yes | No | +| X-Ray tracing | Yes | No | No | +| Execution logging | Yes | No | Yes | +| Access logging | Yes | Yes | Yes | +| Custom gateway responses | Yes | No | No | +| SDK generation | Yes | No | No | +| API documentation | Yes | No | No | +| Client certificates | Yes | No | No | diff --git a/plugins/aws-serverless/skills/api-gateway/references/troubleshooting.md b/plugins/aws-serverless/skills/api-gateway/references/troubleshooting.md new file mode 100644 index 0000000..dccb368 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/troubleshooting.md @@ -0,0 +1,375 @@ +# Troubleshooting Guide + +## Table of Contents + +- [General Approach](#general-approach) +- [HTTP 400 Bad Request](#http-400-bad-request) +- [HTTP 401 Unauthorized](#http-401-unauthorized) +- [HTTP 403 Forbidden](#http-403-forbidden) +- [HTTP 413 Request Too Large](#http-413-request-too-large) +- [HTTP 429 Too Many Requests](#http-429-too-many-requests) +- [HTTP 500 Internal Server Error](#http-500-internal-server-error) +- [HTTP 502 Bad Gateway](#http-502-bad-gateway) +- [HTTP 504 Gateway Timeout](#http-504-gateway-timeout) +- [SSL/TLS and Certificate 
Issues](#ssltls-and-certificate-issues) +- [CORS Errors](#cors-errors) +- [Lambda Integration Errors](#lambda-integration-errors) +- [VPC and Private API Issues](#vpc-and-private-api-issues) +- [Mapping Template Errors](#mapping-template-errors) +- [WebSocket Issues](#websocket-issues) +- [SQS Integration Errors](#sqs-integration-errors) +- [Useful Debugging Commands](#useful-debugging-commands) + +--- + +## General Approach + +1. Enable execution logging (REST/WebSocket only — [HTTP API supports access logging only](https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html)) AND access logging before troubleshooting +2. Use `x-amzn-requestid` response header to trace specific requests in execution logs +3. Check enhanced observability variables in access logs to identify which phase failed +4. Use CloudWatch Logs Insights for pattern analysis across many requests + +### Request Phase Order + +WAF -> Authenticate -> Authorizer -> Authorize -> Integration + +Each phase exposes `$context.{phase}.status`, `$context.{phase}.latency`, `$context.{phase}.error` in access logs. 
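To make the per-phase variables concrete, here is a minimal sketch that generates a JSON access log format covering every phase listed above. The function name `build_access_log_format` and the output field names (such as `integration_latency`) are assumptions chosen for illustration; only the `$context` variables come from the text.

```python
import json

# Request phases in the order described above.
PHASES = ["waf", "authenticate", "authorizer", "authorize", "integration"]


def build_access_log_format() -> str:
    """Return a single-line JSON access log format with per-phase fields."""
    fields = {
        "requestId": "$context.requestId",
        "status": "$context.status",
    }
    for phase in PHASES:
        for attr in ("status", "latency", "error"):
            # Example: integration_latency -> $context.integration.latency
            fields[f"{phase}_{attr}"] = f"$context.{phase}.{attr}"
    return json.dumps(fields)
```

The returned string is intended for the stage's access log format setting. A phase that did not run for a given request typically logs `-`, which makes it easier to see where the request stopped.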
+ +--- + +## HTTP 400 Bad Request + +### Protocol Mismatch with ALB + +- **Cause**: Sending HTTP to TLS listener or HTTPS to non-TLS listener +- **Fix**: Match protocol (HTTP/HTTPS) to ALB listener type + +### ALB Desync Mode + +- **Cause**: ALB desync mitigation set to "strictest" rejects non-RFC-compliant requests +- **Fix**: Switch to "defensive" mode or ensure requests are RFC-compliant + +### Invalid Request Body + +- **Cause**: Request body fails JSON Schema validation (REST API request validator) +- **Fix**: Check model definition; note `maxItems`/`minItems` are NOT validated + +--- + +## HTTP 401 Unauthorized + +### Lambda Authorizer + +- **Missing token**: Token source header not sent or wrong header name +- **Regex mismatch**: Token fails the Token Validation regex pattern +- **Missing identity sources**: Required headers/query strings not sent (REQUEST type) +- **Fix**: Verify header name, regex pattern, and all identity sources + +### Cognito + +- **Wrong token type**: Use ID token when no scopes configured; access token when scopes configured +- **Expired token**: Check token expiration +- **User pool mismatch**: Verify user pool ID in authorizer config + +--- + +## HTTP 403 Forbidden + +### "Missing Authentication Token" + +- **Cause**: Stage name in URL when using custom domain (stage already mapped) +- **Fix**: Remove stage name from URL path +- Also occurs when: hitting nonexistent resource path, default endpoint disabled +- **Note**: This 403 behavior is REST API specific. 
HTTP API returns **404 Not Found** for nonexistent paths ([Gateway response types](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html)) + +### From VPC (All APIs Fail) + +- **Cause**: Private DNS on `execute-api` VPC endpoint routes ALL API calls through the endpoint +- **Fix**: Use custom domain names for public APIs, or disable private DNS + +### Lambda Authorizer with {proxy+} + +- **Cause**: Authorizer returns IAM policy with path-specific resource ARN. When caching is enabled, that policy is reused for requests to different proxy paths, which are denied by the cached partial policy ([Configure Lambda authorizer caching](https://docs.aws.amazon.com/apigateway/latest/developerguide/configure-api-gateway-lambda-authorization-with-console.html)) +- **Fix**: Return wildcard resource ARN (`*/*`) in policy, or disable authorizer caching + +### Cross-Account IAM + +- **Cause**: Missing either IAM policy (caller account) OR resource policy (API account) +- **Fix**: REST API: configure both. HTTP API: use `sts:AssumeRole` (no resource policies) + +### Resource Policy + +Common causes: + +1. IP not in allow list or is in deny list +2. HTTP method/resource not covered by policy +3. Auth type mismatch (e.g., IAM auth expected but not provided) +4. **API not redeployed after policy change** +5. 
Wrong condition key (`aws:SourceVpce` vs `aws:SourceVpc`) + +### mTLS 403 + +- Certificate issuer not in truststore +- Insecure signature algorithm (must be SHA-256+) +- Self-signed certificates with insufficient key size (RSA-2048+ or ECDSA-256+ required) + +### WAF Filtered + +- `waf-status: 403` in access logs +- Check WAF rules and web ACL configuration + +--- + +## HTTP 413 Request Too Large + +- **Cause**: Payload exceeds 10 MB limit (REST and HTTP API) ([API Gateway quotas](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html)) +- REQUEST_TOO_LARGE is the only gateway response that [cannot be customized](https://docs.aws.amazon.com/apigateway/latest/developerguide/supported-gateway-response-types.html). Use `DEFAULT_4XX` as a catch-all to add CORS headers for all 4xx errors including 413 +- **For larger payloads**: Use S3 presigned URLs for direct client upload/download + +--- + +## HTTP 429 Too Many Requests + +### "429 Too Many Requests" + +- **Cause**: API rate or burst limits exceeded at account, stage, or method level +- **Fix**: Implement retries with jittered exponential backoff; request limit increase + +### "Limit Exceeded" + +- **Cause**: API quota limits exceeded (daily/weekly/monthly) +- **Fix**: Request quota extension via usage plan or AWS Support + +--- + +## HTTP 500 Internal Server Error + +### Lambda Stage Variable Permission + +- **Cause**: Missing Lambda invoke permission when function referenced via stage variable +- **Fix**: Add resource-based policy: `aws lambda add-permission --function-name --statement-id apigateway --action lambda:InvokeFunction --principal apigateway.amazonaws.com --source-arn ` ([Lambda permissions for API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/getting-started-with-lambda-integration.html#getting-started-with-lambda-integration-add-permission)) + +### VPC Link Issues + +- VPC link in "Failed" state +- Unhealthy NLB targets +- Security group blocks (port 443 TCP) 
+- NACLs blocking traffic +- TLS certificate mismatch on backend +- **Fix**: Check VPC link status, NLB target health, security groups, backend TLS chain + +### Lambda Authorizer Malformed Response + +- **Cause**: Lambda authorizer returns invalid JSON, missing `principalId`, or policy exceeding ~8 KB. API Gateway returns **500** (not 401/403) — commonly mistaken for a backend issue ([Authorizer output format](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-lambda-authorizer-output.html)) +- **Fix**: Check execution logs for authorizer errors; verify response includes `principalId` and valid `policyDocument` + +### General Lambda Integration + +- Missing Lambda invoke permissions +- Lambda throttled (concurrency limit) +- Incorrect status code mapping (non-proxy integration) +- **Fix**: Add permissions, implement backoff, configure correct mappings + +--- + +## HTTP 502 Bad Gateway + +### Lambda Proxy Integration + +- **Cause**: Lambda response not in required format +- **REST API / payload format 1.0** ([response format](https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html#api-gateway-simple-proxy-for-lambda-output-format)): + +```json +{ + "statusCode": 200, + "headers": { "Content-Type": "application/json" }, + "body": "{\"key\": \"value\"}", + "isBase64Encoded": false +} +``` + +- `statusCode` must be integer, `body` must be string, `headers` must be object +- **HTTP API payload format 2.0** is more flexible ([response format](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-develop-integrations-lambda.html#http-api-develop-integrations-lambda.response)): `body` can be string or object (auto-serialized to JSON), a bare string return is treated as 200 body, and `cookies` array is supported for Set-Cookie headers +- Also check Lambda permissions and package file permissions + +--- + +## HTTP 504 Gateway Timeout + +### REST API + +- Default: **29 seconds** integration 
timeout (increasable up to 300s for Regional/Private via quota request) +- Lambda continues running but client receives 504 + +### HTTP API + +- **30 seconds hard limit** (not increasable) +- Returns HTTP 504 while Lambda continues ([Troubleshoot 504 errors](https://repost.aws/knowledge-center/api-gateway-504-errors)) + +### Diagnosis + +1. Check `IntegrationLatency` metric for spikes +2. Use CloudWatch Logs Insights to identify slow requests +3. Enable [X-Ray tracing](https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-xray.html) for detailed latency breakdown (REST API only). For HTTP API, enable X-Ray in Lambda functions and correlate via `$context.integration.requestId` + +### Fixes + +- Optimize Lambda: increase memory, reduce cold starts, use provisioned concurrency +- Check backend health and response times +- For REST API Regional/Private: **request timeout increase** up to 300s via AWS Support +- For better UX or operations exceeding max timeout: consider async patterns (SQS, EventBridge, Step Functions): acknowledge immediately, process in background + +--- + +## SSL/TLS and Certificate Issues + +### PKIX Path Building Failed + +- **Cause**: Incomplete certificate chain, unsupported CA, or self-signed cert on backend +- **Fix**: Provide complete certificate chain on backend; use `insecureSkipVerification=true` for testing only + +### General SSLEngine Problem + +- Unsupported CA, expired certificate, invalid chain, unsupported cipher suite +- **Fix**: Verify CA support, check expiry, ensure valid chain. 
RSA keys 2048-4096 bits and ECDSA keys are supported + +### VPC Link TLS + +- NLB TLS listener terminates TLS at NLB; TCP listener passes through +- **Fix**: Use TCP listener for end-to-end TLS; TLS listener for NLB-terminated TLS + +### Wrong Certificate Returned + +- **Cause**: DNS record points to stage URL instead of API Gateway domain name target +- **Fix**: Point DNS to correct API Gateway domain name + +### mTLS Certificate Conflicts + +- Multiple CAs with same subject in truststore +- **Fix**: Clean up truststore, remove duplicate/conflicting CA certificates + +--- + +## CORS Errors + +### Missing CORS Headers on Error Responses + +- **Cause**: Gateway responses (4XX/5XX) bypass integration and don't include CORS headers +- **Fix**: Add CORS headers to `DEFAULT_4XX` and `DEFAULT_5XX` gateway responses + +### Proxy Integration Missing CORS + +- **Cause**: API Gateway doesn't add CORS headers in proxy integration +- **Fix**: Lambda/backend must return `Access-Control-Allow-Origin` and other CORS headers + +### Private API Preflight Failure + +- **Cause**: `x-apigw-api-id` header triggers preflight requests that fail +- **Fix**: Use `Host` header instead of `x-apigw-api-id` + +--- + +## Lambda Integration Errors + +### "Invalid permissions on Lambda function" + +- **Fix (console)**: Re-save Lambda integration to auto-add permissions +- **Fix (CLI)**: `aws lambda add-permission --function-name --statement-id apigateway --action lambda:InvokeFunction --principal apigateway.amazonaws.com --source-arn ` +- **Fix (CloudFormation)**: Add `AWS::Lambda::Permission` resource + +### Lambda Authorizer Permission Error + +- Create IAM role with `lambda:InvokeFunction` and set as Lambda Invoke Role on the authorizer + +### Async Invocation + +- REST API: Set `X-Amz-Invocation-Type: 'Event'` in Integration Request HTTP Headers +- HTTP API: Not directly supported. 
Use proxy Lambda that invokes target Lambda asynchronously + +--- + +## VPC and Private API Issues + +### Connection Checklist + +1. Resource policy allows VPC endpoint or VPC +2. VPC endpoint policy allows the API +3. Security groups allow TCP 443 inbound +4. DNS resolution works (check private DNS setting) + +### Invoke URL Formats (Without Private DNS) + +- Route 53 alias to VPC endpoint +- VPC endpoint public DNS + `Host` header +- VPC endpoint public DNS + `x-apigw-api-id` header + +### On-Premises DNS Resolution + +- Create Route 53 Resolver inbound endpoint in VPC +- Configure on-prem DNS forwarder to forward `amazonaws.com` queries to Resolver + +--- + +## Mapping Template Errors + +### "Invalid mapping expression specified" + +- **Cause**: `{proxy+}` path variable needs URL path parameter mapping +- **Fix**: Define URL path parameter `proxy` mapped from `method.request.path.proxy` + +### "Illegal character in path" + +- **Cause**: Without mapping, API Gateway sends literal `{proxy+}` (containing `{`) +- **Fix**: Same as above. 
Ensure Endpoint URL uses `{proxy}` (without `+`) + +--- + +## WebSocket Issues + +### 410 GoneException + +- Message sent before connection fully established +- Connection terminated by client +- Invalid connectionId format +- **Fix**: Use `getConnection` before `postToConnection`; do NOT post from $connect route Lambda + +### Connection Errors + +- Missing Lambda permissions, incorrect API URL, backend errors +- WebSocket URL format: `wss://api-id.execute-api.region.amazonaws.com/stage` + +--- + +## SQS Integration Errors + +| Error | Cause | Fix | +| ------------------------- | ------------------------------------------- | ------------------------------------------------- | +| UnknownOperationException | Wrong Content-Type or Action name | Verify Content-Type and action | +| AccessDenied | IAM role missing SQS permissions | Add SQS permissions to role | +| KMS.AccessDeniedException | Missing KMS permissions for encrypted queue | Add KMS permissions | +| SignatureDoesNotMatch | Content-Type header mismatch | Align Content-Type between method and integration | +| InvalidAddress | Queue URL doesn't match region | Use correct regional queue URL | + +--- + +## Useful Debugging Commands + +### Systems Manager Automation + +Use `AWSSupport-TroubleshootAPIGatewayHttpErrors` runbook to automatically analyze CloudWatch logs for 4xx/5xx errors. + +### CloudWatch Logs Insights for Specific Request + +``` +fields @timestamp, @message +| filter @message like "REQUEST_ID_HERE" +| sort @timestamp asc +``` + +### Execution Log Sequence (REST API Only) + +Authorizer -> Usage Plan -> Method Request -> Endpoint Request -> Endpoint Response -> Method Response + +### Tracing with x-amzn-requestid + +Every API Gateway response includes `x-amzn-requestid` header. Use this to search execution logs for the full request trace. 
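The request-ID lookup can be scripted. The sketch below only builds the Logs Insights query shown earlier for a given `x-amzn-requestid` value; the function name `build_request_trace_query` and the character whitelist are assumptions, not an AWS-defined format.

```python
import re


def build_request_trace_query(request_id: str) -> str:
    """Return the Logs Insights query from this guide for one request ID."""
    # Reject anything that could break out of the quoted filter string.
    if not re.fullmatch(r"[A-Za-z0-9=_-]+", request_id):
        raise ValueError("unexpected characters in request ID")
    return (
        "fields @timestamp, @message\n"
        f'| filter @message like "{request_id}"\n'
        "| sort @timestamp asc"
    )
```

The query string could then be passed to the CloudWatch Logs `start_query` call (`logGroupName`, `startTime`, `endTime`, `queryString`) against the stage's execution log group, with results read through `get_query_results`.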
diff --git a/plugins/aws-serverless/skills/api-gateway/references/websocket.md b/plugins/aws-serverless/skills/api-gateway/references/websocket.md new file mode 100644 index 0000000..800f559 --- /dev/null +++ b/plugins/aws-serverless/skills/api-gateway/references/websocket.md @@ -0,0 +1,158 @@ +# WebSocket API + +## Route Selection + +- `$connect`: Called when client connects; authorization is enforced only at connection time. Once connected, all subsequent messages bypass authorization checks. Lambda authorizer's cached policy must cover all routes the client will access during the connection lifetime +- `$disconnect`: Best-effort delivery when client disconnects +- `$default`: Fallback for unmatched messages +- Custom routes: Matched by `routeSelectionExpression` (e.g., `$request.body.action`) + +## @connections Management API + +- `POST @connections/{connectionId}`: Send message to client +- `GET @connections/{connectionId}`: Get connection info +- `DELETE @connections/{connectionId}`: Disconnect client +- IAM action: `execute-api:ManageConnections` + +## Session Management + +- Store connectionId **together with user ID** in DynamoDB on `$connect`. When a user reconnects (new connectionId), update the mapping (the user ID stays the same; only the connectionId changes). 
This allows the session to continue seamlessly across reconnects, network drops, and the 2-hour connection limit +- Use a GSI on user ID to look up the current connectionId for a given user (e.g., to send a targeted message to a specific user regardless of which connectionId they currently hold) +- Use DynamoDB TTL to clean up stale connections +- For anonymous users: client generates random user ID, sends via `Sec-WebSocket-Protocol` header +- Backend should echo `Sec-WebSocket-Protocol` in response, as many browser clients will reject the connection if a requested subprotocol is not echoed back +- Handle `GoneException` from `post_to_connection` to detect stale connections + +## Client Resilience Best Practices + +- **Automatic reconnect**: Clients must implement reconnect logic with exponential backoff to handle network interruptions, server-side disconnects, and the 2-hour maximum connection duration limit. The connection will be dropped at 2 hours regardless of activity. Clients should treat this as expected and reconnect transparently +- **Heartbeat / keep-alive**: Clients should send a periodic heartbeat message (e.g., every 5-9 minutes) to prevent the 10-minute idle timeout from closing the connection. Use a lightweight message on `$default` or a dedicated `heartbeat` route. Without heartbeats, idle connections are silently closed and the client may not detect the drop until the next send fails +- **Connection state recovery**: On reconnect, clients should re-authenticate and restore application state (e.g., re-subscribe to topics, re-join rooms). 
Store enough context client-side to resume without data loss + +## SAM Template: Basic WebSocket API + +```yaml +WebSocketApi: + Type: AWS::ApiGatewayV2::Api + Properties: + Name: !Sub "${AWS::StackName}-ws" + ProtocolType: WEBSOCKET + RouteSelectionExpression: "$request.body.action" + +ConnectRoute: + Type: AWS::ApiGatewayV2::Route + Properties: + ApiId: !Ref WebSocketApi + RouteKey: $connect + AuthorizationType: NONE + Target: !Sub "integrations/${ConnectIntegration}" + +ConnectIntegration: + Type: AWS::ApiGatewayV2::Integration + Properties: + ApiId: !Ref WebSocketApi + IntegrationType: AWS_PROXY + IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ConnectFunction.Arn}/invocations" + +DisconnectRoute: + Type: AWS::ApiGatewayV2::Route + Properties: + ApiId: !Ref WebSocketApi + RouteKey: $disconnect + Target: !Sub "integrations/${DisconnectIntegration}" + +DisconnectIntegration: + Type: AWS::ApiGatewayV2::Integration + Properties: + ApiId: !Ref WebSocketApi + IntegrationType: AWS_PROXY + IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${DisconnectFunction.Arn}/invocations" + +DefaultRoute: + Type: AWS::ApiGatewayV2::Route + Properties: + ApiId: !Ref WebSocketApi + RouteKey: $default + Target: !Sub "integrations/${DefaultIntegration}" + +DefaultIntegration: + Type: AWS::ApiGatewayV2::Integration + Properties: + ApiId: !Ref WebSocketApi + IntegrationType: AWS_PROXY + IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${MessageFunction.Arn}/invocations" + +Stage: + Type: AWS::ApiGatewayV2::Stage + Properties: + ApiId: !Ref WebSocketApi + StageName: prod + AutoDeploy: true + +ConnectPermission: + Type: AWS::Lambda::Permission + Properties: + Action: lambda:InvokeFunction + FunctionName: !Ref ConnectFunction + Principal: apigateway.amazonaws.com + SourceArn: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/*/$connect" +``` + 
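The `ConnectFunction` referenced by the template above might look roughly like the following. This is a sketch under assumptions, not the template's actual handler: the `userId` query string parameter, the item attribute names, and the TTL buffer are all hypothetical, and `table` is assumed to be a boto3 DynamoDB `Table` resource.

```python
import time
from typing import Any, Dict

# 2-hour connection limit plus a small buffer (assumption) before TTL cleanup.
CONNECTION_TTL_SECONDS = 2 * 60 * 60 + 300


def build_connection_item(event: Dict[str, Any], user_id: str, now: int) -> Dict[str, Any]:
    """Map a $connect event to an item that stores user ID and connectionId together."""
    return {
        "userId": user_id,
        "connectionId": event["requestContext"]["connectionId"],
        "expiresAt": now + CONNECTION_TTL_SECONDS,  # DynamoDB TTL attribute
    }


def handle_connect(event: Dict[str, Any], table: Any) -> Dict[str, int]:
    """Store the mapping and accept the connection, or reject it with 400."""
    params = event.get("queryStringParameters") or {}
    user_id = params.get("userId")  # hypothetical parameter name
    if not user_id:
        return {"statusCode": 400}
    table.put_item(Item=build_connection_item(event, user_id, int(time.time())))
    return {"statusCode": 200}
```

A Lambda entry point would call `handle_connect(event, boto3.resource("dynamodb").Table(table_name))`. Keeping the table as a parameter lets the logic be exercised without AWS access.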
+## Lambda: Sending Messages via @connections + +```python +import boto3 +import logging +from botocore.exceptions import ClientError + +logger = logging.getLogger() + +def send_to_connection(domain_name, stage, connection_id, message): + """Send a message to a WebSocket client.""" + apigw = boto3.client( + "apigatewaymanagementapi", + endpoint_url=f"https://{domain_name}/{stage}" + ) + try: + apigw.post_to_connection( + ConnectionId=connection_id, + Data=message.encode("utf-8") + ) + except ClientError as e: + if e.response["Error"]["Code"] == "GoneException": + logger.info("Connection %s is gone, cleaning up", connection_id) + # Remove stale connection from DynamoDB + return False + raise + return True +``` + +## Limits + +- Frame size: 32 KB, message payload: 128 KB +- Connection duration: 2 hours, idle timeout: 10 minutes +- New connections: 500/s (adjustable), routes: 300 +- **Route-level throttling**: Configure `ThrottlingBurstLimit` and `ThrottlingRateLimit` per route via stage `RouteSettings` to protect backend integrations from message floods. 
Cannot exceed account-level limit +- Account-level throttle: 10,000 rps / 5,000 burst for WebSocket `@connections` callback, per region (adjustable) +- These are default quotas; check [latest limits](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html) and request increases as needed + +## Pricing + +- Connection minutes: charged per minute of connection time +- Messages: charged per million messages (both sent and received) +- The @connections callback API messages count toward message charges +- No minimum fees or upfront commitments +- See [API Gateway pricing](https://aws.amazon.com/api-gateway/pricing/) for current rates + +## Multi-Region WebSocket + +- DynamoDB Global Tables for connection state tracking +- Route 53 latency-based or geo-based routing for initial connection +- ConnectionId is region-specific; messages must be sent via the region that owns the connection +- Cross-region message propagation via EventBridge or DynamoDB Streams + Lambda + +## Related Templates + +- **Basic WebSocket SAM template**: See above +- **Step Functions integration (sync Express + async with @connections callback)**: See `references/sam-service-integrations.md` (Step Functions section) +- **WebSocket access log format**: See `references/observability.md` (WebSocket API Access Log Format section)