Idea: Runbook Execution Engine — pup runbooks
#143
Replies: 3 comments
The idea reminds me a bit of http://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html from AWS (I have not used the product, but I have heard of it). It may be worth perusing the docs to see if there's anything worth taking inspiration from. Suggestion: maybe instead of: this may feel more obvious?:
The prototype was merged! I'm curious how people will use this.
In addition, a new "extensions" feature was added to extend pup with custom scripts, and it will auto-inject the credentials into those scripts.
Overview
This proposes adding a runbook execution engine to pup, enabling teams to define, discover, and execute operational runbooks directly from the CLI. Runbooks could cover deployments, incident remediation, routing fixes, maintenance windows, and any other repeatable operational procedure.
Motivation
Operational runbooks today are scattered — Confluence docs, Notion pages, shell scripts, internal wikis. They go stale, they aren't discoverable, and they're not executable.
pup already has authenticated access to the Datadog API, making it the ideal host for runbooks that combine:
Proposed Interface
Storage
Runbooks live in pup's config directory:
Runbook Format
YAML-based with templating (Tera or Handlebars-style {{ var }} syntax):
Sequence: deploy-service
sequenceDiagram
    actor Operator
    participant pup
    participant DatadogAPI as Datadog API
    participant DDWorkflow as Datadog Workflows
    participant Shell as Shell Tools
    Operator->>pup: pup runbooks run deploy-service --set SERVICE=payments --set VERSION=2.4.1
    pup->>DatadogAPI: GET /slos?service=payments&env=staging
    DatadogAPI-->>pup: SLO status
    pup->>DatadogAPI: GET /incidents?state=active&service=payments
    DatadogAPI-->>pup: incident list
    alt Active incidents found
        pup->>Operator: ⚠️ Active incidents detected. Proceed?
        Operator->>pup: confirm
    end
    pup->>DDWorkflow: POST /workflows/abc123/run {service, version, environment}
    DDWorkflow-->>pup: execution started
    loop Poll every 30s (timeout 10m)
        pup->>DatadogAPI: GET /monitors/98765
        DatadogAPI-->>pup: monitor status
    end
    opt notify-oncall installed
        pup->>Shell: notify-oncall --service=payments --version=2.4.1
        Shell-->>pup: notified
    end
    pup-->>Operator: ✅ Deployment complete
Example: Incident Triage Runbook
Sequence: incident-triage
sequenceDiagram
    actor Operator
    participant pup
    participant DatadogAPI as Datadog API
    participant DDWorkflow as Datadog Workflows
    participant Shell as Shell Tools
    Operator->>pup: pup runbooks run incident-triage --set INCIDENT_ID=12345 --set SERVICE=payments
    pup->>DatadogAPI: GET /incidents/12345
    DatadogAPI-->>pup: incident details
    pup->>DatadogAPI: GET /deployments?service=payments&from=2h
    DatadogAPI-->>pup: recent deployments
    pup->>DatadogAPI: GET /logs?query=service:payments status:error&from=30m
    DatadogAPI-->>pup: recent error logs
    pup->>DatadogAPI: GET /monitors?tag=service:payments&status=alert
    DatadogAPI-->>pup: alerting monitors
    pup->>DDWorkflow: POST /workflows/xyz-auto-mitigation/run {incident_id, service}
    DDWorkflow-->>pup: mitigation triggered
    opt diag-tool installed
        pup->>Shell: diag-tool --service=payments --format=summary
        Shell-->>pup: diagnostics output
    end
    pup-->>Operator: ✅ Triage complete - review output above
Example: Feature Flag Rollback
Sequence: rollback-feature-flag
sequenceDiagram
    actor Operator
    participant pup
    participant DatadogAPI as Datadog API
    Operator->>pup: pup runbooks run rollback-feature-flag --set FLAG_KEY=new-checkout --set SERVICE=payments
    pup->>DatadogAPI: GET /feature-flags/new-checkout
    DatadogAPI-->>pup: flag state (enabled: true)
    pup->>DatadogAPI: GET /metrics?query=avg:trace.servlet.request.errors{service:payments}&from=15m
    DatadogAPI-->>pup: baseline error rate
    pup->>Operator: Rolling back new-checkout for payments. Proceed?
    Operator->>pup: confirm
    pup->>DatadogAPI: PATCH /feature-flags/new-checkout {enabled: false}
    DatadogAPI-->>pup: flag disabled
    loop Poll every 30s (timeout 5m)
        pup->>DatadogAPI: GET /metrics?query=avg:trace.servlet.request.errors{service:payments}&from=5m
        DatadogAPI-->>pup: current error rate
    end
    pup-->>Operator: ✅ Rollback complete - error rate decreasing
Example: Redis Maintenance Window
Sequence: maintenance-window
sequenceDiagram
    actor Operator
    participant pup
    participant DatadogAPI as Datadog API
    participant Shell as Shell Tools
    Operator->>pup: pup runbooks run maintenance-window --set SERVICE=redis --set DURATION=30m
    pup->>DatadogAPI: POST /downtimes {scope: service:redis, duration: 30m}
    DatadogAPI-->>pup: downtime created (ID: 55123)
    pup->>Shell: lb-cli drain --service=redis
    Shell-->>pup: drain initiated
    loop Poll every 10s (timeout 5m)
        pup->>DatadogAPI: GET /metrics?query=sum:requests{service:redis}&from=1m
        DatadogAPI-->>pup: request count
    end
    pup->>Operator: Traffic drained. Proceed with maintenance on redis?
    Operator->>pup: confirm
    Note over Operator,Shell: Operator performs maintenance work
    pup->>DatadogAPI: DELETE /downtimes/55123
    DatadogAPI-->>pup: downtime cancelled
    pup-->>Operator: ✅ Maintenance window complete
Step Kinds
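Several of the sequences above poll on an interval with a timeout (for example, every 30s for up to 10m). Here is a minimal sketch of what that loop might look like, in Python purely for illustration. `poll_until` and its parameters are hypothetical names, not anything from pup, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    interval_s: float,
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() every interval_s seconds until it returns True.

    Returns True as soon as check() succeeds, or False once another
    wait would run past the timeout. (Hypothetical helper, not pup's API.)
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        # Stop if sleeping one more interval would exceed the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

A step such as "Poll every 30s (timeout 10m)" would then map to `poll_until(check, interval_s=30, timeout_s=600)`, where `check` wraps whatever assertion the runbook makes on the step's output.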
pup: pup subcommand; output available for assertions/polling
shell: ($PATH, optional: true skips if not found)
datadog-workflow
confirm
http
Templating & Imports
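On the templating side, here is a rough sketch of how `--set KEY=VALUE` pairs might be substituted into `{{ var }}` placeholders. It is a plain regex stand-in written in Python for illustration, not Tera or Handlebars, and `parse_set_flags` and `render` are hypothetical names. Failing on an undefined variable is an assumption about the desired behaviour.

```python
import re

# Matches {{ NAME }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def parse_set_flags(pairs):
    """Turn ["SERVICE=payments", ...] into {"SERVICE": "payments", ...}."""
    values = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        values[key] = value
    return values


def render(template, values):
    """Replace every {{ NAME }} in template; fail on undefined names."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"undefined template variable: {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


values = parse_set_flags(["SERVICE=payments", "VERSION=2.4.1"])
print(render("notify-oncall --service={{ SERVICE }} --version={{ VERSION }}", values))
# prints: notify-oncall --service=payments --version=2.4.1
```

Failing on an undefined variable, rather than rendering an empty string, seems the safer default for operational commands, but that is a design choice the proposal leaves open.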
Runbooks can import shared templates to avoid duplication:
Shared templates define reusable step groups, making it easy for platform teams to distribute approved runbook fragments across services.
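To make the import idea concrete, here is a small sketch of how shared step groups might be expanded into a flat step list before execution. The `import` key, the `shared` mapping, and the step dictionaries are all assumptions made for illustration; the actual import syntax is not settled here.

```python
def expand_imports(steps, shared, _stack=()):
    """Inline shared step groups referenced by a hypothetical "import" key.

    steps:  list of step dicts, e.g. {"kind": "pup", "name": "check-slos"}
            or {"import": "preflight"}.
    shared: mapping of template name -> list of step dicts.
    """
    expanded = []
    for step in steps:
        name = step.get("import")
        if name is None:
            # Ordinary step: keep as-is.
            expanded.append(step)
            continue
        if name in _stack:
            chain = " -> ".join(_stack + (name,))
            raise ValueError(f"import cycle: {chain}")
        if name not in shared:
            raise KeyError(f"unknown shared template: {name}")
        # Shared templates may themselves import other templates.
        expanded.extend(expand_imports(shared[name], shared, _stack + (name,)))
    return expanded


shared = {
    "preflight": [
        {"kind": "pup", "name": "check-slos"},
        {"kind": "confirm", "name": "proceed"},
    ],
}
runbook_steps = [
    {"import": "preflight"},
    {"kind": "datadog-workflow", "name": "deploy"},
]
print([s["name"] for s in expand_imports(runbook_steps, shared)])
# prints: ['check-slos', 'proceed', 'deploy']
```

Resolving imports up front, and rejecting cycles, would also give a `--dry-run` mode a fully resolved plan to print.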
Open Questions
pup runbooks import support Git refs (e.g. a shared internal repo)?
pup record runbook execution history locally (what ran, when, outcome)?
kind: secret) to avoid passing tokens on the CLI?
pup runbooks run --dry-run to print the resolved plan without executing.
pup runbooks list surface available Datadog Workflows alongside local runbooks?
Why pup Is the Right Host
~/.config/pup/) is a natural home for user-defined runbooks
pup commands become first-class runbook primitives without any extra setup
Would love feedback on the format design, step kinds, templating approach, and whether there's appetite for this as a first-class feature.