Nori Premortem

A system monitoring daemon that intelligently diagnoses machine issues before critical failure using Claude AI.

Installation

npm install -g nori-premortem@latest
# Create config.json first (see Configuration)
nori-premortem --config config.json

Why Premortem?

When a machine dies, you are often left with no clear idea of what happened or why, because the machine takes everything with it. Traditional monitoring rarely captures meaningful diagnostics: it makes strong assumptions up front about possible sources of failure and cannot adjust dynamically based on in-stream information. You can usually tell that your system ran out of memory (OOM), but not why, or, more importantly, where in your code the problem originated.

Premortem spawns a Claude agent the moment a threshold is breached. The agent analyzes the system in real time and streams its diagnostics to a safe backend through your webhook. Instead of metric graphs, engineers get AI-powered root cause analysis.

Configuration

Create your configuration file from the example template:

cp defaultConfig.example.json defaultConfig.json
# Edit defaultConfig.json with your webhookUrl, anthropicApiKey, and desired thresholds

Example configuration:

{
  "webhookUrl": "https://your-server.com/webhook-endpoint",
  "anthropicApiKey": "sk-ant-your-api-key-here",
  "pollingInterval": 10000,
  "thresholds": {
    "memoryPercent": 90,
    "diskPercent": 85,
    "cpuPercent": 80
  },
  "agentConfig": {
    "customPrompt": "You are diagnosing system performance issues. Focus on memory usage, disk space, CPU utilization, and process behavior."
  },
  "heartbeat": {
    "url": "https://your-server.com/heartbeat-endpoint",
    "interval": 60000,
    "processName": "my-process"
  }
}

Configuration Options

  • webhookUrl (required): HTTP endpoint to receive diagnostic output
    • Must accept POST requests with JSON payloads containing Claude SDK message objects
    • Messages are grouped by session_id field
    • Each message follows the format: {type: string, session_id: string, ...other_fields}
  • anthropicApiKey (required): Your Anthropic API key for Claude
  • pollingInterval (optional, default: 10000): Milliseconds between system checks
  • thresholds (required): At least one threshold must be configured
    • memoryPercent: Trigger when memory usage exceeds this percentage (uses "available" memory, not "used", to avoid false alerts from Linux buffer/cache)
    • diskPercent: Trigger when disk usage exceeds this percentage
    • cpuPercent: Trigger when CPU usage exceeds this percentage
  • agentConfig (optional): Claude agent configuration
    • customPrompt: Additional context prepended to diagnostic prompt (default: null)
    • Note: Model, allowed tools, and max turns are controlled by SDK defaults and not user-configurable
  • heartbeat (optional): Health check configuration
    • url: Endpoint to receive periodic heartbeat signals
    • interval (default: 60000): Milliseconds between heartbeat signals
    • processName: Process name to monitor and report in heartbeat
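The rules above (two required fields, at least one threshold, and the documented defaults) can be sketched as a small validation function. This is an illustrative sketch, not the project's actual loader: the names `PremortemConfig`, `Thresholds`, and `validateConfig` are hypothetical, and treating `heartbeat.url` and `heartbeat.processName` as plain pass-through fields is an assumption.

```typescript
// Hypothetical types mirroring the documented options; not taken from the project source.
interface Thresholds {
  memoryPercent?: number;
  diskPercent?: number;
  cpuPercent?: number;
}

interface PremortemConfig {
  webhookUrl: string;
  anthropicApiKey: string;
  pollingInterval: number;
  thresholds: Thresholds;
  agentConfig: { customPrompt: string | null };
  heartbeat?: { url: string; interval: number; processName: string };
}

// Defaults as documented in the option list.
const DEFAULT_POLLING_INTERVAL_MS = 10000;
const DEFAULT_HEARTBEAT_INTERVAL_MS = 60000;

function validateConfig(raw: any): PremortemConfig {
  if (typeof raw?.webhookUrl !== "string" || raw.webhookUrl === "") {
    throw new Error("webhookUrl is required");
  }
  if (typeof raw.anthropicApiKey !== "string" || raw.anthropicApiKey === "") {
    throw new Error("anthropicApiKey is required");
  }

  // Keep only the numeric thresholds that were actually provided.
  const thresholds: Thresholds = {};
  for (const key of ["memoryPercent", "diskPercent", "cpuPercent"] as const) {
    const value = raw.thresholds?.[key];
    if (typeof value === "number") thresholds[key] = value;
  }
  if (Object.keys(thresholds).length === 0) {
    throw new Error("at least one threshold must be configured");
  }

  const config: PremortemConfig = {
    webhookUrl: raw.webhookUrl,
    anthropicApiKey: raw.anthropicApiKey,
    pollingInterval: raw.pollingInterval ?? DEFAULT_POLLING_INTERVAL_MS,
    thresholds,
    agentConfig: { customPrompt: raw.agentConfig?.customPrompt ?? null },
  };
  if (raw.heartbeat) {
    config.heartbeat = {
      url: raw.heartbeat.url,
      interval: raw.heartbeat.interval ?? DEFAULT_HEARTBEAT_INTERVAL_MS,
      processName: raw.heartbeat.processName,
    };
  }
  return config;
}
```

Failing on the first invalid field matches the fail-fast behavior the daemon applies elsewhere at startup.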

Usage

Running premortem will:

  1. Validate the Anthropic API key with a test query (fail-fast if invalid)
  2. Create the archive directory at ~/.premortem-logs if it doesn't exist
  3. Validate that the archive directory is writable (fail-fast if not)
  4. Start monitoring system metrics
  5. When a threshold is breached, spawn a Claude agent with system context
  6. Stream all agent output to your webhook endpoint
  7. Save complete session transcripts to ~/.premortem-logs/agent-{sessionId}.jsonl
  8. Reset after the agent completes, ready to trigger again
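Steps 5 through 8 amount to a small trigger-and-reset cycle. The sketch below is one way to model it, under the assumption that breaches detected while an agent is already running do not spawn a second agent. `TriggerGate`, `transcriptPath`, and `toJsonlLine` are hypothetical names, not part of the package.

```typescript
type DaemonState = "monitoring" | "diagnosing";

// Hypothetical gate: allows one agent at a time and re-arms when it finishes.
class TriggerGate {
  private state: DaemonState = "monitoring";

  // Returns true if this breach should spawn an agent (step 5).
  tryTrigger(): boolean {
    if (this.state === "diagnosing") return false;
    this.state = "diagnosing";
    return true;
  }

  // Called when the agent completes, so new breaches can trigger again (step 8).
  reset(): void {
    this.state = "monitoring";
  }
}

// Builds the transcript file path documented in step 7.
function transcriptPath(archiveDir: string, sessionId: string): string {
  return `${archiveDir}/agent-${sessionId}.jsonl`;
}

// JSONL stores one JSON document per line.
function toJsonlLine(message: unknown): string {
  return JSON.stringify(message) + "\n";
}
```

For example, `transcriptPath("~/.premortem-logs", "abc123")` yields `~/.premortem-logs/agent-abc123.jsonl`.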

Stop the daemon with Ctrl+C.

Webhook Integration

Premortem streams diagnostic data to any HTTP endpoint that accepts POST requests. This allows integration with existing monitoring infrastructure, logging systems, or custom backends.

Webhook Endpoint Requirements

The configured webhook endpoint must:

  • Accept POST requests with raw Claude SDK message payloads
  • Handle messages grouped by session_id field
  • Be highly available (premortem uses fire-and-forget delivery with no retry logic)

Message Format

Messages are sent as raw Claude SDK output, one message per POST:

{
  "type": "assistant",
  "session_id": "session-abc123",
  "message": {
    "role": "assistant",
    "content": "Analyzing system metrics..."
  }
}

The session_id field groups messages into a single diagnostic transcript artifact on the backend.
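On the receiving side, grouping by session_id could look roughly like the sketch below. It is a minimal in-memory illustration for a custom backend, not the reference server: `SdkMessage`, `parseWebhookBody`, and `TranscriptStore` are hypothetical names, and only the `type` and `session_id` fields are assumed to be present on every message.

```typescript
// Minimal shape of a message as described above; other fields are passed through untouched.
interface SdkMessage {
  type: string;
  session_id: string;
  [field: string]: unknown;
}

// Parses one POST body and checks the two fields the grouping relies on.
function parseWebhookBody(body: string): SdkMessage {
  const parsed = JSON.parse(body);
  if (typeof parsed?.type !== "string" || typeof parsed?.session_id !== "string") {
    throw new Error("message must contain string fields 'type' and 'session_id'");
  }
  return parsed as SdkMessage;
}

// Hypothetical in-memory store: one ordered transcript per session_id.
class TranscriptStore {
  private sessions = new Map<string, SdkMessage[]>();

  append(message: SdkMessage): void {
    const transcript = this.sessions.get(message.session_id) ?? [];
    transcript.push(message);
    this.sessions.set(message.session_id, transcript);
  }

  transcript(sessionId: string): SdkMessage[] {
    return this.sessions.get(sessionId) ?? [];
  }
}
```

Because delivery is fire-and-forget, a real receiver would likely persist each message as soon as it arrives instead of holding transcripts only in memory.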

Architecture

Daemon (monitoring loop)
  ↓ (threshold breach detected)
Agent SDK (Claude diagnostics)
  ↓ (immediate streaming)
Webhook Endpoint (your server)

Key Design Decisions:

  1. First-breach-only: When multiple thresholds are breached at once, only the first in priority order (memory, then disk, then CPU) triggers an agent
  2. Reset on completion: Agent finish resets daemon state, allowing new breaches to trigger
  3. Fire-and-forget webhooks: No retries - webhook endpoint must be reliable
  4. API key in config: anthropicApiKey is stored in the config file and exported to the environment before SDK calls
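The first-breach-only rule and the "available memory" calculation mentioned in the configuration options can be illustrated as follows. This is a sketch under stated assumptions: `MetricSample`, `ThresholdLimits`, `memoryPercentFromAvailable`, and `firstBreach` are hypothetical names, and "exceeds" is read as strictly greater than the threshold.

```typescript
type Metric = "memory" | "disk" | "cpu";

interface MetricSample {
  memoryPercent: number;
  diskPercent: number;
  cpuPercent: number;
}

interface ThresholdLimits {
  memoryPercent?: number;
  diskPercent?: number;
  cpuPercent?: number;
}

// Memory pressure derived from "available" memory, so reclaimable
// buffer/cache on Linux is not counted as used.
function memoryPercentFromAvailable(totalBytes: number, availableBytes: number): number {
  return ((totalBytes - availableBytes) / totalBytes) * 100;
}

// Returns the highest-priority breached metric, or null if none is breached.
function firstBreach(sample: MetricSample, limits: ThresholdLimits): Metric | null {
  // Priority order from the design decisions: memory, then disk, then CPU.
  const checks: Array<[Metric, number, number | undefined]> = [
    ["memory", sample.memoryPercent, limits.memoryPercent],
    ["disk", sample.diskPercent, limits.diskPercent],
    ["cpu", sample.cpuPercent, limits.cpuPercent],
  ];
  for (const [metric, value, limit] of checks) {
    // Unconfigured thresholds are skipped entirely.
    if (limit !== undefined && value > limit) return metric;
  }
  return null;
}
```

With the example thresholds (90, 85, 80), a sample where both memory and CPU are over their limits reports only `"memory"`.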

Development

Run tests:

npm test

Watch mode:

npm run test:watch

Build:

npm run build

Troubleshooting

Daemon not starting:

  • Check that anthropicApiKey is valid in config
  • Verify that the webhook URL is reachable
  • Check that the archive directory ~/.premortem-logs is writable

No agent triggering:

  • Check threshold values - may need to lower them for testing
  • Review daemon logs for system metrics

Webhook not receiving data:

  • Test webhook endpoint separately
  • Check firewall/network settings
  • Remember: no retries, so endpoint must be reliable

License

MIT
