145 changes: 145 additions & 0 deletions TASKS.md
@@ -0,0 +1,145 @@
# Alertmanager Boot Timeout Implementation Plan

## Overview
Implement a configurable boot timeout for Alertmanager in HA mode only. During the timeout period, the API server will be available for alert ingestion, but the readiness probe will return NOT READY until the timeout expires and the cluster settles.

## Requirements
- ✅ **HA mode only**: Feature only applies when clustering is enabled (`--cluster.listen-address != ""`)
- ✅ **Zero impact on single replica**: No changes to single replica startup behavior
- ✅ **API available immediately**: API server accepts alerts during boot timeout
- ✅ **Readiness reflects boot state**: `/-/ready` returns 503 during boot timeout
- ✅ **Configurable timeout**: Default 5 minutes, configurable via flag
- ✅ **Comprehensive logging**: Clear progress indication for users
- ✅ **Minimal code changes**: Touch as little existing code as possible

## Implementation Steps

### Step 1: Add cluster boot timeout flag
**File**: `cmd/alertmanager/main.go`
- Add new flag: `cluster.boot-timeout` with 5m default
- Place with other cluster flags for proper namespacing
- Include clear help text indicating HA-only behavior

### Step 2: Create boot manager
**File**: `cluster/boot.go` (new file)
- Create `BootManager` struct to handle boot timeout logic
- Implement methods:
- `NewBootManager(timeout, logger)` - constructor
- `Start()` - begin boot timeout period
- `IsReady()` - check if boot timeout has expired
- `WaitReady(ctx)` - block until boot timeout expires
- Include comprehensive logging with progress updates
- Use existing patterns from cluster package

### Step 3: Modify startup sequence
**File**: `cmd/alertmanager/main.go`
- Create boot manager when clustering is enabled
- Start boot timeout before API server startup
- Move cluster join logic after boot timeout expires
- Preserve existing startup order for single replica mode

### Step 4: Update readiness endpoint
**File**: `ui/web.go`
- Modify `/-/ready` endpoint to check boot state when clustering enabled
- Return 503 (Service Unavailable) during boot timeout
- Return current behavior after boot timeout + cluster ready
- No changes for single replica mode

### Step 5: Enhance cluster status reporting
**Files**: `api/v2/api.go`, `api/v2/models/cluster_status.go`
- Add "booting" status to cluster status enum
- Update status reporting logic in API v2
- Maintain backward compatibility with the existing "ready", "settling", and "disabled" statuses
- Show "booting" during boot timeout period

### Step 6: Integration and testing
- Verify single replica mode unchanged
- Test HA mode boot sequence
- Validate readiness probe behavior
- Check status API responses
- Confirm logging output

## Technical Details

### New Components
```go
// cluster/boot.go
type BootManager struct {
	timeout   time.Duration
	startTime time.Time
	readyc    chan struct{}
	logger    *slog.Logger
}
```

### Modified Startup Sequence (HA mode only)
```
1. Create cluster peer (existing)
2. Create boot manager (NEW)
3. Start boot timeout (NEW)
4. Start API server (existing, now immediate)
5. Wait for boot timeout (NEW)
6. Join cluster (existing, now delayed)
7. Wait for cluster settle (existing)
8. Set ready state (existing)
```
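
A minimal sketch of how this sequence might be wired in `cmd/alertmanager/main.go`, assuming the `BootManager` API from Step 2; the variable names (`listenAndServeAPI`, `joinCluster`) and the exact join call are placeholders, not the final code:

```go
// Hypothetical HA startup wiring (names are illustrative, not from this diff).
if peer != nil { // clustering enabled
	bootMgr := cluster.NewBootManager(*clusterBootTimeout, logger)
	bootMgr.Start()

	// Step 4: the API server starts immediately and accepts alerts.
	go listenAndServeAPI()

	// Step 5: block until the boot timeout expires, then join (step 6).
	if err := bootMgr.WaitReady(ctx); err != nil {
		logger.Error("boot timeout interrupted", "err", err)
		return
	}
	joinCluster(peer) // placeholder for the existing join logic
}
```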

### Flag Addition
```go
clusterBootTimeout = kingpin.Flag("cluster.boot-timeout",
	"Time to wait before joining the gossip cluster. During this period, "+
		"the API server accepts alerts but readiness probe returns NOT READY. "+
		"Only applies when clustering is enabled.").Default("5m").Duration()
```

### Readiness Logic
```go
// Single replica: always ready (current behavior).
// HA mode: ready only after boot timeout + cluster settled.
func readyHandler(bootManager *cluster.BootManager, peer cluster.ClusterPeer) int {
	if peer == nil {
		// Single replica - always ready.
		return 200
	}
	if !bootManager.IsReady() {
		// Still in boot timeout.
		return 503
	}
	if !peer.Ready() {
		// Cluster not settled.
		return 503
	}
	return 200
}
```
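
As a usage sketch, the handler above could back the `/-/ready` endpoint roughly like this; the mux wiring and response bodies are assumptions about `ui/web.go` (using standard `net/http` and `fmt`), not taken from this diff:

```go
mux.HandleFunc("/-/ready", func(w http.ResponseWriter, r *http.Request) {
	code := readyHandler(bootManager, peer)
	w.WriteHeader(code)
	if code == http.StatusOK {
		fmt.Fprintln(w, "OK")
	} else {
		fmt.Fprintln(w, "NOT READY: boot timeout or cluster settling in progress")
	}
})
```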

## Benefits
- **Controlled startup**: Prevents premature cluster participation
- **Alert availability**: API accepts alerts immediately
- **Clear observability**: Comprehensive logging and status reporting
- **Zero regression**: Single replica mode completely unchanged
- **Flexible configuration**: Adjustable timeout for different environments
- **Backward compatible**: Existing deployments continue working

## Files to Modify
1. `cmd/alertmanager/main.go` - Add flag, modify startup sequence
2. `cluster/boot.go` - New boot manager implementation
3. `ui/web.go` - Update readiness endpoint
4. `api/v2/api.go` - Add boot status reporting
5. `api/v2/models/cluster_status.go` - Add "booting" status (if needed)

## Testing Scenarios
1. **Single replica**: Verify no behavior changes
2. **HA with default timeout**: Test 5-minute boot delay
3. **HA with custom timeout**: Test different timeout values
4. **HA with zero timeout**: Test immediate cluster join (current behavior)
5. **API availability**: Confirm alerts accepted during boot timeout
6. **Readiness probe**: Verify 503 → 200 transition (see the test sketch after this list)
7. **Status API**: Check "booting" → "settling" → "ready" progression
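
A sketch of how the boot-timeout half of scenario 6 could be covered with a unit test against the `BootManager` API introduced in this PR; the timeout values are illustrative:

```go
func TestBootManagerReadyAfterTimeout(t *testing.T) {
	bm := cluster.NewBootManager(100*time.Millisecond, slog.Default())
	bm.Start()

	if bm.IsReady() {
		t.Fatal("boot manager reported ready before the timeout expired")
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := bm.WaitReady(ctx); err != nil {
		t.Fatalf("WaitReady returned an error: %v", err)
	}
	if !bm.IsReady() {
		t.Fatal("boot manager not ready after WaitReady returned")
	}
}
```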

## Risk Mitigation
- **Gradual rollout**: Feature has no effect in single replica mode
- **Escape hatch**: Zero timeout maintains current behavior
- **Comprehensive logging**: Clear visibility into boot process
- **Minimal changes**: Reduces risk of introducing bugs
- **Backward compatibility**: Existing configurations work unchanged
9 changes: 9 additions & 0 deletions api/api.go
@@ -46,6 +46,11 @@ type API struct {
	inFlightSem chan struct{}
}

// StatusProvider provides enhanced cluster status information.
type StatusProvider interface {
	Status() string
}

// Options for the creation of an API object. Alerts, Silences, AlertStatusFunc
// and GroupMutedFunc are mandatory. The zero value for everything else is a safe
// default.
@@ -62,6 +67,9 @@ type Options struct {
	GroupMutedFunc func(routeID, groupKey string) ([]string, bool)
	// Peer from the gossip cluster. If nil, no clustering will be used.
	Peer cluster.ClusterPeer
	// StatusProvider provides enhanced status information including boot state.
	// If nil, standard peer status will be used.
	StatusProvider StatusProvider
	// Timeout for all HTTP connections. The zero value (and negative
	// values) result in no timeout.
	Timeout time.Duration
@@ -125,6 +133,7 @@ func New(opts Options) (*API, error) {
		opts.GroupMutedFunc,
		opts.Silences,
		opts.Peer,
		opts.StatusProvider,
		l.With("version", "v2"),
		opts.Registry,
	)
18 changes: 17 additions & 1 deletion api/v2/api.go
@@ -52,9 +52,15 @@ import (
	"github.com/prometheus/alertmanager/types"
)

// statusProvider provides enhanced cluster status information.
type statusProvider interface {
	Status() string
}

// API represents an Alertmanager API v2.
type API struct {
	peer           cluster.ClusterPeer
	statusProvider statusProvider
	silences       *silence.Silences
	alerts         provider.Alerts
	alertGroups    groupsFn
@@ -91,6 +97,7 @@ func NewAPI(
	gmf groupMutedFunc,
	silences *silence.Silences,
	peer cluster.ClusterPeer,
	statusProvider statusProvider,
	l *slog.Logger,
	r prometheus.Registerer,
) (*API, error) {
@@ -100,6 +107,7 @@
		alertGroups:    gf,
		groupMutedFunc: gmf,
		peer:           peer,
		statusProvider: statusProvider,
		silences:       silences,
		logger:         l,
		m:              metrics.NewAlerts(r),
@@ -197,7 +205,15 @@ func (api *API) getStatusHandler(params general_ops.GetStatusParams) middleware.

	// If alertmanager cluster feature is disabled, then api.peers == nil.
	if api.peer != nil {
		var status string

		// Use enhanced status provider if available.
		if api.statusProvider != nil {
			status = api.statusProvider.Status()
		} else {
			// Fall back to peer status.
			status = api.peer.Status()
		}

		peers := []*open_api_models.PeerStatus{}
		for _, n := range api.peer.Peers() {
169 changes: 169 additions & 0 deletions cluster/boot.go
@@ -0,0 +1,169 @@
// Copyright 2025 Prometheus Team
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cluster

import (
	"context"
	"log/slog"
	"sync"
	"time"
)

// BootManager manages the boot timeout for cluster joining.
// It provides a configurable delay before the alertmanager joins the gossip cluster,
// allowing the API server to be ready for alert ingestion while keeping the
// readiness probe in NOT READY state until the boot timeout expires.
type BootManager struct {
	timeout   time.Duration
	startTime time.Time
	readyc    chan struct{}
	logger    *slog.Logger
	once      sync.Once
}

// NewBootManager creates a new boot manager with the specified timeout.
func NewBootManager(timeout time.Duration, logger *slog.Logger) *BootManager {
	return &BootManager{
		timeout: timeout,
		readyc:  make(chan struct{}),
		logger:  logger,
	}
}

// Start begins the boot timeout period. This should be called once during startup.
// It starts a goroutine that will close the ready channel after the timeout expires.
func (bm *BootManager) Start() {
	bm.once.Do(func() {
		bm.startTime = time.Now()
		if bm.timeout <= 0 {
			// Zero or negative timeout means immediate readiness.
			bm.logger.Info("Boot timeout disabled, proceeding immediately")
			close(bm.readyc)
			return
		}

		bm.logger.Info("Starting boot timeout", "timeout", bm.timeout)
		go bm.runBootTimeout()
	})
}

// IsReady returns true if the boot timeout has expired.
func (bm *BootManager) IsReady() bool {
	select {
	case <-bm.readyc:
		return true
	default:
		return false
	}
}

// WaitReady blocks until the boot timeout expires or the context is cancelled.
func (bm *BootManager) WaitReady(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-bm.readyc:
		return nil
	}
}

// runBootTimeout runs the boot timeout timer and logs progress.
func (bm *BootManager) runBootTimeout() {
	ticker := time.NewTicker(30 * time.Second) // Log progress every 30 seconds.
	defer ticker.Stop()

	deadline := bm.startTime.Add(bm.timeout)

	for {
		select {
		case <-time.After(time.Until(deadline)):
			// Timeout expired.
			elapsed := time.Since(bm.startTime)
			bm.logger.Info("Boot timeout completed, ready to join cluster", "elapsed", elapsed)
			close(bm.readyc)
			return
		case <-ticker.C:
			// Progress update.
			elapsed := time.Since(bm.startTime)
			remaining := bm.timeout - elapsed
			if remaining > 0 {
				bm.logger.Info("Boot timeout in progress", "elapsed", elapsed, "remaining", remaining)
			}
		}
	}
}

// CompositeReadinessChecker combines boot manager and cluster peer readiness.
type CompositeReadinessChecker struct {
	bootManager *BootManager
	peer        interface{} // Can be *Peer or ClusterPeer.
}

// NewCompositeReadinessChecker creates a readiness checker that considers both boot timeout and cluster readiness.
func NewCompositeReadinessChecker(bootManager *BootManager, peer interface{}) *CompositeReadinessChecker {
	return &CompositeReadinessChecker{
		bootManager: bootManager,
		peer:        peer,
	}
}

// IsReady returns true only if the boot timeout has expired and the cluster is ready (if clustering is enabled).
func (c *CompositeReadinessChecker) IsReady() bool {
	// If no boot manager, we're in single replica mode - always ready.
	if c.bootManager == nil {
		return true
	}

	// In HA mode, the boot timeout must have expired first.
	if !c.bootManager.IsReady() {
		return false
	}

	// If clustering is enabled, the cluster must also be ready.
	if c.peer != nil {
		// Try to cast to *Peer to check readiness.
		if peer, ok := c.peer.(*Peer); ok {
			return peer.Ready()
		}
	}

	// Boot timeout expired and no cluster - ready.
	return true
}

// Status returns the current status string, considering both boot timeout and cluster state.
func (c *CompositeReadinessChecker) Status() string {
	// If no boot manager, we're in single replica mode - disabled.
	if c.bootManager == nil {
		return "disabled"
	}

	// In HA mode, check the boot timeout first.
	if !c.bootManager.IsReady() {
		// Use "settling" during the boot timeout to maintain API compatibility.
		// In the future, this could be "booting" with an OpenAPI spec update.
		return "settling"
	}

	// Boot timeout expired, check the cluster state.
	if c.peer != nil {
		// Try to cast to *Peer to get the cluster status.
		if peer, ok := c.peer.(*Peer); ok {
			return peer.Status() // "ready" or "settling"
		}
	}

	// Boot timeout expired and no cluster - ready.
	return "ready"
}
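
How the composite checker might plug into the new `StatusProvider` option from `api/api.go` — a hypothetical sketch of the `cmd/alertmanager/main.go` wiring, which is not part of this diff:

```go
bootMgr := cluster.NewBootManager(*clusterBootTimeout, logger)
checker := cluster.NewCompositeReadinessChecker(bootMgr, peer)

apiOpts := api.Options{
	// Alerts, Silences, and the other mandatory fields elided.
	Peer:           peer,
	StatusProvider: checker, // satisfies api.StatusProvider via Status() string
}
```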