145 changes: 145 additions & 0 deletions TASKS.md
@@ -0,0 +1,145 @@
# Alertmanager Boot Timeout Implementation Plan

## Overview
Implement a configurable boot timeout for Alertmanager in HA mode only. During the timeout period, the API server will be available for alert ingestion, but the readiness probe will return NOT READY until the timeout expires and the cluster settles.

## Requirements
- ✅ **HA mode only**: Feature only applies when clustering is enabled (`--cluster.listen-address != ""`)
- ✅ **Zero impact on single replica**: No changes to single replica startup behavior
- ✅ **API available immediately**: API server accepts alerts during boot timeout
- ✅ **Readiness reflects boot state**: `/-/ready` returns 503 during boot timeout
- ✅ **Configurable timeout**: Default 5 minutes, configurable via flag
- ✅ **Comprehensive logging**: Clear progress indication for users
- ✅ **Minimal code changes**: Touch as little existing code as possible

## Implementation Steps

### Step 1: Add cluster boot timeout flag
**File**: `cmd/alertmanager/main.go`
- Add new flag: `cluster.boot-timeout` with 5m default
- Place with other cluster flags for proper namespacing
- Include clear help text indicating HA-only behavior

### Step 2: Create boot manager
**File**: `cluster/boot.go` (new file)
- Create `BootManager` struct to handle boot timeout logic
- Implement methods:
- `NewBootManager(timeout, logger)` - constructor
- `Start()` - begin boot timeout period
- `IsReady()` - check if boot timeout has expired
- `WaitReady(ctx)` - block until boot timeout expires
- Include comprehensive logging with progress updates
- Use existing patterns from cluster package

### Step 3: Modify startup sequence
**File**: `cmd/alertmanager/main.go`
- Create boot manager when clustering is enabled
- Start boot timeout before API server startup
- Move cluster join logic after boot timeout expires
- Preserve existing startup order for single replica mode

### Step 4: Update readiness endpoint
**File**: `ui/web.go`
- Modify `/-/ready` endpoint to check boot state when clustering enabled
- Return 503 (Service Unavailable) during boot timeout
- Return current behavior after boot timeout + cluster ready
- No changes for single replica mode

### Step 5: Enhance cluster status reporting
**Files**: `api/v2/api.go`, `api/v2/models/cluster_status.go`
- Add "booting" status to cluster status enum
- Update status reporting logic in API v2
- Maintain backward compatibility with the existing "ready", "settling", and "disabled" statuses
- Show "booting" during boot timeout period

### Step 6: Integration and testing
- Verify single replica mode unchanged
- Test HA mode boot sequence
- Validate readiness probe behavior
- Check status API responses
- Confirm logging output

## Technical Details

### New Components
```go
// cluster/boot.go
type BootManager struct {
	timeout   time.Duration
	startTime time.Time
	readyc    chan struct{}
	logger    *slog.Logger
}
```

### Modified Startup Sequence (HA mode only)
```
1. Create cluster peer (existing)
2. Create boot manager (NEW)
3. Start boot timeout (NEW)
4. Start API server (existing, now immediate)
5. Wait for boot timeout (NEW)
6. Join cluster (existing, now delayed)
7. Wait for cluster settle (existing)
8. Set ready state (existing)
```
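
A minimal sketch of how this sequence might be wired in `cmd/alertmanager/main.go`, assuming the `BootManager` API from Step 2; the variable names (`listenAndServeAPI`, `joinCluster`) and the exact join call are placeholders, not the final code:

```go
// Hypothetical HA startup wiring (names are illustrative, not from this diff).
if peer != nil { // clustering enabled
	bootMgr := cluster.NewBootManager(*clusterBootTimeout, logger)
	bootMgr.Start()

	// Step 4: the API server starts immediately and accepts alerts.
	go listenAndServeAPI()

	// Step 5: block until the boot timeout expires, then join (step 6).
	if err := bootMgr.WaitReady(ctx); err != nil {
		logger.Error("boot timeout interrupted", "err", err)
		return
	}
	joinCluster(peer) // placeholder for the existing join logic
}
```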

### Flag Addition
```go
clusterBootTimeout = kingpin.Flag("cluster.boot-timeout",
	"Time to wait before joining the gossip cluster. During this period, "+
		"the API server accepts alerts but readiness probe returns NOT READY. "+
		"Only applies when clustering is enabled.").Default("5m").Duration()
```

### Readiness Logic
```go
// Single replica: always ready (current behavior).
// HA mode: ready only after boot timeout + cluster settled.
func readyHandler(bootManager *cluster.BootManager, peer cluster.ClusterPeer) int {
	if peer == nil {
		// Single replica - always ready.
		return 200
	}
	if !bootManager.IsReady() {
		// Still in boot timeout.
		return 503
	}
	if !peer.Ready() {
		// Cluster not settled.
		return 503
	}
	return 200
}
```
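
As a usage sketch, the handler above could back the `/-/ready` endpoint roughly like this; the mux wiring and response bodies are assumptions about `ui/web.go` (using standard `net/http` and `fmt`), not taken from this diff:

```go
mux.HandleFunc("/-/ready", func(w http.ResponseWriter, r *http.Request) {
	code := readyHandler(bootManager, peer)
	w.WriteHeader(code)
	if code == http.StatusOK {
		fmt.Fprintln(w, "OK")
	} else {
		fmt.Fprintln(w, "NOT READY: boot timeout or cluster settling in progress")
	}
})
```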

## Benefits
- **Controlled startup**: Prevents premature cluster participation
- **Alert availability**: API accepts alerts immediately
- **Clear observability**: Comprehensive logging and status reporting
- **Zero regression**: Single replica mode completely unchanged
- **Flexible configuration**: Adjustable timeout for different environments
- **Backward compatible**: Existing deployments continue working

## Files to Modify
1. `cmd/alertmanager/main.go` - Add flag, modify startup sequence
2. `cluster/boot.go` - New boot manager implementation
3. `ui/web.go` - Update readiness endpoint
4. `api/v2/api.go` - Add boot status reporting
5. `api/v2/models/cluster_status.go` - Add "booting" status (if needed)

## Testing Scenarios
1. **Single replica**: Verify no behavior changes
2. **HA with default timeout**: Test 5-minute boot delay
3. **HA with custom timeout**: Test different timeout values
4. **HA with zero timeout**: Test immediate cluster join (current behavior)
5. **API availability**: Confirm alerts accepted during boot timeout
6. **Readiness probe**: Verify 503 → 200 transition (see the test sketch after this list)
7. **Status API**: Check "booting" → "settling" → "ready" progression
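
A sketch of how the boot-timeout half of scenario 6 could be covered with a unit test against the `BootManager` API introduced in this PR; the timeout values are illustrative:

```go
func TestBootManagerReadyAfterTimeout(t *testing.T) {
	bm := cluster.NewBootManager(100*time.Millisecond, slog.Default())
	bm.Start()

	if bm.IsReady() {
		t.Fatal("boot manager reported ready before the timeout expired")
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := bm.WaitReady(ctx); err != nil {
		t.Fatalf("WaitReady returned an error: %v", err)
	}
	if !bm.IsReady() {
		t.Fatal("boot manager not ready after WaitReady returned")
	}
}
```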

## Risk Mitigation
- **Gradual rollout**: Feature has no effect in single replica mode
- **Escape hatch**: Zero timeout maintains current behavior
- **Comprehensive logging**: Clear visibility into boot process
- **Minimal changes**: Reduces risk of introducing bugs
- **Backward compatibility**: Existing configurations work unchanged
9 changes: 9 additions & 0 deletions api/api.go
@@ -46,6 +46,11 @@ type API struct {
	inFlightSem chan struct{}
}

// StatusProvider provides enhanced cluster status information.
type StatusProvider interface {
	Status() string
}

// Options for the creation of an API object. Alerts, Silences, AlertStatusFunc
// and GroupMutedFunc are mandatory. The zero value for everything else is a safe
// default.
@@ -62,6 +67,9 @@ type Options struct {
	GroupMutedFunc func(routeID, groupKey string) ([]string, bool)
	// Peer from the gossip cluster. If nil, no clustering will be used.
	Peer cluster.ClusterPeer
	// StatusProvider provides enhanced status information including boot state.
	// If nil, standard peer status will be used.
	StatusProvider StatusProvider
	// Timeout for all HTTP connections. The zero value (and negative
	// values) result in no timeout.
	Timeout time.Duration
@@ -125,6 +133,7 @@ func New(opts Options) (*API, error) {
		opts.GroupMutedFunc,
		opts.Silences,
		opts.Peer,
		opts.StatusProvider,
		l.With("version", "v2"),
		opts.Registry,
	)
18 changes: 17 additions & 1 deletion api/v2/api.go
@@ -52,9 +52,15 @@ import (
	"github.com/prometheus/alertmanager/types"
)

// statusProvider provides enhanced cluster status information.
type statusProvider interface {
	Status() string
}

// API represents an Alertmanager API v2.
type API struct {
	peer           cluster.ClusterPeer
	statusProvider statusProvider
	silences       *silence.Silences
	alerts         provider.Alerts
	alertGroups    groupsFn
@@ -91,6 +97,7 @@ func NewAPI(
	gmf groupMutedFunc,
	silences *silence.Silences,
	peer cluster.ClusterPeer,
	statusProvider statusProvider,
	l *slog.Logger,
	r prometheus.Registerer,
) (*API, error) {
@@ -100,6 +107,7 @@
		alertGroups:    gf,
		groupMutedFunc: gmf,
		peer:           peer,
		statusProvider: statusProvider,
		silences:       silences,
		logger:         l,
		m:              metrics.NewAlerts(r),
@@ -197,7 +205,15 @@ func (api *API) getStatusHandler(params general_ops.GetStatusParams) middleware.

	// If alertmanager cluster feature is disabled, then api.peers == nil.
	if api.peer != nil {
		var status string

		// Use enhanced status provider if available.
		if api.statusProvider != nil {
			status = api.statusProvider.Status()
		} else {
			// Fall back to peer status.
			status = api.peer.Status()
		}

		peers := []*open_api_models.PeerStatus{}
		for _, n := range api.peer.Peers() {
169 changes: 169 additions & 0 deletions cluster/boot.go
@@ -0,0 +1,169 @@
// Copyright 2025 Prometheus Team
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cluster

import (
	"context"
	"log/slog"
	"sync"
	"time"
)

// BootManager manages the boot timeout for cluster joining.
// It provides a configurable delay before the alertmanager joins the gossip cluster,
// allowing the API server to be ready for alert ingestion while keeping the
// readiness probe in NOT READY state until the boot timeout expires.
type BootManager struct {
	timeout   time.Duration
	startTime time.Time
	readyc    chan struct{}
	logger    *slog.Logger
	once      sync.Once
}

// NewBootManager creates a new boot manager with the specified timeout.
func NewBootManager(timeout time.Duration, logger *slog.Logger) *BootManager {
	return &BootManager{
		timeout: timeout,
		readyc:  make(chan struct{}),
		logger:  logger,
	}
}

// Start begins the boot timeout period. This should be called once during startup.
// It starts a goroutine that will close the ready channel after the timeout expires.
func (bm *BootManager) Start() {
	bm.once.Do(func() {
		bm.startTime = time.Now()
		if bm.timeout <= 0 {
			// Zero or negative timeout means immediate readiness.
			bm.logger.Info("Boot timeout disabled, proceeding immediately")
			close(bm.readyc)
			return
		}

		bm.logger.Info("Starting boot timeout", "timeout", bm.timeout)
		go bm.runBootTimeout()
	})
}

// IsReady returns true if the boot timeout has expired.
func (bm *BootManager) IsReady() bool {
	select {
	case <-bm.readyc:
		return true
	default:
		return false
	}
}

// WaitReady blocks until the boot timeout expires or the context is cancelled.
func (bm *BootManager) WaitReady(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-bm.readyc:
		return nil
	}
}

// runBootTimeout runs the boot timeout timer and logs progress.
func (bm *BootManager) runBootTimeout() {
	ticker := time.NewTicker(30 * time.Second) // Log progress every 30 seconds.
	defer ticker.Stop()

	deadline := bm.startTime.Add(bm.timeout)

	for {
		select {
		case <-time.After(time.Until(deadline)):
			// Timeout expired.
			elapsed := time.Since(bm.startTime)
			bm.logger.Info("Boot timeout completed, ready to join cluster", "elapsed", elapsed)
			close(bm.readyc)
			return
		case <-ticker.C:
			// Progress update.
			elapsed := time.Since(bm.startTime)
			remaining := bm.timeout - elapsed
			if remaining > 0 {
				bm.logger.Info("Boot timeout in progress", "elapsed", elapsed, "remaining", remaining)
			}
		}
	}
}

// CompositeReadinessChecker combines boot manager and cluster peer readiness.
type CompositeReadinessChecker struct {
	bootManager *BootManager
	peer        interface{} // Can be *Peer or ClusterPeer.
}

// NewCompositeReadinessChecker creates a readiness checker that considers both boot timeout and cluster readiness.
func NewCompositeReadinessChecker(bootManager *BootManager, peer interface{}) *CompositeReadinessChecker {
	return &CompositeReadinessChecker{
		bootManager: bootManager,
		peer:        peer,
	}
}

// IsReady returns true only if the boot timeout has expired and the cluster is ready (if clustering is enabled).
func (c *CompositeReadinessChecker) IsReady() bool {
	// If no boot manager, we're in single replica mode - always ready.
	if c.bootManager == nil {
		return true
	}

	// In HA mode, the boot timeout must have expired first.
	if !c.bootManager.IsReady() {
		return false
	}

	// If clustering is enabled, the cluster must also be ready.
	if c.peer != nil {
		// Try to cast to *Peer to check readiness.
		if peer, ok := c.peer.(*Peer); ok {
			return peer.Ready()
		}
	}

	// Boot timeout expired and no cluster - ready.
	return true
}

// Status returns the current status string, considering both boot timeout and cluster state.
func (c *CompositeReadinessChecker) Status() string {
	// If no boot manager, we're in single replica mode - disabled.
	if c.bootManager == nil {
		return "disabled"
	}

	// In HA mode, check the boot timeout first.
	if !c.bootManager.IsReady() {
		// Use "settling" during the boot timeout to maintain API compatibility.
		// In the future, this could be "booting" with an OpenAPI spec update.
		return "settling"
	}

	// Boot timeout expired, check the cluster state.
	if c.peer != nil {
		// Try to cast to *Peer to get the cluster status.
		if peer, ok := c.peer.(*Peer); ok {
			return peer.Status() // "ready" or "settling"
		}
	}

	// Boot timeout expired and no cluster - ready.
	return "ready"
}
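
How the composite checker might plug into the new `StatusProvider` option from `api/api.go` — a hypothetical sketch of the `cmd/alertmanager/main.go` wiring, which is not part of this diff:

```go
bootMgr := cluster.NewBootManager(*clusterBootTimeout, logger)
checker := cluster.NewCompositeReadinessChecker(bootMgr, peer)

apiOpts := api.Options{
	// Alerts, Silences, and the other mandatory fields elided.
	Peer:           peer,
	StatusProvider: checker, // satisfies api.StatusProvider via Status() string
}
```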