Description
Problem Statement
Teams running Valkey at scale have no early warning before they hit throughput ceilings. Current tooling — CloudWatch, Grafana, Prometheus — shows where you are, not where you're going. There is no way to answer "when will my instance hit its ops/sec limit?" without manual spreadsheet math. By the time throughput saturation causes latency degradation, the damage is already happening.
Operators need a tool that watches the ops_per_sec trend over time and tells them, in plain English, how much runway they have before they need to scale — or that everything is stable and no action is needed.
Solution
A Throughput Forecasting feature that fits a linear trend over ops_per_sec history and operates in two modes:
- Throughput Trend (no ceiling configured): Shows growth rate and direction only (e.g. "+12% over 6h, rising"). Never shows a time-to-limit estimate, because without a known ceiling any projection would be misleading.
- Throughput Forecast (ceiling configured): Projects forward and estimates when ops/sec will cross the user-defined ceiling, with plain-English output (e.g. "~4h at current growth rate").
The feature ships as a standalone screen (/throughput) with a settings panel, forecast card, and trend chart. Webhook alerts (Pro only) fire when the projected time-to-limit drops below a configurable threshold.
Why linear regression is the right model for v1: Constant upward pressure is the real signal that resources are underprovisioned. Cyclical traffic patterns (business-hours spikes) that wash out to "stable" in a linear fit are actually the correct answer — they indicate resources are provisioned correctly. This is a "weather forecast" tool, not a guarantee.
User Stories
- As an operator, I want to see the current ops/sec growth trend for my Valkey instance, so that I can understand whether throughput is rising, falling, or stable without manual analysis.
- As an operator, I want to configure an ops/sec ceiling for my instance, so that I can get a time-to-limit projection based on my known infrastructure capacity.
- As an operator, I want to see a plain-English estimate of when my instance will hit its ceiling (e.g. "~4h at current growth rate"), so that I can plan scaling actions with lead time.
- As an operator, I want to configure the rolling window for trend analysis (1h, 3h, 6h, 12h, 24h), so that I can tune the sensitivity of the forecast to my traffic patterns.
- As an operator, I want a chart showing historical ops/sec with a trend line projected into the future, so that I can visually understand the trajectory.
- As an operator, I want the trend line rendered as a dashed line distinct from the solid historical data line, so that I can clearly distinguish observed data from projection.
- As an operator, I want a horizontal ceiling reference line on the chart when a ceiling is configured, so that I can visually see where the limit sits relative to current and projected throughput.
- As an operator, I want to see a "Not projected to reach ceiling" message when throughput is flat or declining, so that I know no scaling action is needed.
- As an operator, I want to see a "Ceiling already exceeded" message when current ops/sec is above the ceiling, so that I know immediate action is required.
- As an operator, I want to see a clear message when there is insufficient monitoring history (< 30 minutes), so that I understand why no forecast is available rather than seeing an empty or broken page.
- As an operator, I want to see a live ops/sec counter even when data is insufficient for forecasting, so that the page feels alive while waiting for data to accumulate.
- As an operator, I want the settings panel to be visible even when data is insufficient, so that I can configure the ceiling and window while waiting.
- As an operator, I want settings to auto-save when I change them with an inline confirmation, so that I don't need an explicit save button for quick configuration.
- As an operator, I want to enable/disable throughput forecasting globally, so that I can hide the feature if I don't need it.
- As an operator, I want to enable/disable throughput forecasting per connection, so that I can skip monitoring on dev/test instances.
- As an operator, I want the nav item to be completely hidden when the feature is globally disabled, so that the sidebar doesn't show features I've turned off.
- As an operator, I want to see a disabled banner with a re-enable toggle when a connection has forecasting turned off, so that I can quickly re-enable without navigating to settings.
- As an operator, I want global default settings (rolling window, alert threshold) that are automatically applied when a new connection first accesses throughput forecasting, so that I don't have to configure each connection individually.
- As an operator, I want per-connection settings to override global defaults, so that I can tune individual instances without changing the defaults for everything.
- As an operator, I want a Prometheus gauge (`betterdb_throughput_time_to_limit_seconds`) exported when a ceiling is configured, so that I can integrate the forecast into my existing monitoring stack and Grafana dashboards.
- As an operator, I want the Prometheus gauge to only appear when a ceiling is configured, so that connections without a ceiling don't pollute my metrics with meaningless `-1` values.
- As a Pro user, I want a webhook alert (`throughput.limit`) when the projected time-to-limit drops below a configurable threshold, so that I get notified before I need to scale.
- As a Pro user, I want the webhook alert to fire only on state change (not every poll cycle), so that I don't get spammed with repeated alerts while the condition persists.
- As a Pro user, I want hysteresis on the webhook alert (10% buffer for recovery), so that alerts don't flap when the projection oscillates near the threshold.
- As a Pro user, I want to configure the alert threshold (30m, 1h, 2h, 4h) for how close the time-to-limit must be before firing, so that I can choose my preferred lead time.
- As an operator, I want the global throughput forecasting settings to appear as a tab in the existing Settings page, so that I can manage it alongside other application settings.
- As a Community user, I want full access to the trend/forecast UI, configurable window, and Prometheus export, so that the core feature is available without a paid license.
Implementation Decisions
Architecture
- On-demand + cache: The forecast is computed on API request, not by a dedicated poller. Results are cached in-memory for 60 seconds. Multiple consumers (frontend, Prometheus, webhook loop) share the same cache. This avoids redundant DB queries and regression computation.
- No MultiConnectionPoller: The service does not extend the `MultiConnectionPoller` base class because it reads from the storage layer (the existing `memory_snapshots` table), not from Redis/Valkey directly. It uses its own `setInterval` only for the webhook alert loop.
- Webhook alert loop: A 60-second interval that only runs when the Pro webhook service is available. It queries only connections that have both `enabled = true` and a ceiling configured, keeping overhead near zero for Community users.
Data Model
- Data source: Existing `memory_snapshots` table, specifically the `ops_per_sec` column. Snapshots are already collected every 60 seconds by `MemoryAnalyticsService`. No new data collection is needed.
- Global settings: Three new columns added to the existing `app_settings` table (single-row global config): `throughput_forecasting_enabled` (boolean, default true), `throughput_forecasting_default_rolling_window_ms` (integer, default 21600000 / 6h), `throughput_forecasting_default_alert_threshold_ms` (integer, default 7200000 / 2h).
- Per-connection settings: New `throughput_settings` table keyed by `connection_id`. Columns: `enabled`, `ops_ceiling` (nullable), `rolling_window_ms`, `alert_threshold_ms`, `updated_at`. Not added to `app_settings` because that table is a single row and per-connection data doesn't fit the model.
- Lazy creation: No `throughput_settings` row exists until first access. When the service finds no row and the feature is globally enabled, it creates one from global defaults. This keeps the table clean for connections that never use the feature. Switching to eager creation (on connection add) later requires no schema change.
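The lazy-creation rule can be sketched as below. Function and field names (`resolveSettings`, camelCase variants of the columns) are illustrative assumptions; the defaults and the create-on-first-access behavior come from the data model above.

```typescript
// Sketch of lazy per-connection settings creation from global defaults.

interface GlobalDefaults {
  defaultRollingWindowMs: number;  // 21600000 (6h) per app_settings default
  defaultAlertThresholdMs: number; // 7200000 (2h) per app_settings default
}

interface ThroughputSettings {
  connectionId: string;
  enabled: boolean;
  opsCeiling: number | null;
  rollingWindowMs: number;
  alertThresholdMs: number;
}

function resolveSettings(
  connectionId: string,
  existing: ThroughputSettings | undefined,
  globals: GlobalDefaults,
  save: (row: ThroughputSettings) => void,
): ThroughputSettings {
  // A per-connection row, once created, always overrides global defaults.
  if (existing) return existing;
  const created: ThroughputSettings = {
    connectionId,
    enabled: true,
    opsCeiling: null, // no ceiling until the operator configures one
    rollingWindowMs: globals.defaultRollingWindowMs,
    alertThresholdMs: globals.defaultAlertThresholdMs,
  };
  save(created); // row is persisted only on first access
  return created;
}
```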
API Contract
- `GET /throughput-forecasting/forecast` — returns `ThroughputForecast` for the connection identified by the `x-connection-id` header
- `GET /throughput-forecasting/settings` — returns `ThroughputSettings` for the connection
- `PUT /throughput-forecasting/settings` — updates per-connection settings and invalidates the forecast cache
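A possible shape for the `ThroughputForecast` payload is sketched below. The field names are assumptions, not the actual contract — this only illustrates the kinds of values the endpoints above would need to carry for the UI states described in this document.

```typescript
// Hypothetical payload shape for GET /throughput-forecasting/forecast.

type TrendDirection = "rising" | "falling" | "stable";

interface ThroughputForecast {
  connectionId: string;
  currentOpsPerSec: number;
  growthPercent: number;        // over the rolling window
  trend: TrendDirection;
  opsCeiling: number | null;    // null when no ceiling is configured
  timeToLimitMs: number | null; // set only when rising with a ceiling configured
  insufficientData: boolean;    // true when < 30 minutes of history exists
}

// Example payload for the "~4h at current growth rate" case:
const example: ThroughputForecast = {
  connectionId: "prod-1",
  currentOpsPerSec: 42_000,
  growthPercent: 12,
  trend: "rising",
  opsCeiling: 60_000,
  timeToLimitMs: 4 * 3_600_000,
  insufficientData: false,
};
```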
Forecast Algorithm
- Simple least-squares linear regression over `(timestamp, opsPerSec)` pairs within the rolling window
- Growth rate computed as the slope, expressed in ops/sec per hour
- Growth percent: `((regressionValue at window end) - (regressionValue at window start)) / (regressionValue at window start) * 100`. This uses the regression line's predicted values at both ends of the window, not raw snapshot values, to avoid noise sensitivity.
- Trend direction: "rising" (> +5%), "falling" (< -5%), "stable" (within ±5%). The 5% default avoids noisy classifications at low ops/sec values. This threshold should be revisited if users need finer granularity — it could be made configurable in a future iteration.
- Time-to-limit: `(ceiling - currentPredicted) / slopePerMs`, computed when the trend is rising and a ceiling is set
- Minimum data requirements: 3+ data points spanning 30+ minutes (with 60s polling, that's about 30 snapshots)
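The algorithm above can be sketched end to end as follows. This is an illustrative sketch, not the shipped implementation: function names are assumptions, but the math — least-squares fit, growth percent from the regression line's endpoints, the ±5% trend bands, and the time-to-limit formula — mirrors the bullets above.

```typescript
// Sketch of the forecast math: least-squares fit over (timestamp, opsPerSec),
// growth percent from regression values at the window ends, trend
// classification at ±5%, and time-to-limit from the slope.

interface Point { t: number; ops: number } // t = epoch milliseconds

function linearFit(points: Point[]): { slopePerMs: number; intercept: number } {
  const n = points.length;
  const meanT = points.reduce((s, p) => s + p.t, 0) / n;
  const meanOps = points.reduce((s, p) => s + p.ops, 0) / n;
  let num = 0, den = 0;
  for (const p of points) {
    num += (p.t - meanT) * (p.ops - meanOps);
    den += (p.t - meanT) ** 2;
  }
  const slopePerMs = den === 0 ? 0 : num / den;
  return { slopePerMs, intercept: meanOps - slopePerMs * meanT };
}

function forecast(points: Point[], ceiling: number | null) {
  const { slopePerMs, intercept } = linearFit(points);
  const predict = (t: number) => intercept + slopePerMs * t;
  const start = points[0].t;
  const end = points[points.length - 1].t;
  // Growth percent uses regression values at both window ends, not raw
  // snapshots, to avoid sensitivity to noisy endpoints.
  const growthPercent = ((predict(end) - predict(start)) / predict(start)) * 100;
  const trend = growthPercent > 5 ? "rising" : growthPercent < -5 ? "falling" : "stable";
  let timeToLimitMs: number | null = null;
  // Time-to-limit only applies when rising toward a configured, not-yet-exceeded ceiling.
  if (ceiling !== null && trend === "rising" && predict(end) < ceiling) {
    timeToLimitMs = (ceiling - predict(end)) / slopePerMs;
  }
  return { slopePerMs, growthPercent, trend, timeToLimitMs };
}
```

With perfectly linear data growing from 100 to 110 ops/sec over an hour and a ceiling of 120, this projects roughly one more hour to the ceiling.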
Webhook Integration
- New `WebhookEventType.THROUGHPUT_LIMIT = 'throughput.limit'` added to the enum, mapped to the Pro tier
- `dispatchThroughputLimit` method added to the `IWebhookEventsProService` interface and implemented in the proprietary service
- Uses the existing `dispatchThresholdAlert` with `isAbove: false` (fires when time-to-limit drops below the threshold)
- Alert state managed by the existing LRU cache with a 24h TTL and a 10% hysteresis factor. On settings change, stale alert state expires naturally rather than being actively cleared.
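The fire-on-state-change and hysteresis behavior can be sketched as below. This is a simplified illustration, assuming a plain `Map` instead of the real LRU cache and an invented `AlertState` class; the 10% recovery buffer and the single-fire-per-transition rule come from the design above.

```typescript
// Sketch of state-change-only alerting with a 10% hysteresis buffer:
// an alert fires once when time-to-limit drops below the threshold, and
// cannot fire again until the projection recovers past threshold * 1.1.

class AlertState {
  private firing = new Map<string, boolean>();

  // Returns true only when the alert should actually be dispatched.
  check(key: string, timeToLimitMs: number, thresholdMs: number, hysteresis = 0.1): boolean {
    const wasFiring = this.firing.get(key) ?? false;
    if (!wasFiring && timeToLimitMs < thresholdMs) {
      this.firing.set(key, true);
      return true; // fire exactly once on entering the alert state
    }
    // Recovery requires clearing the threshold by the hysteresis buffer,
    // so projections oscillating near the threshold don't cause flapping.
    if (wasFiring && timeToLimitMs > thresholdMs * (1 + hysteresis)) {
      this.firing.set(key, false);
    }
    return false;
  }
}
```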
Frontend
- New standalone page at `/throughput` with four states: globally disabled, per-connection disabled, insufficient data, normal operation
- Settings panel at the top of the page with auto-save (500ms debounce, inline "Saved" indicator)
- Rolling window presets: 1h, 3h, 6h (default), 12h, 24h
- Chart time range matches the regression window (single control, less confusion)
- Historical data rendered as solid line, projection as dashed line extending to ceiling or 2x window
- Ceiling shown as horizontal reference line when configured
- Alert threshold dropdown only visible when ceiling is set AND user has Pro tier
- Nav item conditionally rendered (hidden when global toggle is off), placed between Cluster and Anomaly Detection
- New "Throughput Forecasting" tab on Settings page for global toggle and defaults
Tier Gating
- Community: Full trend/forecast UI, all settings, configurable window, Prometheus export
- Pro: Everything in Community + webhook alerts (requires ceiling configured)
Testing Decisions
What makes a good test
Tests verify behavior through public interfaces (getForecast, getSettings, updateSettings), not implementation details. The linear regression algorithm is tested indirectly through forecast results, not as a separate unit. A test should survive an internal refactor — if the regression algorithm changes but produces the same forecasts, no tests should break.
Testing approach
- Service tests use the real memory adapter for storage (integration through `StoragePort`, no mocks on storage methods). `SettingsService` and `ConnectionRegistry` are mocked as external dependencies.
- Webhook tests mock `WebhookDispatcherService`, following the existing webhook-pro test pattern.
- TDD vertical slices: Tests are written in strict RED-GREEN order, one slice at a time. No horizontal batching (writing all tests first, then all implementation).
Modules tested
- Storage round-trip (memory adapter): save, retrieve, upsert, delete, active settings filtering
- Forecast service: rising/falling/stable trends, ceiling vs no-ceiling modes, ceiling exceeded, insufficient data, lazy settings creation, settings update with cache invalidation, per-connection disable/enable, forecast cache TTL
- Webhook dispatch (proprietary): Pro dispatch with correct parameters (`isAbove: false`, correct alert key), Community skip
Prior art
- Memory analytics service tests — same NestJS testing pattern with mocked storage and connection registry
- Webhook dispatcher service tests — threshold alert testing with hysteresis
- Webhook-pro config monitor tests — proprietary service test pattern
Out of Scope
- Cycle-aware forecasting: No diurnal/weekly pattern detection. Linear regression is sufficient for v1 — cyclical patterns washing out to "stable" is correct behavior.
- Multi-metric composite scoring: The "Scaling Readiness View" (separate project item) will combine throughput forecasting with memory, connection, and CPU headroom. This PRD covers throughput only.
- Cache Size Simulator: Related but separate feature that projects memory requirements.
- SQLite/PostgreSQL adapter implementation: Will be implemented alongside the memory adapter but is not the focus of the TDD test suite. The memory adapter provides integration coverage of the StoragePort contract.
- E2E tests: Controller-level HTTP tests are not included in the TDD flow. The service layer has the complex logic; the controller is a thin pass-through.
- Frontend tests: No React component tests in this PRD. Frontend behavior is verified manually.
- Configurable trend direction threshold: The 5% threshold for rising/falling/stable classification is hardcoded for v1. Making it user-configurable is deferred unless feedback indicates a need.
Further Notes
- The Scaling Readiness View (separate project board item) depends on this feature — it uses throughput forecasting as one of its five composite score dimensions.
- The rolling window maximum is 24 hours, producing at most 1,440 data points (60s intervals). This is a trivial query with the existing indexes on `(connection_id, timestamp)`.
- The Prometheus gauge `betterdb_throughput_time_to_limit_seconds` uses -1 for "not applicable" (declining/stable/exceeded) when a ceiling is configured. The gauge is completely absent for connections without a ceiling.
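The gauge export rule can be stated compactly as below. The helper name is an assumption; the absent / -1 / seconds semantics come from the note above.

```typescript
// Sketch of the Prometheus gauge export rule for
// betterdb_throughput_time_to_limit_seconds:
//   - no ceiling configured      -> gauge omitted entirely (undefined)
//   - ceiling, but not rising    -> -1 ("not applicable")
//   - ceiling and rising         -> projected seconds to the ceiling

function gaugeValue(
  opsCeiling: number | null,
  timeToLimitMs: number | null,
): number | undefined {
  if (opsCeiling === null) return undefined; // gauge absent from /metrics
  if (timeToLimitMs === null) return -1;     // declining/stable/exceeded
  return timeToLimitMs / 1000;               // metric is in seconds
}
```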