@ZhouXing19 ZhouXing19 commented Oct 28, 2025

Informs: #150015

Rebased from #156307

Release note: TBD

@github-actions

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Oct 30, 2025
@blathers-crl

blathers-crl bot commented Oct 31, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@ZhouXing19 ZhouXing19 force-pushed the canary-main branch 5 times, most recently from e9ff20e to 7c3a778 Compare November 13, 2025 19:15
@ZhouXing19 ZhouXing19 force-pushed the canary-main branch 3 times, most recently from a2d7707 to a205c4c Compare November 14, 2025 21:58
craig bot pushed a commit that referenced this pull request Nov 17, 2025
156307: sql: introduce canary stats settings r=ZhouXing19 a=ZhouXing19

Informs: #150015

This PR introduces two key configurations for the Canary Statistics Rollout feature. Note that it only introduces the configuration settings; the core implementation of canary stats rollout will be in #156385.

### Table Storage parameter `sql_stats_canary_window` (duration)

```sql
CREATE TABLE t (x int) WITH (sql_stats_canary_window = '20s')
```

This duration specifies how long newly collected statistics are eligible for selection by the optimizer, along with the most recent full statistics. It is needed for the canary statistics rollout feature. Only tables with a non-zero canary window have canary statistics rollout enabled.

Release note (sql change): A new table storage parameter `sql_stats_canary_window` has been introduced to enable gradual rollout of newly collected table statistics. It takes a duration string as the value. When set with a non-negative duration, the new statistics remain in a "canary" state for the specified duration before being promoted to stable. This allows for controlled exposure and intervention opportunities before statistics are fully deployed across all queries.

----

### Cluster setting `sql.stats.canary_fraction` (float in [0, 1])

```sql
SET CLUSTER SETTING sql.stats.canary_fraction = 0.2
```
This setting (`canaryFraction`) controls the probabilistic sampling rate for queries participating in the canary statistics rollout feature.
It determines what fraction of queries will use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). 

For example, a value of 0.2 means 20% of queries will test canary stats while 80% use stable stats.
The selection is atomic per query: if a query is chosen for canary evaluation, it will use canary statistics for ALL tables it references (where available). A query never uses a mix of canary and stable statistics. 
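
The per-query selection can be sketched as follows. This is a minimal Python illustration under stated assumptions, not CockroachDB's Go implementation: `pick_stats_kind` and `stats_for_query` are hypothetical names, and comparing a uniform roll against the fraction is an assumption about how the setting is applied.

```python
import random


def pick_stats_kind(canary_fraction, rng=random):
    """Roll once per query and return "canary" or "stable"."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be in [0, 1]")
    # rng.random() is uniform in [0, 1), so a fraction of 0 never picks
    # canary stats and a fraction of 1 always does.
    return "canary" if rng.random() < canary_fraction else "stable"


def stats_for_query(tables, canary_fraction, rng=random):
    """Apply a single decision to every table the query references."""
    kind = pick_stats_kind(canary_fraction, rng)
    # One roll, one answer: the query never mixes canary and stable stats.
    return {table: kind for table in tables}
```

The returned value is the preferred kind per table; a table without canary statistics would still fall back to what it has ("where available"), which this sketch does not model.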

Since this "dice roll" happens for every non-internal query, the memo would otherwise flip frequently, negating the benefits of the query plan cache and causing performance regressions. To mitigate this, queries selected for the canary path bypass the query plan cache entirely: they neither look up existing cached memos nor invalidate them. Instead, we create a one-time memo used only for that single query execution.

This approach assumes sql.stats.canary_fraction will be set to a small value, ensuring that canary queries remain a small fraction of total queries and minimizing the performance impact of recomputation.

One exception: we don't roll the dice when preparing a statement. During statement preparation, `UseCanaryStats` is always false, so the memo cache remains enabled. The rule of thumb is that a cached memo, whether in the query cache or in a prepared statement, is always built with stable stats.
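
The cache-bypass rule can be illustrated with a toy model. This is a hedged Python sketch, not the actual planner code: `PlanCache`, its methods, and the tuple standing in for a memo are all hypothetical.

```python
class PlanCache:
    """Toy memo cache keyed by SQL text; it only ever holds stable-stats memos."""

    def __init__(self):
        self._memos = {}
        self.builds = 0  # number of times a memo was (re)built

    def _build(self, sql, stats_kind):
        self.builds += 1
        return (sql, stats_kind)  # stand-in for an optimizer memo

    def plan(self, sql, use_canary):
        if use_canary:
            # Canary path: no cache lookup, no invalidation, no insertion.
            # The memo is built once and discarded after this execution.
            return self._build(sql, "canary")
        if sql not in self._memos:
            self._memos[sql] = self._build(sql, "stable")
        return self._memos[sql]

    def prepare(self, sql):
        # No dice roll at prepare time: always plan with stable stats.
        return self.plan(sql, use_canary=False)
```

Because the canary path never writes to `_memos`, a canary execution cannot evict or replace the stable memo that later executions reuse.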

### Session Variable `canary_stats_mode` (enum: {auto, off, on})
   - `on`: All queries in the session use canary stats for planning
   - `off`: All queries in the session use stable stats for planning
   - `auto`: The system decides based on `sql.stats.canary_fraction` for
     each query execution
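
How the three modes resolve to a per-execution decision can be sketched as below. This is an illustrative Python sketch; `use_canary_stats` is a hypothetical helper, and the `auto` branch assumes a uniform roll compared against `sql.stats.canary_fraction`.

```python
import random


def use_canary_stats(mode, canary_fraction, rng=random):
    """Decide whether one query execution plans with canary stats."""
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode == "auto":
        # Assumed rule: canary when a uniform roll in [0, 1) is below the fraction.
        return rng.random() < canary_fraction
    raise ValueError(f"unknown canary_stats_mode: {mode!r}")
```

With `on` and `off` the cluster setting is never consulted, so a session can pin its behaviour regardless of the fraction.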

Release note (sql change): We introduce two new settings to control the use of canary statistics in query planning:
1. Cluster setting `sql.stats.canary_fraction` (float, range [0, 1]): Controls what fraction of queries use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). For example, a value of 0.2 means 20% of queries will use canary stats while 80% use stable stats. The selection is atomic per query: if a query is chosen for canary evaluation, it uses canary statistics for ALL tables it references (where available), and it won't use query cache. A query never uses a mix of canary and stable statistics.
2. Session variable `canary_stats_mode` (enum: {auto, off, on}, default: auto):
   - `on`: All queries in the session use canary stats for planning
   - `off`: All queries in the session use stable stats for planning
   - `auto`: The system decides based on `sql.stats.canary_fraction` for each query execution

157146: db-console: add metrics workspace to debug page r=xinhaoz a=xinhaoz

This debug page is similar to `Custom Time Series` but allows for exporting and loading of custom time series dashboards.

Epic: none

Release note: None

157862: decommission: retry on errors for AllocatorCheckRange r=wenyihu6 a=wenyihu6

Fixes: #156849
Release note: decommission pre-check may have failed on transient errors; this
is now fixed with a retry loop.

---

**decommission: retry on errors for AllocatorCheckRange**

Previously, the decommission pre-check would fail for a range if
evalStore.AllocatorCheckRange returned an error. However, transient errors, such
as throttled stores, are only expected to last about 5 seconds
(FailedReservationsTimeout) and can cause the pre-check to fail. This commit
adds a retry loop around AllocatorCheckRange to retry on any errors.

Alternatively, we could check for throttling errors specifically and retry only
on throttling stores, but that would require string or error comparisons, which
complicates the code. So we retry just on all errors here given this only
affects the decommission pre-check.
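
The retry-on-any-error approach can be sketched as follows. This is a minimal Python illustration, not the Go code in the commit: `check_with_retry` is a hypothetical helper, and the attempt count and delay are assumptions; the commit message does not state them.

```python
import time


def check_with_retry(check, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call check() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return check()
        except Exception as err:
            # Retry on every error; no attempt is made to tell throttling
            # apart from other failures, mirroring the trade-off above.
            last_err = err
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise last_err
```

Retrying on all errors keeps the loop free of string or error-type matching, at the cost of also retrying failures that will never resolve.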

---

**kv: add TestDecommissionPreCheckRetryThrottledStores**

Previously, we made decommission prechecks retry on errors, since some transient
issues resolve quickly and shouldn’t cause the precheck to fail. This commit
adds a test that verifies the precheck retries when it encounters transient
throttled errors.





157927: roachtest: link on `large` pool r=rail a=rickystewart

Release note: none
Epic: none

Co-authored-by: ZhouXing19 <zhouxing@uchicago.edu>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: wenyihu6 <wenyi@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
…indow

This commit implements the core logic for canary statistics rollout,
allowing gradual deployment of newly collected full statistics.
Previously, all queries would immediately use the most recent full
statistics, which could cause performance regressions if the new full
statistics were inaccurate.

The implementation adds a `CanaryWindowSize` field in table descriptors
and catalog interfaces to define the canary period, along with logic in
the statistics builder to skip "canary" statistics (the latest stats
within the canary window) when not using the canary path. The cluster
setting `sql.stats.canary_fraction` controls what percentage of queries
use canary statistics.
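
The skipping rule can be sketched as follows. This is a hedged Python illustration rather than the statistics builder's actual code: `visible_stats` and the `(name, created_at)` tuple shape are hypothetical, and what happens when the only available statistic is a canary one is not covered by the commit message.

```python
from datetime import datetime, timedelta


def visible_stats(stats, canary_window, now, use_canary):
    """Return the full statistics a query may use, newest first.

    stats is a list of (name, created_at) tuples.
    """
    ordered = sorted(stats, key=lambda s: s[1], reverse=True)
    if use_canary or canary_window <= timedelta(0) or not ordered:
        # Canary path, or rollout disabled for this table: use everything.
        return ordered
    newest = ordered[0]
    # Only the latest statistic can be "canary", and only while it is
    # younger than the canary window.
    if now - newest[1] < canary_window:
        # Assumption: if this was the only statistic, the sketch returns an
        # empty list; the real fallback behaviour is not specified here.
        return ordered[1:]
    return ordered
```

Once the window has elapsed the newest statistic is treated as stable, so stable-path queries pick it up without any further action.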

Release note (sql change): implement canary full statistics rollout core logic, which
is configurable via the table-level storage parameter
(`canary_window`) and the cluster setting
`sql.stats.canary_fraction`.
… selection

This commit adds a new session variable `stats_as_of` that allows
controlling statistics selection based on a specific timestamp rather
than the current time. Previously, statistics selection was always
relative to the current wall clock time, making it difficult to get
consistent query plans for historical analysis or testing.

This feature is only for debugging and troubleshooting, and should not
be used in production.

The implementation is also integrated into the existing canary
statistics logic to respect the as-of timestamp when determining canary
window boundaries.
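
The effect of an as-of timestamp on the canary window can be sketched as below. This is an illustrative Python sketch with hypothetical names (`reference_time`, `classify`); treating statistics collected after the reference time as ignored is an assumption, not something the commit message states.

```python
from datetime import datetime, timedelta


def reference_time(stats_as_of, wall_clock):
    """Use the session's as-of timestamp when set, else the current time."""
    return stats_as_of if stats_as_of is not None else wall_clock


def classify(created_at, canary_window, stats_as_of, wall_clock):
    """Classify one statistic as "future", "canary", or "stable"."""
    ref = reference_time(stats_as_of, wall_clock)
    if created_at > ref:
        # Assumption: statistics collected after the reference time are ignored.
        return "future"
    # The window is measured from the reference time, not the wall clock.
    if ref - created_at < canary_window:
        return "canary"
    return "stable"
```

Moving the reference time back makes a statistic that is stable today look like a canary again, which is what allows a past planning decision to be reproduced.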

Release note (sql change): adds a new session variable `stats_as_of`
that allows controlling statistics selection based on a specific
timestamp rather than the current time.

This commit adds tests for usage of canary stats rollout in
makeTableStatistics(), which is the main entry point where statistics
are selected for query optimization. This commit focuses on unit
testing the makeTableStatistics() function and does not include
end-to-end logic tests, which would require additional changes to
Builder.maybeAnnotateWithEstimates() to support EXPLAIN ANALYZE
output showing which statistics were used during planning.

To enable testing, this commit adds:
- Handler in opttester for setting the canary window storage
parameter
- Testing knob for controlling the canary fraction setting
- Three new test files covering basic canary stats, histogram canary
stats, and multi-column canary stats scenarios

Release note: None