sql: introduce canary full stats rollout and stats_as_of session var
#156385
Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
156307: sql: introduce canary stats settings r=ZhouXing19 a=ZhouXing19

Informs: #150015

This PR introduces two key configurations for the Canary Statistics Rollout feature. Note that this PR only introduces the configuration settings. The core implementation for canary stats rollout will be in #156385.

### Table storage parameter `sql_stats_canary_window` (duration)

```sql
CREATE TABLE t (x int) WITH (sql_stats_canary_window = '20s')
```

This duration specifies how long newly collected statistics remain eligible for selection alongside the most recent full statistics for the optimizer. It is needed for the canary statistics rollout feature. Only tables with a non-zero canary window have canary statistics rollout enabled.

Release note (sql change): A new table storage parameter `sql_stats_canary_window` has been introduced to enable gradual rollout of newly collected table statistics. It takes a duration string as the value. When set to a non-zero duration, the new statistics remain in a "canary" state for the specified duration before being promoted to stable. This allows for controlled exposure and intervention opportunities before statistics are fully deployed across all queries.

----

### Cluster setting `sql.stats.canary_fraction` (float in [0, 1])

```sql
SET CLUSTER SETTING sql.stats.canary_fraction = 0.2
```

This setting (`canaryFraction`) controls the probabilistic sampling rate for queries participating in the canary statistics rollout feature. It determines what fraction of queries will use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). For example, a value of 0.2 means 20% of queries will test canary stats while 80% use stable stats.

The selection is atomic per query: if a query is chosen for canary evaluation, it will use canary statistics for ALL tables it references (where available). A query never uses a mix of canary and stable statistics.
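The per-query selection described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `useCanaryStats` and `rollCanary` are hypothetical names, and the sketch assumes the decision is a single uniform draw compared against `sql.stats.canary_fraction`.

```go
package main

import "math/rand"

// useCanaryStats is a hypothetical helper: it decides, once per query, whether
// planning uses canary stats. draw is a uniform sample in [0, 1) and fraction
// mirrors sql.stats.canary_fraction. The single result applies to every table
// the query references, so a query never mixes canary and stable stats.
func useCanaryStats(fraction, draw float64) bool {
	return draw < fraction
}

// rollCanary performs the per-query draw with the supplied generator
// (assumption: one draw per query execution).
func rollCanary(fraction float64, rng *rand.Rand) bool {
	return useCanaryStats(fraction, rng.Float64())
}
```

Because `rng.Float64()` returns a value in [0, 1), a fraction of 0 never selects the canary path and a fraction of 1 always does.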
Since this "dice roll" happens for every non-internal query, the memo would otherwise flip frequently, negating the benefits of the query plan cache and causing performance regressions. To mitigate this, queries selected for the canary path bypass the query plan cache entirely: they neither look up existing cached memos nor invalidate them. Instead, we create a one-time memo used only for that single query execution. This approach assumes `sql.stats.canary_fraction` will be set to a small value, ensuring that canary queries remain a small fraction of total queries and minimizing the performance impact of recomputation.

One exception: we don't roll the dice when preparing a statement. During statement preparation, `UseCanaryStats` is always false, so the memo cache remains enabled. The rule of thumb is: a cached memo, whether in the query cache or in a prepared statement, is always built with stable stats.

### Session variable `canary_stats_mode` (enum: {auto, off, on})

- `on`: All queries in the session use canary stats for planning
- `off`: All queries in the session use stable stats for planning
- `auto`: The system decides based on `sql.stats.canary_fraction` for each query execution

Release note (sql change): We introduce two new settings to control the use of canary statistics in query planning:

1. Cluster setting `sql.stats.canary_fraction` (float, range [0, 1]): Controls what fraction of queries use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). For example, a value of 0.2 means 20% of queries will use canary stats while 80% use stable stats. The selection is atomic per query: if a query is chosen for canary evaluation, it uses canary statistics for ALL tables it references (where available), and it won't use the query cache. A query never uses a mix of canary and stable statistics.
2. Session variable `canary_stats_mode` (enum: {auto, off, on}, default: auto):
   - `on`: All queries in the session use canary stats for planning
   - `off`: All queries in the session use stable stats for planning
   - `auto`: The system decides based on `sql.stats.canary_fraction` for each query execution

157146: db-console: add metrics workspace to debug page r=xinhaoz a=xinhaoz

This debug page is similar to `Custom Time Series` but allows for exporting and loading of custom time series dashboards.

Epic: none

Release note: None

157862: decommission: retry on errors for AllocatorCheckRange r=wenyihu6 a=wenyihu6

Fixes: #156849

Release note: decommission pre-check may have failed on transient errors; this is now fixed with a retry loop.

---

**decommission: retry on errors for AllocatorCheckRange**

Previously, the decommission pre-check would fail for a range if evalStore.AllocatorCheckRange returned an error. However, transient errors, such as throttled stores, are only expected to last about 5 seconds (FailedReservationsTimeout) and can cause the pre-check to fail. This commit adds a retry loop around AllocatorCheckRange to retry on any errors. Alternatively, we could check for throttling errors specifically and retry only on throttled stores, but that would require string or error comparisons, which complicates the code. So we retry on all errors here, given this only affects the decommission pre-check.

---

**kv: add TestDecommissionPreCheckRetryThrottledStores**

Previously, we made decommission prechecks retry on errors, since some transient issues resolve quickly and shouldn’t cause the precheck to fail. This commit adds a test that verifies the precheck retries when it encounters transient throttled errors.
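For readers unfamiliar with the pattern, the retry-on-any-error loop described for the decommission pre-check might look something like the sketch below. It is illustrative only: `retryCheck`, `errThrottled`, the attempt count, and the fixed backoff are assumptions, not the actual retry options used around `AllocatorCheckRange`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errThrottled is a hypothetical stand-in for a transient allocator error,
// such as a throttled store.
var errThrottled = errors.New("store throttled")

// retryCheck calls check up to maxAttempts times and retries on any error,
// sleeping backoff between failed attempts. It does not inspect the error
// type, matching the "retry on all errors" choice in the commit message.
func retryCheck(maxAttempts int, backoff time.Duration, check func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = check(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return fmt.Errorf("check failed after %d attempts: %w", maxAttempts, err)
}
```

Retrying on every error keeps the loop free of string or error comparisons, at the cost of also retrying failures that will never succeed; the attempt bound keeps that cost small.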
157927: roachtest: link on `large` pool r=rail a=rickystewart

Release note: none

Epic: none

Co-authored-by: ZhouXing19 <zhouxing@uchicago.edu>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: wenyihu6 <wenyi@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
…indow

This commit implements the core logic for canary statistics rollout, allowing gradual deployment of newly collected full statistics. Previously, all queries would immediately use the most recent full statistics, which could cause performance regressions if the new full statistics were inaccurate.

The implementation adds a `CanaryWindowSize` field in table descriptors and catalog interfaces to define the canary period, along with logic in the statistics builder to skip "canary" statistics (the latest stats within the canary window) when not using the canary path. The cluster setting `sql.stats.canary_fraction` controls what percentage of queries use canary statistics.

Release note (sql change): implement canary full statistics rollout core logic, which is configurable via the table-level storage parameter (`canary_window`) and the cluster setting `sql.stats.canary_fraction`.
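As a rough illustration of the skip logic described in this commit message, the sketch below picks a full statistic from a newest-first list. The types and names (`tableStat`, `selectFullStat`, `exampleSelection`) are hypothetical, and the sketch simplifies by skipping every statistic inside the window on the stable path rather than modeling the real statistics builder.

```go
package main

import "time"

// tableStat is a hypothetical, simplified stand-in for a full table statistic.
type tableStat struct {
	Name      string
	CreatedAt time.Time
}

// selectFullStat picks the statistic to plan with. stats must be sorted
// newest first. On the stable path, statistics still inside the canary
// window (created less than window ago) are skipped; on the canary path
// the newest statistic is used. A zero window disables the rollout.
func selectFullStat(stats []tableStat, window time.Duration, now time.Time, useCanary bool) (tableStat, bool) {
	for _, s := range stats {
		inWindow := window > 0 && now.Sub(s.CreatedAt) < window
		if inWindow && !useCanary {
			continue // canary stat: not visible to stable queries yet
		}
		return s, true
	}
	return tableStat{}, false
}

// exampleSelection shows both paths for a table with a 20s canary window.
func exampleSelection() (stable, canary string) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	stats := []tableStat{
		{Name: "new", CreatedAt: now.Add(-5 * time.Second)},
		{Name: "old", CreatedAt: now.Add(-time.Hour)},
	}
	s, _ := selectFullStat(stats, 20*time.Second, now, false)
	c, _ := selectFullStat(stats, 20*time.Second, now, true)
	return s.Name, c.Name
}
```

In `exampleSelection`, the stable path plans with the hour-old statistic while the canary path plans with the five-second-old one.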
… selection

This commit adds a new session variable `stats_as_of` that allows controlling statistics selection based on a specific timestamp rather than the current time. Previously, statistics selection was always relative to the current wall clock time, making it difficult to get consistent query plans for historical analysis or testing. This feature is only for debugging and troubleshooting, and should not be used in production.

The implementation is also integrated into the existing canary statistics logic to respect the as-of timestamp when determining canary window boundaries.

Release note (sql change): adds a new session variable `stats_as_of` that allows controlling statistics selection based on a specific timestamp rather than the current time.
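One way to picture how an as-of timestamp could feed into the canary window check is sketched below. The helper names (`referenceTime`, `inCanaryWindow`, `exampleAsOf`) are hypothetical, and treating a statistic created after the reference time as "not a canary" is an assumption of this sketch, not something the commit message states.

```go
package main

import "time"

// referenceTime returns the timestamp that statistics are evaluated against:
// statsAsOf when it is set (non-zero), otherwise the current time.
func referenceTime(statsAsOf, now time.Time) time.Time {
	if !statsAsOf.IsZero() {
		return statsAsOf
	}
	return now
}

// inCanaryWindow reports whether a statistic created at createdAt is still
// inside the canary window relative to ref. Assumption: a statistic created
// after ref did not exist at ref, so it is not treated as a canary here.
func inCanaryWindow(createdAt, ref time.Time, window time.Duration) bool {
	if window <= 0 || createdAt.After(ref) {
		return false
	}
	return ref.Sub(createdAt) < window
}

// exampleAsOf evaluates one statistic (20s window) at two as-of timestamps.
func exampleAsOf() (early, late bool) {
	created := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	now := created.Add(time.Hour)
	window := 20 * time.Second
	early = inCanaryWindow(created, referenceTime(created.Add(10*time.Second), now), window)
	late = inCanaryWindow(created, referenceTime(created.Add(5*time.Minute), now), window)
	return early, late
}
```

With an as-of timestamp ten seconds after collection the statistic still counts as a canary; five minutes after, it no longer does, regardless of the wall clock.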
This commit adds tests for usage of canary stats rollout in makeTableStatistics(), which is the main entry point where statistics are selected for query optimization.

This commit focuses on unit testing the makeTableStatistics() function and does not include end-to-end logic tests, which would require additional changes to Builder.maybeAnnotateWithEstimates() to support EXPLAIN ANALYZE output showing which statistics were used during planning.

To enable testing, this commit adds:
- Handler in opttester for setting the canary window storage parameter
- Testing knob for controlling the canary fraction setting
- Three new test files covering basic canary stats, histogram canary stats, and multi-column canary stats scenarios

Release note: None
Informs: #150015
Rebased from #156307
Release note: TBD