Contributor

@gongxun0928 gongxun0928 commented Oct 16, 2025

Previously, statistics (min-max, sum, count, etc.) were computed synchronously during data insertion, and that computation significantly slowed down writes.

This change introduces an asynchronous approach to maintain statistics:

  • Add a GUC parameter, pax.enable_sync_collect_stats, to control statistics collection during writes (disabled by default)
  • Skip statistics computation during INSERT to ensure fast writes
  • Update statistics asynchronously during VACUUM on PAX tables by scanning file metadata
  • Re-read files and refresh statistics only when metadata indicates they are stale
  • Implement VACUUM for PAX tables: find the data files marked for deletion, scan them for valid tuples, and rewrite those tuples to a new data file
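
The steps above can be sketched as a small in-memory model. This is only an illustration of the described flow, not the PAX implementation: `DataFile`, `ColumnStats`, `compute_stats`, and `vacuum` are hypothetical names, a file holds a single integer column, and `stats is None` stands in for "metadata indicates the statistics are stale".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ColumnStats:
    lo: int      # min
    hi: int      # max
    total: int   # sum
    count: int


@dataclass
class DataFile:  # hypothetical stand-in for a PAX data file plus its metadata
    name: str
    rows: List[int]                                   # one int column, for brevity
    deleted: Set[int] = field(default_factory=set)    # offsets hidden by UPDATE/DELETE
    marked_for_deletion: bool = False
    stats: Optional[ColumnStats] = None               # None models "stale or missing"


def compute_stats(values: List[int]) -> ColumnStats:
    return ColumnStats(min(values), max(values), sum(values), len(values))


def vacuum(files: List[DataFile]) -> List[DataFile]:
    """Refresh stale stats and rewrite files marked for deletion."""
    survivors: List[int] = []   # valid tuples taken from files marked for deletion
    kept: List[DataFile] = []
    for f in files:
        if f.marked_for_deletion:
            # Keep only tuples that are still valid; the old file is dropped.
            survivors.extend(v for i, v in enumerate(f.rows) if i not in f.deleted)
        else:
            # Re-read the file only when its stats are stale.
            if f.stats is None and f.rows:
                f.stats = compute_stats(f.rows)
            kept.append(f)
    if survivors:
        # Several marked files may be consolidated into one new file.
        kept.append(DataFile("rewritten", survivors, stats=compute_stats(survivors)))
    return kept


files = [
    DataFile("a", [1, 2, 3]),                                        # stats never collected
    DataFile("b", [10, 20, 30], deleted={1}, marked_for_deletion=True),
    DataFile("c", [7, 8], deleted={0, 1}, marked_for_deletion=True),
]
for f in vacuum(files):
    print(f.name, f.rows, f.stats)
```

In this model INSERT never touches `stats`, which is the point of the change: the cost of computing statistics moves from the write path to VACUUM.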

Many VACUUM-related test results change in this PR. The main reason is that, after data has been updated or deleted, VACUUM now rewrites the files marked for deletion, possibly consolidating several of them in the process.

Note that this PR does not implement VACUUM FULL.

```
create table t1(c1 int, c2 int, c3 int, c4 int, c5 int, c6 int) using pax with(minmax_columns='c1,c2,c3,c4,c5,c6');
set pax.enable_sync_collect_stats to on; -- collect stats synchronously
insert into t1 select i,i,i,i,i,i from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 2733.731 ms (00:02.734)

create table t2(c1 int, c2 int, c3 int, c4 int, c5 int, c6 int) using pax;
insert into t2 select i,i,i,i,i,i from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 1816.836 ms (00:01.817)
```
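
To put the two timings above side by side, the relative difference can be worked out directly. This is a rough reading only: each figure is a single run, and the two tables differ in both `minmax_columns` and the GUC setting, so the numbers indicate the order of magnitude rather than a controlled benchmark.

```python
sync_ms = 2733.731    # t1: minmax_columns set, pax.enable_sync_collect_stats = on
plain_ms = 1816.836   # t2: no minmax_columns, GUC left at its default

speedup = sync_ms / plain_ms          # how many times faster the second insert is
saved = 1 - plain_ms / sync_ms        # fraction of wall time avoided

print(f"{speedup:.2f}x faster, {saved:.1%} less wall time")
# prints "1.50x faster, 33.5% less wall time"
```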

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@gongxun0928 gongxun0928 force-pushed the pax/collect-columns-stats-in-bgworker branch 3 times, most recently from aaaecde to 382bc9a Compare October 17, 2025 05:51
@gongxun0928 gongxun0928 force-pushed the pax/collect-columns-stats-in-bgworker branch 5 times, most recently from 9b508a6 to 1237fd6 Compare November 1, 2025 18:09
Previously, statistics (min-max, sum, count, etc.) were computed synchronously
during data insertion, causing significant slowdowns due to heavy computational
overhead.

This change introduces an asynchronous approach to maintain statistics:
- Add a GUC parameter to control statistics collection during writes (disabled by default)
- Skip statistics computation during INSERT to ensure fast writes
- Update statistics asynchronously during VACUUM on PAX tables by scanning file metadata
- Re-read files and refresh statistics only when metadata indicates they are stale
- Vacuum data files that have been marked for deletion
```
create table t1(c1 int, c2 int, c3 int, c4 int, c5 int, c6 int) using pax with(minmax_columns='c1,c2,c3,c4,c5,c6');
set pax.enable_sync_collect_stats to on; -- collect stats synchronously
insert into t1 select i,i,i,i,i,i from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 2733.731 ms (00:02.734)

create table t2(c1 int, c2 int, c3 int, c4 int, c5 int, c6 int) using pax;
insert into t2 select i,i,i,i,i,i from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 1816.836 ms (00:01.817)

```
@gongxun0928 gongxun0928 force-pushed the pax/collect-columns-stats-in-bgworker branch from 1237fd6 to cd81591 Compare November 2, 2025 04:04