HIVE-29197: Disable vectorization for multi-column COUNT(DISTINCT) #6114
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,35 @@ | ||||
| drop table if exists test_vector; | ||||
| create external table test_vector(id string, pid bigint) PARTITIONED BY (full_date int); | ||||
| insert into test_vector (pid, full_date, id) values (1, '20240305', '6150'); | ||||
|
|
||||
| -------------------------------------------------------------------------------- | ||||
| -- 1. Basic COUNT cases (valid in vectorization) | ||||
| -------------------------------------------------------------------------------- | ||||
| SELECT COUNT(pid) AS cnt_col, COUNT(*) AS cnt_star, COUNT(20240305) AS cnt_const, COUNT(DISTINCT pid) as cnt_distinct, COUNT(1) AS CNT | ||||
| FROM test_vector WHERE full_date=20240305; | ||||
| EXPLAIN VECTORIZATION EXPRESSION | ||||
| SELECT COUNT(pid) AS cnt_col, COUNT(*) AS cnt_star, COUNT(20240305) AS cnt_const,COUNT(DISTINCT pid) as cnt_distinct, COUNT(1) AS CNT | ||||
| FROM test_vector WHERE full_date=20240305; | ||||
|
|
||||
| -------------------------------------------------------------------------------- | ||||
| -- 2. COUNT with DISTINCT column + constant (INVALID in vectorization) | ||||
| -------------------------------------------------------------------------------- | ||||
| SELECT COUNT(DISTINCT pid, 20240305) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
| EXPLAIN VECTORIZATION EXPRESSION | ||||
| SELECT COUNT(DISTINCT pid, 20240305) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
|
|
||||
| -------------------------------------------------------------------------------- | ||||
| -- 3. COUNT(DISTINCT pid, full_date) (multi-col distinct → FAIL) | ||||
| -------------------------------------------------------------------------------- | ||||
| SELECT COUNT(DISTINCT pid, full_date) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
| EXPLAIN VECTORIZATION EXPRESSION | ||||
| SELECT COUNT(DISTINCT pid, full_date) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
|
|
||||
| -------------------------------------------------------------------------------- | ||||
| -- 4. COUNT(DISTINCT pid, full_date, id) (multi-col distinct → FAIL) | ||||
deniskuzZ marked this conversation as resolved.
|
||||
| -------------------------------------------------------------------------------- | ||||
| SELECT COUNT(DISTINCT pid, full_date, id) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
|
Member
Interesting that it works for you. I'm getting an exception unless I wrap the distinct columns in parentheses.
Contributor
Author
The COUNT UDAF expects DISTINCT to be specified when there is more than one parameter.
Member
try it exception
||||
| EXPLAIN VECTORIZATION EXPRESSION | ||||
| SELECT COUNT(DISTINCT pid, full_date, id) AS CNT FROM test_vector WHERE full_date=20240305; | ||||
|
|
||||
| DROP TABLE test_vector; | ||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -212,7 +212,7 @@ STAGE PLANS: | |
| enabled: true | ||
| enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true | ||
| inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | ||
| notVectorizedReason: GROUPBY operator: Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]]) | ||
| notVectorizedReason: GROUPBY operator: Unsupported COUNT DISTINCT with multiple columns: count([Column[a], Column[b]]). Hive only supports COUNT(DISTINCT col) in vectorized execution. | ||
|
Member
Was the original message not good enough?
Contributor
Author
Yes. Previously, it covered some cases, such as count(distinct col1, col2), but not cases like count(distinct col1, constant) or count(distinct col1, col2, constant).
Member
Before, we supported multi-column aggregations with constant expressions, and now we don't? At least that's what the message was saying. I don't get why we are changing the message. If the issue was related to a filter on a partition column, it shouldn't change the behavior for non-partitioned tables.
Member
cc @asolimando
Member
vectorized: true
Contributor
Author
@deniskuzZ The message was not changed for other cases. I added a new message for the COUNT UDF with more than one parameter. Now partitioned and non-partitioned tables will have the same behavior.
Member
Works fine with a partitioned table as well.
Member
Is the expectation that count(distinct pid, full_date) == count(distinct(pid, full_date))?
||
| vectorized: false | ||
| Reducer 2 | ||
| Execution mode: llap | ||
|
|
||
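For readers following the discussion of multi-column COUNT(DISTINCT), here is a minimal sketch of what the multi-column form is generally documented to compute in Hive: the number of distinct combinations of the listed expressions, skipping rows where any of them is NULL. The second query is an assumed-equivalent rewrite using a subquery. It is an illustration only and is not part of this PR, and whether either form is vectorized should be checked with EXPLAIN VECTORIZATION on your own build. Both queries reuse the `test_vector` table from the test file above.

```sql
-- Multi-column form from the test file: counts distinct (pid, full_date)
-- pairs among rows where neither value is NULL.
SELECT COUNT(DISTINCT pid, full_date) AS cnt
FROM test_vector
WHERE full_date = 20240305;

-- Assumed-equivalent rewrite (illustrative, not taken from the PR):
-- deduplicate the pairs in a subquery, filter out NULLs explicitly,
-- then count the remaining rows.
SELECT COUNT(*) AS cnt
FROM (
  SELECT DISTINCT pid, full_date
  FROM test_vector
  WHERE full_date = 20240305
    AND pid IS NOT NULL
    AND full_date IS NOT NULL
) t;
```

Comparing the results of the two queries on the same data is one way to sanity-check the semantics being debated in the review comments.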