Skip to content

HIVE-29464: Rethink MapWork.aliasToPartnInfo - add getDistinctTableDescs() for callers that only need TableDesc objects#6344

Draft
hemanthumashankar0511 wants to merge 2 commits intoapache:masterfrom
hemanthumashankar0511:mapwork-get-tabledescs
Draft

HIVE-29464: Rethink MapWork.aliasToPartnInfo - add getDistinctTableDescs() for callers that only need TableDesc objects#6344
hemanthumashankar0511 wants to merge 2 commits intoapache:masterfrom
hemanthumashankar0511:mapwork-get-tabledescs

Conversation

@hemanthumashankar0511
Copy link
Contributor

What changes were proposed in this pull request?

Added a new method getDistinctTableDescs() in MapWork that returns the unique TableDesc objects used by the map task, and updated configureJobConf to use it.

Before this change, the deduplication logic was sitting inside configureJobConf:

Set<String> processedTables = new HashSet<>();
for (PartitionDesc partition : aliasToPartnInfo.values()) {
    TableDesc tableDesc = partition.getTableDesc();
    if (tableDesc != null && !processedTables.contains(tableDesc.getTableName())) {
        processedTables.add(tableDesc.getTableName());
        PlanUtils.configureJobConf(tableDesc, job);
    }
}

After this change, that logic lives in getDistinctTableDescs() and configureJobConf just calls it cleanly:

for (TableDesc tableDesc : getDistinctTableDescs()) {
    PlanUtils.configureJobConf(tableDesc, job);
}

Why are the changes needed?

Callers like KafkaDagCredentialSupplier that only care about tables are currently forced to loop through all partitions in aliasToPartnInfo just to get the TableDesc objects. A table can have thousands of partitions but only one TableDesc, so everyone ends up writing the same boilerplate deduplication loop.

This method gives callers a clean way to get unique tables directly from MapWork without reinventing the wheel every time.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I tested this locally by attaching a debugger to the test run and checking two scenarios:

Self-join — I wanted to make sure deduplication wouldn't accidentally skip anything:

SELECT * FROM test t1 JOIN test t2 USING(a);

Confirmed that both aliases point to the exact same TableDesc instance in memory, so the table only gets configured once as expected.

Cross-database join — I wanted to make sure tables with the same name from different databases don't collide:

SELECT * FROM db1.test_cross t1 JOIN db2.test_cross t2 USING(a);

Confirmed that getTableName() returns fully qualified names like db1.test_cross and db2.test_cross as distinct strings, so both tables get configured correctly.

…unique TableDesc objects without iterating partitions
@hemanthumashankar0511 hemanthumashankar0511 marked this pull request as ready for review March 3, 2026 05:19
@abstractdog
Copy link
Contributor

abstractdog commented Mar 3, 2026

nice work so far, thanks @hemanthumashankar0511 for taking care of this!
I believe one of the main concerns of this ticket was the getAliasToPartnInfo() method, which simply returns a mutable collection, leaving us totally unsure where this collection is actually touched, so an ideal solutio removes this method altogether or at least attempts to limit the usage of it, can you please take care of that also?

aliasToPartnInfo.put(alias, partitionDesc);
}

public void putAllPartitionDescs(Map<String, PartitionDesc> partitionDescs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fortunately, we don't need this method, not used at all

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 4, 2026

*/
public Map<String, PartitionDesc> getAliasToPartnInfo() {
return aliasToPartnInfo;
public Collection<PartitionDesc> getPartitionDescs() {
Copy link
Contributor

@abstractdog abstractdog Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this collection returned by this method is mainly used for iterating: is it possible to return an Iterator instead of copying the whole collection? unfortunately, copying it can be costly, and we could never now how heavily use that now or in the future?

return;
}
if (aliasToPartnInfo == null) {
aliasToPartnInfo = new LinkedHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rely on the current instance, like:

  private Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<String, PartitionDesc>();

this ensures that we have an instance and don't need the extra null checks

}

public void removeAlias(String alias) {
if (aliasToPartnInfo != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public void putPartitionDesc(String alias, PartitionDesc partitionDesc) {
if (aliasToPartnInfo == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public boolean hasPartitionDesc(String alias) {
return aliasToPartnInfo != null && aliasToPartnInfo.containsKey(alias);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public int getPartitionCount() {
return aliasToPartnInfo == null ? 0 : aliasToPartnInfo.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

LinkedHashMap<String, PartitionDesc> aliasToPartnInfo) {
this.aliasToPartnInfo = aliasToPartnInfo;
public PartitionDesc getPartitionDesc(String alias) {
return aliasToPartnInfo == null ? null : aliasToPartnInfo.get(alias);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants