HIVE-29464: Rethink MapWork.aliasToPartnInfo - add getDistinctTableDescs() for callers that only need TableDesc objects (#6344)
Conversation
…unique TableDesc objects without iterating partitions
nice work so far, thanks @hemanthumashankar0511 for taking care of this!
Force-pushed from 48b3e36 to 717d4b4.
…w aliasToPartnInfo exposure
Force-pushed from 717d4b4 to f97641b.
aliasToPartnInfo.put(alias, partitionDesc);
}

public void putAllPartitionDescs(Map<String, PartitionDesc> partitionDescs) {
Fortunately, we don't need this method; it's not used at all.
*/
public Map<String, PartitionDesc> getAliasToPartnInfo() {
return aliasToPartnInfo;
public Collection<PartitionDesc> getPartitionDescs() {
The collection returned by this method is mainly used for iteration: is it possible to return an Iterator instead of copying the whole collection? Copying can be costly, and we can never know how heavily this will be used, now or in the future.
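One way to honor that suggestion without copying is to hand out an unmodifiable live view of the map's values. This is a sketch with hypothetical stand-in classes (`PartitionDesc` here is a placeholder for Hive's real class), not the actual Hive code:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Hive's PartitionDesc.
class PartitionDesc {}

class MapWorkSketch {
    private final Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<>();

    void putPartitionDesc(String alias, PartitionDesc desc) {
        aliasToPartnInfo.put(alias, desc);
    }

    // Instead of copying, expose an unmodifiable live view: O(1) to create,
    // callers can still iterate, but they cannot mutate the underlying map.
    Collection<PartitionDesc> getPartitionDescs() {
        return Collections.unmodifiableCollection(aliasToPartnInfo.values());
    }
}

public class ViewVsCopyDemo {
    public static void main(String[] args) {
        MapWorkSketch work = new MapWorkSketch();
        work.putPartitionDesc("a", new PartitionDesc());
        Collection<PartitionDesc> view = work.getPartitionDescs();
        // The view tracks later insertions without any re-copying.
        work.putPartitionDesc("b", new PartitionDesc());
        System.out.println(view.size()); // prints 2
    }
}
```

The trade-off: a live view reflects later mutations of the map, which is usually what iterating callers want, but callers that need a snapshot must copy explicitly.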
return;
}
if (aliasToPartnInfo == null) {
aliasToPartnInfo = new LinkedHashMap<>();
Can we rely on an always-initialized instance, like:

private Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<String, PartitionDesc>();

This ensures that we always have an instance and don't need the extra null checks.
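A sketch of that suggestion (a hypothetical stand-in class, not the actual MapWork code): with the field initialized eagerly at declaration, the accessors need none of the null checks flagged in the review comments below.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Hive's PartitionDesc.
class PartitionDesc {}

public class EagerInitSketch {
    // Initialized eagerly, so the field is never null.
    private final Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<>();

    public void putPartitionDesc(String alias, PartitionDesc desc) {
        aliasToPartnInfo.put(alias, desc);   // no lazy-init branch needed
    }

    public boolean hasPartitionDesc(String alias) {
        return aliasToPartnInfo.containsKey(alias);   // no null check needed
    }

    public int getPartitionCount() {
        return aliasToPartnInfo.size();               // no null check needed
    }

    public static void main(String[] args) {
        EagerInitSketch work = new EagerInitSketch();
        System.out.println(work.getPartitionCount()); // prints 0 before any put
        work.putPartitionDesc("a", new PartitionDesc());
        System.out.println(work.hasPartitionDesc("a")); // prints true
    }
}
```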
}

public void removeAlias(String alias) {
if (aliasToPartnInfo != null) {
maybe remove null-check
}

public void putPartitionDesc(String alias, PartitionDesc partitionDesc) {
if (aliasToPartnInfo == null) {
maybe remove null-check
}

public boolean hasPartitionDesc(String alias) {
return aliasToPartnInfo != null && aliasToPartnInfo.containsKey(alias);
maybe remove null-check
}

public int getPartitionCount() {
return aliasToPartnInfo == null ? 0 : aliasToPartnInfo.size();
maybe remove null-check
LinkedHashMap<String, PartitionDesc> aliasToPartnInfo) {
this.aliasToPartnInfo = aliasToPartnInfo;
public PartitionDesc getPartitionDesc(String alias) {
return aliasToPartnInfo == null ? null : aliasToPartnInfo.get(alias);
maybe remove null-check
What changes were proposed in this pull request?
Added a new method `getDistinctTableDescs()` in `MapWork` that returns the unique `TableDesc` objects used by the map task, and updated `configureJobConf` to use it.

Before this change, the deduplication logic sat inside `configureJobConf`. After this change, that logic lives in `getDistinctTableDescs()`, and `configureJobConf` just calls it cleanly.

Why are the changes needed?
Callers like `KafkaDagCredentialSupplier` that only care about tables are currently forced to loop through all partitions in `aliasToPartnInfo` just to get the `TableDesc` objects. A table can have thousands of partitions but only one `TableDesc`, so everyone ends up writing the same boilerplate deduplication loop.
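As a rough sketch of the deduplication idea the PR centralizes (hypothetical stand-in classes below; the real `TableDesc` and `PartitionDesc` live in Hive's plan package and carry far more state):

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for Hive's descriptor classes.
class TableDesc {
    private final String tableName;
    TableDesc(String tableName) { this.tableName = tableName; }
    String getTableName() { return tableName; }
}

class PartitionDesc {
    private final TableDesc tableDesc;
    PartitionDesc(TableDesc tableDesc) { this.tableDesc = tableDesc; }
    TableDesc getTableDesc() { return tableDesc; }
}

public class DistinctTableDescsDemo {
    static final Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<>();

    // Sketch of the idea behind getDistinctTableDescs(): walk the partitions
    // once and keep each TableDesc only the first time it is seen
    // (LinkedHashSet preserves encounter order).
    static Collection<TableDesc> getDistinctTableDescs() {
        Set<TableDesc> distinct = new LinkedHashSet<>();
        for (PartitionDesc pd : aliasToPartnInfo.values()) {
            distinct.add(pd.getTableDesc());
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Self-join: two aliases share the same TableDesc instance,
        // so only one distinct table comes back.
        TableDesc t = new TableDesc("default.src");
        aliasToPartnInfo.put("a", new PartitionDesc(t));
        aliasToPartnInfo.put("b", new PartitionDesc(t));
        System.out.println(getDistinctTableDescs().size()); // prints 1
    }
}
```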
MapWorkwithout reinventing the wheel every time.Does this PR introduce any user-facing change?
No.
How was this patch tested?
I tested this locally by attaching a debugger to the test run and checking two scenarios:
Self-join — I wanted to make sure deduplication wouldn't accidentally skip anything:
Confirmed that both aliases point to the exact same `TableDesc` instance in memory, so the table only gets configured once, as expected.

Cross-database join — I wanted to make sure tables with the same name from different databases don't collide:
Confirmed that `getTableName()` returns fully qualified names like `db1.test_cross` and `db2.test_cross` as distinct strings, so both tables get configured correctly.
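The cross-database scenario boils down to deduplicating by fully qualified name: as long as names include the database prefix, a set keeps same-named tables from different databases distinct. A minimal sketch (the names below are the test tables mentioned above):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class QualifiedNameDemo {
    public static void main(String[] args) {
        // Fully qualified names keep db1.test_cross and db2.test_cross
        // distinct, while a genuine duplicate collapses into one entry.
        Set<String> names = new LinkedHashSet<>();
        names.add("db1.test_cross");
        names.add("db2.test_cross");
        names.add("db1.test_cross"); // duplicate, ignored by the Set
        System.out.println(names.size()); // prints 2
    }
}
```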