fix: make the dedup operator cover all column types#80
Open
liulx20 wants to merge 3 commits intoalibaba:mainfrom
Open
fix: make the dedup operator cover all column types#80liulx20 wants to merge 3 commits intoalibaba:mainfrom
liulx20 wants to merge 3 commits intoalibaba:mainfrom
Conversation
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #82
Greptile Summary
This PR extends the
DEDUPoperator to cover all column types by changinggenerate_dedup_offsetfrom avoidmethod (that terminated withLOG(FATAL)for unsupported types) to aboolmethod that returnsfalseto signal "use the generic hash-based fallback." It also adds the previously-missingMSVertexColumn::generate_dedup_offsetimplementation and removes the row_num parameter fromColumnsUtils::generate_dedup_offset.Key changes:
generate_dedup_offsetnow returnsbool;falsemeans "fall back to hash-based dedup" indedup.ccMSVertexColumngains a correctgenerate_dedup_offsetwith anull_seenguardArrowArrayContextColumn's entire type-dispatched dedup implementation is removed; it now silently falls back through the base-class default, which emitsLOG(ERROR)on every dedup — this will pollute production logs with false error messages for a valid code pathLOG(FATAL)→LOG(ERROR)acrossMSEdgeColumn,ListColumn,StructColumn, and the base class allows graceful fallback, butLOG(ERROR)is still too severe for an expected, handled code path;LOG(WARNING)or lower would be more appropriateColumnsUtils::generate_dedup_offsethelper has a harmless but redundantresizecall after the constructor already sets the correct sizeifbody indedup.cc(line 37–38) is valid C++ but makes the intent hard to read at a glanceConfidence Score: 3/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Dedup called] --> B{cols empty?} B -- yes --> C[Return ctx unchanged] B -- no --> D{Single column?} D -- yes --> E[Call generate_dedup_offset] E --> F{Returns true?} F -- yes --> G[Fast path: offsets populated] F -- "no - LOG ERROR fired" --> H[Clear offsets, use hash fallback] D -- "no, multi-column" --> H H --> I[Hash fallback: encode each row via get_elem] I --> J[offsets = unique row indices] G --> K[Build result Context] J --> K K --> L[reshuffle offsets] L --> M[Return ret] subgraph FastPathTypes["Fast-path column types"] FP1[SLVertexColumn - bitset] FP2[MSVertexColumn - set plus null flag - NEW] FP3[MLVertexColumn - set] FP4[PathColumn - sort dedup vector] FP5[ValueColumn - sort or set] FP6[SDSLEdgeColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn] end subgraph FallbackTypes["No fast path - hash fallback"] FB1[ArrowArrayContextColumn - base class LOG ERROR] FB2[MSEdgeColumn - LOG ERROR] FB3[ListColumn - LOG ERROR] FB4[StructColumn - LOG ERROR] endComments Outside Diff (1)
src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)generate_dedup_offsetfails mid-wayThe new control flow silently falls through to the slow-path
elsebranch whenevergenerate_dedup_offsetreturnsfalse. If a futuregenerate_dedup_offsetimplementation partially populatesoffsetsbefore returningfalse(e.g., due to an error mid-loop), theelsebranch would append additional entries onto the already-partially-populated vector, producing a corrupted, non-deduplicated result.All current
false-returning implementations happen to leaveoffsetsuntouched, so this is not a bug today. But the contract is nowhere documented: there is nooffsets.clear()guard at the top of theelsebranch, and no documented requirement that callers ofgenerate_dedup_offsetleaveoffsetsunchanged on failure.Consider adding a defensive
offsets.clear()at the start of theelsebranch, or documenting the contract thatgenerate_dedup_offsetmust not modifyoffsetson failure:src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)offsets.clear()in fallback branchWhen
cols.size() == 1andgenerate_dedup_offsetreturnsfalse, the code falls into theelsebranch. Currently all false-returning implementations leaveoffsetsuntouched (e.g.MSEdgeColumn,ListColumn,StructColumn), so theelseblock sees an empty vector and fills it correctly. However, theelseblock never explicitly clearsoffsetsbefore appending to it.If any future
generate_dedup_offsetimplementation partially populatesoffsetsbefore discovering a failure and returningfalse(which is perfectly reasonable), the stale entries would be mixed with the fresh ones produced by theelseblock, causing duplicate or incorrect result rows.The fix is to add
offsets.clear()at the start of theelseblock:include/neug/execution/common/columns/columns_utils.h, line 43 (link)row_num == 0offsets.push_back(row_indices[0])is called unconditionally. Ifrow_numis zero,row_indicesis empty androw_indices[0]is undefined behaviour. Although this is pre-existing code, this PR adds several new callers that now returntrueand reach this path (PathColumn,BDSLEdgeColumn,SDMLEdgeColumn,BDMLEdgeColumn), widening the exposure. A guard is needed:Last reviewed commit: "fix"