
fix: make the dedup operator cover all column types#80

Open
liulx20 wants to merge 3 commits into alibaba:main from liulx20:dedup

Conversation

@liulx20
Collaborator

@liulx20 liulx20 commented Mar 18, 2026

Fixes #82

Greptile Summary

This PR extends the DEDUP operator to cover all column types by changing generate_dedup_offset from a void method (that terminated with LOG(FATAL) for unsupported types) to a bool method that returns false to signal "use the generic hash-based fallback." It also adds the previously-missing MSVertexColumn::generate_dedup_offset implementation and removes the row_num parameter from ColumnsUtils::generate_dedup_offset.

Key changes:

  • generate_dedup_offset now returns bool; false means "fall back to hash-based dedup" in dedup.cc
  • MSVertexColumn gains a correct generate_dedup_offset with a null_seen guard
  • ArrowArrayContextColumn's entire type-dispatched dedup implementation is removed; it now falls through to the base-class default, which emits LOG(ERROR) and returns false on every dedup. The result is still correct, but production logs will fill with spurious error messages for a valid code path
  • LOG(FATAL) → LOG(ERROR) across MSEdgeColumn, ListColumn, StructColumn, and the base class allows graceful fallback, but LOG(ERROR) is still too severe for an expected, handled code path; LOG(WARNING) or lower would be more appropriate
  • The ColumnsUtils::generate_dedup_offset helper has a harmless but redundant resize call after the constructor already sets the correct size
  • The empty if body in dedup.cc (lines 37–38) is valid C++ but makes the intent hard to read at a glance
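The bool-returning contract described in the key changes can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `Column`, `StringColumn`, `encode`, and `dedup_offsets` are stand-ins invented for illustration and do not mirror the project's real interfaces. The sketch only assumes what the summary states, namely that `false` means "no fast path, use the generic hash-based fallback".

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a context column (not the project's real class).
struct Column {
  virtual ~Column() = default;
  virtual std::size_t size() const = 0;
  // Assumed helper: encodes row i as a hashable key for the generic path.
  virtual std::string encode(std::size_t i) const = 0;
  // Returns true only if a type-specific fast path filled `offsets`.
  // The default reports "no fast path" and leaves `offsets` untouched.
  virtual bool generate_dedup_offset(std::vector<std::size_t>& /*offsets*/) const {
    return false;
  }
};

// A column type without a fast path: it inherits the default above.
struct StringColumn : Column {
  std::vector<std::string> data;
  explicit StringColumn(std::vector<std::string> d) : data(std::move(d)) {}
  std::size_t size() const override { return data.size(); }
  std::string encode(std::size_t i) const override { return data[i]; }
};

// Tries the fast path first; on false, runs the hash-based fallback.
std::vector<std::size_t> dedup_offsets(const Column& col) {
  std::vector<std::size_t> offsets;
  if (col.generate_dedup_offset(offsets)) {
    return offsets;  // fast path populated the offsets
  }
  offsets.clear();  // defensive: discard anything a failed fast path left behind
  std::unordered_set<std::string> seen;
  for (std::size_t i = 0; i < col.size(); ++i) {
    // insert().second is true only for the first occurrence of a key
    if (seen.insert(col.encode(i)).second) {
      offsets.push_back(i);
    }
  }
  return offsets;
}
```

Written this way, the `if` body is never empty, and the fallback does not depend on what a failing fast path did to `offsets`.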

Confidence Score: 3/5

  • Dedup is functionally correct for all column types but the removal of ArrowArrayContextColumn's fast path introduces spurious LOG(ERROR) noise on valid operations, and LOG(ERROR) across other fallback types is misleading throughout.
  • The core logic is sound — the bool-returning generate_dedup_offset pattern works correctly, and MSVertexColumn's new implementation properly handles nulls. However, removing ArrowArrayContextColumn's dedicated implementation without suppressing the base-class LOG(ERROR) will pollute production logs on every Arrow-column dedup. Additionally, LOG(ERROR) is used where a warning or debug log would be appropriate for an expected, handled code path, making genuine errors harder to distinguish in logs.
  • src/execution/common/columns/arrow_context_column.cc (removed fast-path, now logs ERROR on valid ops), include/neug/execution/common/columns/i_context_column.h and all overrides using LOG(ERROR) for graceful fallback paths

Important Files Changed

Filename | Overview
--- | ---
src/execution/common/operators/retrieve/dedup.cc | Refactors dedup to use a boolean return value pattern: the single-column fast path tries generate_dedup_offset and falls back to hash-based dedup on false; has an empty if-body style issue and the fallback silently succeeds after logging a spurious error for Arrow columns.
include/neug/execution/common/columns/columns_utils.h | Removes the now-redundant row_num parameter; has a harmless but redundant second resize call after the constructor already sets the correct size.
src/execution/common/columns/arrow_context_column.cc | Removes the entire Arrow-type-dispatched generate_dedup_offset implementation; now falls back to the base class, which logs LOG(ERROR) and returns false, triggering the hash-based fallback. The result is correct, but a spurious error is logged on every ArrowArrayContextColumn dedup.
include/neug/execution/common/columns/i_context_column.h | Base generate_dedup_offset changed from void/LOG(FATAL) to bool/LOG(ERROR)/return false, enabling graceful fallback; LOG(ERROR) still fires even when the caller handles the false return correctly.
src/execution/common/columns/vertex_columns.cc | Adds MSVertexColumn::generate_dedup_offset with correct null handling (tracks a null_seen flag) and set-based dedup; updates the return type to bool for all vertex column implementations.
include/neug/execution/common/columns/edge_columns.h | Updates generate_dedup_offset return type to bool; MSEdgeColumn changed from LOG(FATAL) to LOG(ERROR)/return false to enable graceful fallback.
include/neug/execution/common/columns/list_columns.h | ListColumn::generate_dedup_offset changed from LOG(FATAL) to LOG(ERROR)/return false; falls back to hash-based dedup in the caller.
src/execution/common/columns/struct_columns.cc | StructColumn::generate_dedup_offset changed from LOG(FATAL) to LOG(ERROR)/return false; falls back to hash-based dedup in the caller.
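The set-based dedup with a null_seen flag attributed to MSVertexColumn can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the `Vertex` alias, the use of `std::optional` to model null rows, and the function name `dedup_with_null` are all hypothetical, and keeping the first null row is an assumed policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical vertex representation: (label, vertex id).
using Vertex = std::pair<std::uint8_t, std::uint32_t>;

// Returns the row index of the first occurrence of each distinct vertex.
// Null rows (std::nullopt) are treated as one distinct value: only the
// first null row is kept, which is what the null_seen flag guards.
std::vector<std::size_t> dedup_with_null(
    const std::vector<std::optional<Vertex>>& rows) {
  std::vector<std::size_t> offsets;
  std::set<Vertex> seen;
  bool null_seen = false;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (!rows[i].has_value()) {
      if (!null_seen) {
        null_seen = true;
        offsets.push_back(i);
      }
      continue;  // never dereference a null row
    }
    if (seen.insert(*rows[i]).second) {
      offsets.push_back(i);
    }
  }
  return offsets;
}
```

Tracking nulls separately keeps them out of the set, so no sentinel vertex value has to be reserved to represent "null".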

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Dedup called] --> B{cols empty?}
    B -- yes --> C[Return ctx unchanged]
    B -- no --> D{Single column?}
    D -- yes --> E[Call generate_dedup_offset]
    E --> F{Returns true?}
    F -- yes --> G[Fast path: offsets populated]
    F -- "no - LOG ERROR fired" --> H[Clear offsets, use hash fallback]
    D -- "no, multi-column" --> H
    H --> I[Hash fallback: encode each row via get_elem]
    I --> J[offsets = unique row indices]
    G --> K[Build result Context]
    J --> K
    K --> L[reshuffle offsets]
    L --> M[Return ret]

    subgraph FastPathTypes["Fast-path column types"]
        FP1[SLVertexColumn - bitset]
        FP2[MSVertexColumn - set plus null flag - NEW]
        FP3[MLVertexColumn - set]
        FP4[PathColumn - sort dedup vector]
        FP5[ValueColumn - sort or set]
        FP6[SDSLEdgeColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn]
    end

    subgraph FallbackTypes["No fast path - hash fallback"]
        FB1[ArrowArrayContextColumn - base class LOG ERROR]
        FB2[MSEdgeColumn - LOG ERROR]
        FB3[ListColumn - LOG ERROR]
        FB4[StructColumn - LOG ERROR]
    end

Comments Outside Diff (1)

  1. src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

    P2 Fragile fallback when generate_dedup_offset fails mid-way

    The new control flow silently falls through to the slow-path else branch whenever generate_dedup_offset returns false. If a future generate_dedup_offset implementation partially populates offsets before returning false (e.g., due to an error mid-loop), the else branch would append additional entries onto the already-partially-populated vector, producing a corrupted, non-deduplicated result.

    All current false-returning implementations happen to leave offsets untouched, so this is not a bug today. But the contract is documented nowhere: there is no offsets.clear() guard at the top of the else branch, and no stated requirement that implementations of generate_dedup_offset leave offsets unchanged on failure.

    Consider adding a defensive offsets.clear() at the start of the else branch, or documenting the contract that generate_dedup_offset must not modify offsets on failure.

  2. src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

    P1 Missing offsets.clear() in fallback branch

    When cols.size() == 1 and generate_dedup_offset returns false, the code falls into the else branch. Currently all false-returning implementations leave offsets untouched (e.g. MSEdgeColumn, ListColumn, StructColumn), so the else block sees an empty vector and fills it correctly. However, the else block never explicitly clears offsets before appending to it.

    If any future generate_dedup_offset implementation partially populates offsets before discovering a failure and returning false (which is perfectly reasonable), the stale entries would be mixed with the fresh ones produced by the else block, causing duplicate or incorrect result rows.

    The fix is to add offsets.clear() at the start of the else block.

  3. include/neug/execution/common/columns/columns_utils.h, line 43 (link)

    P1 Out-of-bounds access when row_num == 0

    offsets.push_back(row_indices[0]) is called unconditionally. If row_num is zero, row_indices is empty and row_indices[0] is undefined behaviour. Although this is pre-existing code, this PR adds several new callers that now return true and reach this path (PathColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn), widening the exposure. A guard is needed.
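A guard of the kind requested for the empty-input case might look like the sketch below. It is a hypothetical reconstruction of a sort-based dedup helper, not the actual body of ColumnsUtils::generate_dedup_offset; the name `generate_dedup_offset_sorted` and the choice to return first-occurrence row indices in ascending order are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sort-based dedup helper: fills `offsets` with the index of
// the first occurrence of each distinct value, in ascending row order.
template <typename T>
void generate_dedup_offset_sorted(const std::vector<T>& values,
                                  std::vector<std::size_t>& offsets) {
  offsets.clear();
  if (values.empty()) {
    return;  // guard: row_indices[0] below would be out of bounds
  }
  std::vector<std::size_t> row_indices(values.size());
  std::iota(row_indices.begin(), row_indices.end(), std::size_t{0});
  // Order by value, breaking ties by row index so that the smallest
  // index of each group of equal values comes first.
  std::sort(row_indices.begin(), row_indices.end(),
            [&](std::size_t a, std::size_t b) {
              if (values[a] < values[b]) return true;
              if (values[b] < values[a]) return false;
              return a < b;
            });
  offsets.push_back(row_indices[0]);
  for (std::size_t i = 1; i < row_indices.size(); ++i) {
    // A strictly larger value than the previous one starts a new group.
    if (values[row_indices[i - 1]] < values[row_indices[i]]) {
      offsets.push_back(row_indices[i]);
    }
  }
  std::sort(offsets.begin(), offsets.end());
}
```

The early return costs one comparison and makes the zero-row case safe for every caller, rather than relying on each column type to check before calling.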

Last reviewed commit: "fix"

Greptile also left 2 inline comments on this PR.

@liulx20
Collaborator Author

liulx20 commented Mar 18, 2026

@greptile

@liulx20
Collaborator Author

liulx20 commented Mar 18, 2026

@greptile

@liulx20 liulx20 requested a review from shirly121 March 19, 2026 02:17

Development

Successfully merging this pull request may close these issues.

[BUG] Engine error: MSVertexColumn not implemented for dedup operator

1 participant