feat: add row_num interface to IRecordBatchSupplier (#79)
Open

liulx20 wants to merge 4 commits into alibaba:main from
Conversation
Greptile Summary
This PR adds a `row_num()` interface to `IRecordBatchSupplier` and uses it in `insert_vertices_impl` to pre-allocate indexer capacity before iterating batches, replacing the previous per-batch growth checks. While the intent is sound (avoid repeated reallocation), the implementation has several correctness and robustness problems that should be resolved before merging.

Key issues found:

- Compile break: `ODPSStreamRecordBatchSupplier` and `ODPSTableRecordBatchSupplier` in `odps_fragment_loader.h` inherit from `IRecordBatchSupplier` but do not implement the new pure virtual `row_num()`. The build will fail until these are updated.
- Signed-to-unsigned wrap (`vertex_table.h:280-285`): `supplier->row_num()` returns `int64_t`; adding it to a `size_t` with no sign check means a negative return (e.g. an uninitialized `row_num_`) wraps to `SIZE_MAX`, and the capacity growth loop can never terminate.
- Blind trust in `row_num()` (`vertex_table.h`): the per-batch capacity guard was removed entirely, so if the upfront count is inaccurate there is no fallback to prevent inserting past the allocated capacity.
- `row_num_` uninitialized (`loader_utils.cc:242`): `CSVStreamRecordBatchSupplier` does not initialize `row_num_` in its constructor initializer list; if the constructor exits early, the field holds an indeterminate value.
- `nullptr` passed for `arrow::StopToken` (`loader_utils.cc:257`): `CountRowsAsync` expects a value-type `StopToken`, not a pointer; this implicit conversion may not compile on all Arrow versions.
- Double file open (`loader_utils.cc`) and double scanner scan (`reader.cc`): the CSV file is opened and read twice in the constructor, and the Arrow scanner is fully iterated twice. For large datasets this doubles construction time.
- Contradictory error handling (`reader.cc:191-194`): the `WARNING` log says "Proceeding without row count", but the very next line throws an exception, turning a pre-allocation hint into a hard failure.
- `SupplierWrapperWithFirstBatch::row_num()` undercounts (`loader_utils.h:115`): the rows of `first_batch_` are not included in the sum when `has_first_batch_` is true.
- `GeneratedRecordBatchSupplier::row_num()` shrinks (`tests/unittest/utils.h:43`): `batches_` is drained by `pop_back()`, so the count decreases as batches are consumed rather than reflecting the original total.

Confidence Score: 1/5
Score reflects defects across every touched area: `include/neug/storages/loader/odps_fragment_loader.h` (compile break), `include/neug/storages/graph/vertex_table.h` (infinite-loop risk), `src/storages/loader/loader_utils.cc` (uninitialized field, double I/O), and `src/utils/reader/reader.cc` (double scan, incorrect error handling).

Important Files Changed
- `vertex_table.h`: pre-allocation now relies entirely on `row_num()` accuracy. The implicit `int64_t` to `size_t` addition at line 280 can wrap to `SIZE_MAX` on a negative return, causing the growth loop (line 283) to spin forever.
- `loader_utils.h`: adds the pure virtual `row_num()` that all in-tree concrete suppliers must implement; `SupplierWrapperWithFirstBatch::row_num()` silently omits the rows of `first_batch_`.
- The `ArrowRecordBatchStreamSupplier` constructor is now a breaking change requiring a second `row_num` argument; the ODPS subclasses (in `odps_fragment_loader.h`) are not updated and will fail to compile.
- `loader_utils.cc`: passes `nullptr` where Arrow's `CountRowsAsync` expects a value-type `arrow::StopToken`, risking a compile failure; `row_num_` is not initialized in the constructor initializer list, leaving it indeterminate if the constructor throws before the assignment.
- `reader.cc`: calls `CountRows()` before `ToRecordBatchReader()`. On `CountRows()` failure the function throws despite the log saying "Proceeding without row count", turning a hint into a hard requirement and breaking reads that could otherwise succeed.
- `tests/unittest/utils.h`: adds `row_num()` to the test helper, but iterates over `batches_`, which shrinks as `GetNextBatch()` pops elements, so the count decreases with each consumed batch rather than staying fixed at the total.
- `odps_fragment_loader.h`: `ODPSStreamRecordBatchSupplier` and `ODPSTableRecordBatchSupplier` both inherit from `IRecordBatchSupplier` but neither implements the newly-added pure virtual `row_num()`, making both classes abstract and preventing compilation.

Class Diagram
```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class IRecordBatchSupplier {
        <<abstract>>
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class SupplierWrapperWithFirstBatch {
        -suppliers_ vector
        -first_batch_ RecordBatch
        -has_first_batch_ bool
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class CSVStreamRecordBatchSupplier {
        -row_num_ int64_t
        -file_path_ string
        -reader_ StreamingReader
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class CSVTableRecordBatchSupplier {
        -table_ Table
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class ArrowRecordBatchArraySupplier {
        -arrays_ vector~vector~
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class ArrowRecordBatchStreamSupplier {
        -row_num_ int64_t
        -reader_ RecordBatchReader
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    class ODPSStreamRecordBatchSupplier {
        +GetNextBatch() RecordBatch*
        ❌ row_num() MISSING
    }
    class ODPSTableRecordBatchSupplier {
        -table_ Table
        +GetNextBatch() RecordBatch*
        ❌ row_num() MISSING
    }
    class GeneratedRecordBatchSupplier {
        -batches_ vector
        +GetNextBatch() RecordBatch*
        +row_num() int64_t
    }
    IRecordBatchSupplier <|-- SupplierWrapperWithFirstBatch
    IRecordBatchSupplier <|-- CSVStreamRecordBatchSupplier
    IRecordBatchSupplier <|-- CSVTableRecordBatchSupplier
    IRecordBatchSupplier <|-- ArrowRecordBatchArraySupplier
    IRecordBatchSupplier <|-- ArrowRecordBatchStreamSupplier
    IRecordBatchSupplier <|-- ODPSStreamRecordBatchSupplier
    IRecordBatchSupplier <|-- ODPSTableRecordBatchSupplier
    IRecordBatchSupplier <|-- GeneratedRecordBatchSupplier
```

Comments Outside Diff (2)
**src/storages/loader/loader_utils.cc, lines 242-265** (link): `row_num_` left uninitialized on counting failure

The member `row_num_` is not initialized in the constructor's initializer list, and there are two code paths where it is never assigned a value:

- `count_file_result.ok()` returns `false` (the outer `else` branch).
- `count_result.ok()` returns `false` (the inner `else` branch).

In both cases `row_num_` retains an indeterminate value, which is undefined behavior in C++. The `LOG(WARNING)` message even says "Proceeding with row_num_=0", but `row_num_` is never actually set to `0`. This means the subsequently computed `new_size` in `vertex_table.h` could be wildly large, causing either an out-of-memory `EnsureCapacity` call or an overflow of `new_size`.

Fix: initialize `row_num_` to `0` in the constructor initializer list.

**include/neug/storages/loader/odps_fragment_loader.h, lines 109-151** (link): missing `row_num()` overrides, compile break

`row_num()` is now declared pure virtual in `IRecordBatchSupplier`, but neither `ODPSStreamRecordBatchSupplier` (line 109) nor `ODPSTableRecordBatchSupplier` (line 133) implements it. Both classes are therefore abstract and cannot be instantiated; any attempt to do so will fail at compile time. `ODPSTableRecordBatchSupplier` owns a `table_` member (`std::shared_ptr<arrow::Table>`), so a trivial fix is to return `table_->num_rows()`. `ODPSStreamRecordBatchSupplier` does not load the table upfront, so a sensible fallback is needed. Both implementations must be added before this PR can compile.
Last reviewed commit: "fix"