GH-3011: Deny further writes after InternalParquetRecordWriter is aborted #3450
LuciferYang wants to merge 1 commit into apache:master from
Conversation
Hi @LuciferYang, since we already check `aborted` in `close()`, is the additional check in `write()` necessary?
Hi @Jiayi-Wang-db, thanks for the question! You're right that the `aborted` check in `close()` prevents a malformed file from being written. The concern here is more about silent data loss. Consider this scenario:

1. A `write()` call fails (e.g. OOM during page flush); the writer sets `aborted` to true and re-throws.
2. The user catches the exception and keeps calling `write()` on the same writer; every call is accepted without error.
3. `close()` sees `aborted` is true and skips the flush, so none of those records are persisted.
The user may not realize any data was lost until they read the file later and find records missing. By throwing an `IOException` immediately on the next `write()`, the failure surfaces at the point of misuse instead of going unnoticed.
Hi @LuciferYang, thanks for the clarification. Yes, it seems like it could silently swallow the exception.
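The silent-loss failure mode discussed above can be sketched with a simplified stand-in class (hypothetical, not the real `InternalParquetRecordWriter`; the real failure trigger would be something like an OOM during page flush):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of the pre-fix behavior: after a failed write
// sets the aborted flag, later writes are still buffered, but close()
// skips the flush, so they never reach the file.
class RecordWriterSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> persisted = new ArrayList<>();
    private boolean aborted = false;

    void write(String record, boolean failThisWrite) throws IOException {
        // Pre-fix behavior: no aborted check here, so records are
        // silently accepted even after a previous failure.
        buffer.add(record);
        if (failThisWrite) {
            aborted = true;
            throw new IOException("simulated OOM during page flush");
        }
    }

    void close() {
        if (aborted) {
            return; // flush skipped: buffered records are discarded
        }
        persisted.addAll(buffer);
    }

    List<String> persisted() {
        return persisted;
    }
}

public class SilentLossDemo {
    public static void main(String[] args) {
        RecordWriterSketch w = new RecordWriterSketch();
        try {
            w.write("r1", true); // fails and marks the writer aborted
        } catch (IOException e) {
            // user catches the error and, incorrectly, keeps writing
        }
        try {
            w.write("r2", false); // accepted without any error...
            w.write("r3", false);
        } catch (IOException e) {
            throw new AssertionError("pre-fix: no exception is thrown", e);
        }
        w.close();
        // ...but nothing was persisted: silent data loss
        System.out.println(w.persisted().size()); // prints 0
    }
}
```

The point is that from the caller's perspective every `write()` after the failure "succeeds", and only the final record count reveals the loss.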
Rationale for this change
After a write error (e.g. OOM during page flush), `InternalParquetRecordWriter` sets its `aborted` flag to true and re-throws the exception. However, subsequent calls to `write()` are silently accepted without checking this flag. Since `close()` skips flushing when `aborted` is true, all records written after the error are silently discarded: no exception, no warning. Users only discover the data loss when they attempt to read the file later and find records missing.

To clarify: the `aborted` check in `close()` does prevent a malformed file from being written (the flush is correctly skipped). The issue is silent data loss: writes appear to succeed but the data is never persisted, which can be difficult to diagnose after the fact.

What changes are included in this PR?
Added an `aborted` state check at the beginning of `write()`. If the writer has been aborted due to a previous error, an `IOException` is thrown immediately with a clear error message, preventing further writes to a writer in an undefined state.

Are these changes tested?
Yes. Added `testWriteAfterAbortShouldThrow` in `TestParquetWriterError` that verifies:

- `write()` after an abort throws an `IOException` with the expected message
- `close()` on an aborted writer completes without throwing

All existing tests in `parquet-hadoop` pass without modification.

Are there any user-facing changes?
Yes. Users who previously caught write exceptions and continued writing to the same `ParquetWriter` will now receive an `IOException` on subsequent write attempts. This is an intentional change to prevent silent data loss: the correct behavior after a write failure is to discard the writer and create a new one.

Closes #3011
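A minimal sketch of the new fail-fast behavior and the recommended recovery pattern, using a hypothetical simplified class rather than the actual parquet-java implementation:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the post-fix behavior: write() checks the
// aborted flag first, so a writer that failed once refuses further
// records instead of silently dropping them.
class GuardedWriterSketch {
    private final List<String> buffer = new ArrayList<>();
    private boolean aborted = false;

    void write(String record, boolean failThisWrite) throws IOException {
        if (aborted) {
            // the new guard: fail fast instead of silently buffering
            throw new IOException(
                "Writer was aborted by a previous error; create a new writer");
        }
        buffer.add(record);
        if (failThisWrite) {
            aborted = true;
            throw new IOException("simulated OOM during page flush");
        }
    }
}

public class FailFastDemo {
    public static void main(String[] args) throws IOException {
        GuardedWriterSketch w = new GuardedWriterSketch();
        try {
            w.write("r1", true); // fails and marks the writer aborted
        } catch (IOException e) {
            // the original failure, seen by the caller as before
        }
        try {
            w.write("r2", false); // post-fix: rejected immediately
            System.out.println("accepted");
        } catch (IOException e) {
            System.out.println("rejected");
        }
        // correct recovery: discard the failed writer, start fresh
        GuardedWriterSketch fresh = new GuardedWriterSketch();
        fresh.write("r2", false);
        System.out.println("recovered");
    }
}
```

Running this prints `rejected` then `recovered`: the misuse is surfaced at the second `write()` call, and a fresh writer proceeds normally.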