@Jiayi-Wang-db

Rationale for this change

Inside the finally block of InternalParquetRecordWriter::close, we call close on parquetFileWriter. If an exception is thrown earlier in close, the finally block still closes the file writer, which may flush incomplete data to the cloud.
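
A minimal sketch of the failure mode, using hypothetical stand-in classes (`SketchFileWriter`, `TornWriteSketch`) rather than the real Parquet writers. It assumes that closing the file writer is what finalizes the file by writing a footer. With close in the finally block, a failed flush still produces a file that looks complete; with close on the success path only, the partial output is left without a footer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a file writer whose close() finalizes the file with a footer. */
class SketchFileWriter implements AutoCloseable {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();

    void writeRows(String rows) throws IOException {
        out.write(rows.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        // In this sketch, the footer is what makes the file look complete to readers.
        out.write("|FOOTER".getBytes(StandardCharsets.UTF_8));
    }
}

class TornWriteSketch {
    /** Stand-in for flushing buffered rows; fails halfway when asked to. */
    static void flush(SketchFileWriter w, boolean fail) throws IOException {
        w.writeRows("partial");
        if (fail) {
            throw new IOException("simulated flush failure");
        }
        w.writeRows("+rest");
    }

    /** Old shape: close() runs in finally, so the footer is written even after a failure. */
    static String closeInFinally(boolean fail) {
        SketchFileWriter w = new SketchFileWriter();
        try {
            try {
                flush(w, fail);
            } finally {
                w.close();
            }
        } catch (IOException e) {
            // Ignored in the sketch; only the written bytes are inspected.
        }
        return w.out.toString();
    }

    /** New shape: close() is reached only when the flush succeeded. */
    static String closeOnSuccessOnly(boolean fail) {
        SketchFileWriter w = new SketchFileWriter();
        try {
            flush(w, fail);
            w.close();
        } catch (IOException e) {
            // Ignored in the sketch; only the written bytes are inspected.
        }
        return w.out.toString();
    }
}
```

In this sketch, `closeInFinally(true)` yields partial rows followed by a footer, while `closeOnSuccessOnly(true)` yields partial rows only.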

What changes are included in this PR?

Move the parquetFileWriter.close call out of the finally block and add a unit test.

Are these changes tested?

Yes.

Are there any user-facing changes?

Users no longer get incomplete Parquet files caused by torn writes.

Closes #3350

@wgtmac
Member

wgtmac commented Oct 24, 2025

This looks reasonable to me. WDYT? @gszadovszky @Fokko

this.footer = new ParquetMetadata(new FileMetaData(schema, extraMetaData, Version.FULL_VERSION), blocks);
serializeFooter(footer, out, fileEncryptor, metadataConverter);
} catch (Exception e) {
aborted = true;
Contributor

We do not want to swallow the exception; just set the flag and re-throw.

Contributor

We should probably do the same pattern for every public method that may throw an exception.
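
A rough sketch of what "set the flag and re-throw" could look like when applied to more than one public method. `AbortAwareWriter` and its methods are hypothetical and only illustrate the pattern; they are not ParquetFileWriter's actual code, and having `close()` call `end()` is an assumption of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical writer; names and structure are illustrative only. */
class AbortAwareWriter implements AutoCloseable {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private boolean aborted = false;
    private boolean footerWritten = false;

    /** Public method 1: on failure, set the flag and re-throw the same exception. */
    public void writeBlock(String block) throws IOException {
        try {
            if (block == null) {
                throw new IOException("simulated block failure");
            }
            out.write(block.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            aborted = true;
            throw e; // not swallowed; callers still see the original exception
        }
    }

    /** Public method 2: the same pattern around writing the footer. */
    public void end() throws IOException {
        try {
            out.write("|FOOTER".getBytes(StandardCharsets.UTF_8));
            footerWritten = true;
        } catch (Exception e) {
            aborted = true;
            throw e;
        }
    }

    @Override
    public void close() throws IOException {
        // An aborted writer must not finalize the output.
        if (!aborted && !footerWritten) {
            end();
        }
    }

    public boolean isAborted() {
        return aborted;
    }

    public String contents() {
        return out.toString();
    }
}
```

Because `e` is never reassigned, rethrowing it from `catch (Exception e)` keeps the method signature at `throws IOException`.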

Author

Jiayi-Wang-db commented Oct 24, 2025

I’m not familiar with the direct buffer change, but in InternalParquetRecordWriter, there’s only one place where aborted is marked. Is that the only place that could cause an aborted write? If so, we don’t need to apply the same pattern to every public method in ParquetFileWriter.

Author

It does look that way. The write function in InternalParquetRecordWriter is the only public function that can throw an exception (except close). So after we mark it as aborted there and abort the file write in the close call, we should cover all cases.
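
To make the two-layer flow concrete, here is a small sketch under stated assumptions: `SketchParquetFile` and `SketchRecordWriter` are hypothetical stand-ins for the file-level and record-level writers, and `abort()` is an assumed method name. The record writer marks itself aborted when `write` fails, and its `close` tells the file writer to abort before closing it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical file-level writer that records what happened to it. */
class SketchParquetFile {
    final List<String> events = new ArrayList<>();
    private boolean aborted = false;

    void abort() {
        aborted = true;
        events.add("abort");
    }

    void close() {
        // An aborted file is closed without being finalized.
        events.add(aborted ? "close-without-footer" : "close-with-footer");
    }
}

/** Hypothetical record-level writer; write() is the only method besides close() that can fail. */
class SketchRecordWriter implements AutoCloseable {
    private final SketchParquetFile file;
    private boolean aborted = false;

    SketchRecordWriter(SketchParquetFile file) {
        this.file = file;
    }

    void write(String record) throws IOException {
        try {
            if (record.isEmpty()) {
                throw new IOException("simulated write failure");
            }
            file.events.add("write:" + record);
        } catch (Throwable t) {
            aborted = true; // remember the failure for close()
            throw t;
        }
    }

    @Override
    public void close() {
        if (aborted) {
            file.abort(); // propagate the aborted state to the file writer
        }
        file.close();
    }
}
```

After one successful write, one failed write, and a close, the recorded events are `write:a`, `abort`, `close-without-footer`.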

Comment on lines 143 to 145
+ AutoCloseables.uncheckedClose(parquetFileWriter);
  } finally {
- AutoCloseables.uncheckedClose(columnStore, pageStore, bloomFilterWriteStore, parquetFileWriter);
+ AutoCloseables.uncheckedClose(columnStore, pageStore, bloomFilterWriteStore);
Contributor

Now that we have ParquetFileWriter to handle the "aborted" state, this change can be reverted.

Author

Yes, I haven't finished my change yet.

Contributor

Sorry, then. I was too fast. 😄
Ping me when you're ready.

Author

Thanks for the quick review!

Contributor

gszadovszky left a comment

> I’m not familiar with the direct buffer change, but in InternalParquetRecordWriter, there’s only one place where aborted is marked. Is that the only place that could cause an aborted write? If so, we don’t need to apply the same pattern to every public method in ParquetFileWriter.

This is true for your workflow, where ParquetFileWriter is only used via InternalParquetRecordWriter. But ParquetFileWriter is a public class and is also used directly in other workflows. For a more complete fix, it would be nicer to handle that case as well.

}
}

/* Mark the writer as aborted to avoid flushing incomplete data to the cloud. */
Contributor

nit: "to the cloud" is not required. That is only one use-case.

Comment on lines +1839 to +1840
} catch (IOException e) {
throw e;
Contributor

This is not required.
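
For illustration, a small sketch of why a catch clause that only rethrows adds nothing, assuming no later catch clause on the same try would otherwise intercept the exception. `RethrowSketch` and its methods are hypothetical and not taken from the PR.

```java
import java.io.IOException;

class RethrowSketch {
    static void fail() throws IOException {
        throw new IOException("boom");
    }

    /** Catches and immediately rethrows: callers see the same exception as without the catch. */
    static void withRedundantCatch() throws IOException {
        try {
            fail();
        } catch (IOException e) {
            throw e;
        }
    }

    /** Equivalent behavior with the catch clause removed. */
    static void withoutCatch() throws IOException {
        fail();
    }

    /** Helper that reports the message observed by a caller of either variant. */
    static String messageOf(boolean redundant) {
        try {
            if (redundant) {
                withRedundantCatch();
            } else {
                withoutCatch();
            }
            return "no exception";
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

Both `messageOf(true)` and `messageOf(false)` observe the same `IOException`.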

Development

Successfully merging this pull request may close these issues.

Avoid flushing data to cloud when exception is thrown in parquet writer close