[PECOBLR-1121] Arrow patch to circumvent Arrow issues with JDK 16+ #1243
Merged
tejassp-db merged 161 commits into main on Mar 2, 2026
Patch Arrow to create a Databricks ArrowBuf that allocates memory on the heap and provides access to it through Java methods. This removes the need to specify "--add-opens=java.base/java.nio=ALL-UNNAMED" as a JVM argument on JDK 16+.
Added tests to validate the Arrow patch code paths. Added Maven profiles to validate the behaviour across JVM versions, with and without the "--add-opens=java.base/java.nio=ALL-UNNAMED" JVM argument. By default, JVM version 11 is assumed. To use other JVM versions, the toolchain needs to be set up in .m2/toolchains.xml to point to the correct Java installations on the local machine.
Use native Arrow if available. Otherwise fall back to the patched version.
Remove irrelevant reference counting in the patch code. The patch code uses heap memory for Arrow operations, so reference counting is not required.
Add unit tests for all public API.
Remove redundant todos for accounting.
Add a JMH benchmark for Arrow parsing with patched and unpatched Arrow buffers and buffer allocators.
Convert the code to a multi-module project. This gives a cleaner separation of JAR generation for the uber JAR and the normal/thin JAR with some patched Arrow changes, and adds test modules with tests for the shaded JARs.
Tests to verify that all dependencies are shaded as expected.
Add tests to handle all data types supported by Arrow.
Patch DecimalUtility to not use unsafe methods to set decimal values on DatabricksArrowBuf.
Add tests for Boolean, Null, Fixed size list, UTF-8 view, Binary view, list view, large list view types.
Remove the default profile of JDK 11. Do not fail on GitHub Actions.
Add a boolean field to specify whether the patched Arrow code is being used in the JVM to parse Arrow responses.
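The DecimalUtility item above is about writing decimal values without unsafe memory access. As a minimal sketch of that idea, assuming Arrow's 128-bit decimal layout (a 16-byte little-endian two's complement unscaled value), the bytes can be produced with plain Java arithmetic and then copied into a heap buffer. The class and method names here are hypothetical and are not the PR's patched DecimalUtility.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper for illustration; not the patched DecimalUtility from this PR. */
final class DecimalBytesSketch {
    static final int DECIMAL128_BYTES = 16;

    /** Encodes the unscaled value as a 16-byte little-endian two's complement integer. */
    static byte[] toLittleEndian128(BigDecimal value) {
        // BigInteger.toByteArray() returns the minimal big-endian two's complement form.
        byte[] bigEndian = value.unscaledValue().toByteArray();
        if (bigEndian.length > DECIMAL128_BYTES) {
            throw new IllegalArgumentException("value does not fit in 128 bits");
        }
        byte[] out = new byte[DECIMAL128_BYTES];
        // Sign-extend: pad with 0xFF for negative values, 0x00 otherwise.
        Arrays.fill(out, value.signum() < 0 ? (byte) 0xFF : (byte) 0x00);
        for (int i = 0; i < bigEndian.length; i++) {
            out[i] = bigEndian[bigEndian.length - 1 - i]; // reverse the byte order
        }
        return out;
    }

    /** Decodes the bytes back into a BigDecimal, given the column's scale. */
    static BigDecimal fromLittleEndian128(byte[] bytes, int scale) {
        byte[] bigEndian = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            bigEndian[i] = bytes[bytes.length - 1 - i];
        }
        return new BigDecimal(new BigInteger(bigEndian), scale);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("-123.45");
        byte[] encoded = toLittleEndian128(original);
        System.out.println(fromLittleEndian128(encoded, original.scale()));
    }
}
```

Because the result is an ordinary byte array, it can be written into a heap-backed buffer through normal array or ByteBuffer operations, with no raw memory addresses involved.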
The current GitHub Actions runs won't pass, because the current GitHub workflows are set up for a single-module Maven project. I have a separate branch that enables these test runs, and I have run them from there.
The Databricks server shares query results in Arrow format for easy cross-language interoperability. The JDBC driver experiences compatibility issues with JDK 16 and later versions when processing Arrow results.
This problem arises from stricter encapsulation of internal APIs in newer Java versions, which affects how the driver consumes Arrow-format results through the Apache Arrow library. The JDBC driver is also used in partner solutions that do not control the runtime environment, so the workaround of setting JVM arguments is not feasible there.
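To make the encapsulation issue concrete, here is a small probe, offered as a sketch rather than as code from the driver. It attempts the kind of reflective access into java.nio that JDK 16+ refuses unless "--add-opens=java.base/java.nio=ALL-UNNAMED" is set. The class name NioAccessProbe is hypothetical, and the choice of the java.nio.Buffer address field as the probe target is an assumption for illustration.

```java
import java.lang.reflect.Field;
import java.nio.Buffer;

/** Hypothetical probe for illustration; not part of the driver or of Arrow. */
final class NioAccessProbe {

    /**
     * Returns true if reflection is allowed to open the non-public field
     * java.nio.Buffer.address. On JDK 16+ this is only the case when the JVM
     * was started with --add-opens=java.base/java.nio=ALL-UNNAMED.
     */
    static boolean canOpenJavaNio() {
        try {
            Field address = Buffer.class.getDeclaredField("address");
            // Throws InaccessibleObjectException when java.nio is not opened.
            address.setAccessible(true);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // NoSuchFieldException, InaccessibleObjectException (JDK 9+) or SecurityException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.nio opened for reflection: " + canOpenJavaNio());
    }
}
```

A check of this shape is one way a library could decide at startup whether an off-heap code path is usable, without requiring the caller to change JVM arguments.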
This PR patches some of the Arrow code to provide alternative JVM heap-based byte allocators that do not use native MemoryUtil-based direct reads from off-heap memory. This implementation uses the native Arrow code path if feasible, else falls back to the patched code. All the code has been tested for read compatibility with all Arrow types, latency benchmarks have been run, and automated tests have been added as well.
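The heap-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's DatabricksArrowBuf: the class HeapBufSketch and its method names are hypothetical, and it covers only a few fixed-width accessors. It stores data in a heap-allocated ByteBuffer in little-endian order (Arrow's default byte order) and relies on ByteBuffer's own bounds checks instead of raw addresses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical heap-backed buffer for illustration; not the PR's DatabricksArrowBuf. */
final class HeapBufSketch {
    // Backed by a byte[] on the JVM heap, so the garbage collector reclaims it
    // and no reference counting is needed.
    private final ByteBuffer data;

    HeapBufSketch(int capacity) {
        this.data = ByteBuffer.allocate(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    int capacity() { return data.capacity(); }

    // Absolute accessors: ByteBuffer checks bounds, and no Unsafe calls are involved.
    void setByte(int index, byte value) { data.put(index, value); }
    byte getByte(int index) { return data.get(index); }
    void setInt(int index, int value) { data.putInt(index, value); }
    int getInt(int index) { return data.getInt(index); }
    void setLong(int index, long value) { data.putLong(index, value); }
    long getLong(int index) { return data.getLong(index); }

    public static void main(String[] args) {
        HeapBufSketch buf = new HeapBufSketch(16);
        buf.setInt(0, 0x01020304);
        buf.setLong(8, -1L);
        System.out.println(buf.getInt(0) == 0x01020304);
        System.out.println(buf.getLong(8) == -1L);
    }
}
```

The trade-off in a design like this is an extra copy from the wire into heap memory and bounds-checked access on every read, which is presumably why the description mentions latency benchmarks alongside the compatibility tests.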
During the course of this change it also became necessary to convert the project into a multi-module Maven project.