GH-3414: Add parseJson to VariantBuilder for JSON-to-Variant conversion#3415
gaurav7261 wants to merge 1 commit into apache:master
Conversation
@alamb @aihuaxu I read https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/ and checked the feasibility of having our S3 sink connector write Variant. It turns out parseJson could be a better fit here. WDYT, does it make sense?
…uctured fields

Write Kafka Connect JSON fields (Debezium CDC, Confluent Protobuf Struct, custom messages, maps, arrays) as native Parquet VARIANT columns instead of plain STRING, improving storage efficiency and query performance.

- Upgrade parquet-java to 1.17.0 for VARIANT logical type support - https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/
- Add config: parquet.variant.enabled, parquet.variant.connect.names, parquet.variant.field.names
- Auto-detect recursive schemas (google.protobuf.Struct) as VARIANT
- Stream JSON-to-Variant conversion ported from Apache Spark (PR also raised in parquet-java; once approved, the code here will be neat: apache/parquet-java#3415)
- Unwrap Protobuf Struct/Value/ListValue/map-as-array to clean JSON
- Graceful fallback for non-JSON values (e.g. __debezium_unavailable_value)
- Feature is fully opt-in (disabled by default), zero impact on existing connectors
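For illustration, the new options from that commit might be wired into a connector configuration like this; the three keys come from the commit message above, while the values and the comma-separated list format are assumptions:

```properties
# Opt-in switch; the feature is disabled by default
parquet.variant.enabled=true
# Connect schema names to write as VARIANT (list format assumed)
parquet.variant.connect.names=google.protobuf.Struct
# Specific record fields to force to VARIANT (hypothetical field names)
parquet.variant.field.names=payload,metadata
```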
@Fokko can you please review? Does it look good to you?
FWIW, we have a similar method in Rust in case that is interesting.
Force-pushed 1274733 to 643ac64
@julienledem thanks for the call, I have added the notice; please review.
@gszadovszky can you please review as well?
I'm not familiar with this code yet, but I think it is worth adding. @emkornfield @gene-db @rdblue WDYT?
steveloughran left a comment:
I think the JSON parsing should be in a class alongside VariantBuilder, rather than inside it: some VariantJsonParser class.
- ensures that use of Jackson classes is isolated; the core variant does not need Jackson on the classpath
- if there is anything that can be done to improve single-line JSON content performance, it can be done there, as its life will span the whole file rather than a row

Once it's split out, a new question surfaces: should it belong in parquet-jackson? Personally, I think it should.
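A minimal sketch of what that split could look like, assuming the single-pass recursion moves over unchanged. The class name VariantJsonParser and the canonical dump format are hypothetical: the real builder calls are replaced with string output here so the example needs only jackson-core, while still showing the recursive shape (one token current on entry, the whole value consumed on exit) that buildJson() would have:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;

// Hypothetical standalone parser class: only this class links against
// Jackson, so VariantBuilder itself stays Jackson-free on the classpath.
public final class VariantJsonParser {
  private static final JsonFactory FACTORY = new JsonFactory();

  private VariantJsonParser() {}

  // Parses a JSON string and returns a canonical dump of its structure.
  public static String dump(String json) throws IOException {
    try (JsonParser parser = FACTORY.createParser(json)) {
      parser.nextToken(); // position on the first value token
      StringBuilder out = new StringBuilder();
      buildValue(parser, out);
      return out.toString();
    }
  }

  // Recursive single-pass walk; in the real split, calls on a
  // VariantBuilder instance would replace the string appends.
  private static void buildValue(JsonParser parser, StringBuilder out) throws IOException {
    switch (parser.currentToken()) {
      case START_OBJECT:
        out.append("obj(");
        while (parser.nextToken() != JsonToken.END_OBJECT) {
          out.append(parser.currentName()).append('=');
          parser.nextToken(); // advance from the field name to its value
          buildValue(parser, out);
          out.append(';');
        }
        out.append(')');
        break;
      case START_ARRAY:
        out.append("arr(");
        while (parser.nextToken() != JsonToken.END_ARRAY) {
          buildValue(parser, out);
          out.append(';');
        }
        out.append(')');
        break;
      case VALUE_STRING:
        out.append("str:").append(parser.getText());
        break;
      case VALUE_NUMBER_INT:
        out.append("int:").append(parser.getLongValue());
        break;
      case VALUE_NUMBER_FLOAT:
        out.append("dec:").append(parser.getText());
        break;
      case VALUE_TRUE:
      case VALUE_FALSE:
        out.append("bool:").append(parser.getBooleanValue());
        break;
      case VALUE_NULL:
        out.append("null");
        break;
      default:
        throw new IOException("Unexpected token: " + parser.currentToken());
    }
  }
}
```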
    This project includes code from Apache Spark with the following copyright
    notice:

        Apache Spark
Do cross-ASF projects need this credit? What I do think is good is to ensure the original authors get credit in the final commit message.
    import org.junit.Assert;
    import org.junit.Test;

    public class TestVariantParseJson {
- nested object parsing?
- what about invalid json?
- empty file
- not a json file
- incomplete
- large json with many values
     * @throws IOException if the JSON is malformed or an I/O error occurs
     */
    public static Variant parseJson(String json) throws IOException {
      try (JsonParser parser = JSON_FACTORY.createParser(json)) {
     * @return the parsed Variant
     * @throws IOException if the JSON is malformed or an I/O error occurs
     */
    public static Variant parseJson(String json) throws IOException {
I was going to suggest accepting CharSequence for feeding in from other places (string fields within Avro, ...), but it looks like Jackson 2 doesn't support that itself.
     */
    public class VariantBuilder {

      private static final JsonFactory JSON_FACTORY = new JsonFactory();
The factory should be built with some constraints so that feeding it a malicious JSON file is rejected rather than triggering OOM problems, etc. From the Javadocs:
    JsonFactory f = JsonFactory.builder()
        .streamReadConstraints(
            StreamReadConstraints.builder()
                .maxNestingDepth(500)
                .maxStringLength(10_000_000)
                .maxDocumentLength(5_000_000)
                .build()
        )
        .build();
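To make the risk concrete, here is a small self-contained check (assuming jackson-core 2.15+, where maxNestingDepth is available) showing that a constrained factory rejects overly deep input instead of recursing through it. The class name and the limits are illustrative only; a real caller would hand the constrained parser to the parseJson(JsonParser) overload instead of draining tokens:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.core.exc.StreamConstraintsException;
import java.io.IOException;

public class ConstrainedFactoryDemo {
  // Factory with deliberately tight, illustrative limits.
  static final JsonFactory FACTORY = JsonFactory.builder()
      .streamReadConstraints(StreamReadConstraints.builder()
          .maxNestingDepth(10)
          .maxStringLength(10_000)
          .build())
      .build();

  // Returns true if the document parses fully within the constraints,
  // false if the parser aborts with a constraint violation.
  static boolean parsesWithinLimits(String json) throws IOException {
    try (JsonParser parser = FACTORY.createParser(json)) {
      while (parser.nextToken() != null) {
        // a real caller would pass 'parser' to VariantBuilder.parseJson(parser)
      }
      return true;
    } catch (StreamConstraintsException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    String deep = "[".repeat(50) + "]".repeat(50); // 50 nested arrays
    System.out.println(parsesWithinLimits("{\"a\":[1,2,3]}")); // true
    System.out.println(parsesWithinLimits(deep));             // false
  }
}
```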
Rationale for this change
Every consumer of parquet-variant currently has to independently implement JSON-to-Variant parsing. Apache Spark has one in its common/variant module (source), our Kafka Connect S3 sink connector had to write one, and any other project (Flink, Trino, DuckDB-Java, etc.) would need to do the same. Since VariantBuilder already provides all the low-level append*() primitives, parseJson() is the natural completion of that API: a canonical, reusable entry point for the most common use case, converting a JSON string into a Variant.
What changes are included in this PR?
- parquet-variant/pom.xml: Added jackson-core (compile) and parquet-jackson (runtime) dependencies, following the same pattern as parquet-hadoop.
- VariantBuilder.java: Added two public static methods:
  - parseJson(String json): convenience method that creates a Jackson streaming parser internally.
  - parseJson(JsonParser parser): for callers who already have a positioned parser (e.g., reading from a stream).
- VariantBuilder.buildJson:
  - buildJson(): recursive single-pass streaming parser handling OBJECT, ARRAY, STRING, NUMBER_INT, NUMBER_FLOAT, TRUE, FALSE, NULL.
  - appendSmallestLong(): selects the smallest integer type (BYTE/SHORT/INT/LONG) based on value range.
  - tryAppendDecimal(): decimal-first encoding for floating-point numbers; falls back to double only for scientific notation or values exceeding DECIMAL16 precision (38 digits).
- TestVariantParseJson.java: 32 new tests covering all primitive types, objects (empty, simple, nested, null values, sorted keys, duplicate keys), arrays (empty, simple, nested, mixed types), and edge cases (unicode, escaped strings, deeply nested documents, scientific notation, integer overflow to decimal, malformed JSON).
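The integer and decimal selection rules described above can be sketched in isolation. The enum, the method names, and the standalone 38-digit precision check below are a simplified stand-in for the actual VariantBuilder internals, not the real implementation:

```java
import java.math.BigDecimal;

public class VariantNumberEncoding {
  // Stand-ins for the Variant integer primitive type IDs.
  public enum IntType { BYTE, SHORT, INT, LONG }

  // Sketch of the range check appendSmallestLong() performs:
  // pick the narrowest integer type that can hold the value.
  public static IntType smallestIntType(long v) {
    if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return IntType.BYTE;
    if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return IntType.SHORT;
    if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) return IntType.INT;
    return IntType.LONG;
  }

  // Sketch of the decimal-first rule described for tryAppendDecimal():
  // a plain decimal literal fits DECIMAL16 if it needs at most 38 digits
  // of precision; scientific notation falls back to double.
  public static boolean fitsDecimal16(String literal) {
    if (literal.indexOf('e') >= 0 || literal.indexOf('E') >= 0) return false;
    return new BigDecimal(literal).precision() <= 38;
  }

  public static void main(String[] args) {
    System.out.println(smallestIntType(100));     // BYTE
    System.out.println(smallestIntType(40_000));  // INT
    System.out.println(fitsDecimal16("12.34"));   // true
    System.out.println(fitsDecimal16("1.2e10"));  // false
  }
}
```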
Are these changes tested?
Yes. 32 new tests in TestVariantParseJson.

Are there any user-facing changes?
Yes. Two new public static methods on VariantBuilder:

- VariantBuilder.parseJson(String json): returns a Variant
- VariantBuilder.parseJson(JsonParser parser): returns a Variant

These are additive API additions with no breaking changes to existing APIs.
Closes #3414: parseJson implementation in VariantBuilder