Speed up JSON schema inference by ~2.8x #9494

Open
Rafferty97 wants to merge 10 commits into apache:main from Rafferty97:json-schema

Conversation

Contributor

Rafferty97 commented Feb 28, 2026

Which issue does this PR close?

This PR fixes #9484 and also lays the groundwork for implementing #9482. It also delivers an approximately 2.8x speed-up in JSON schema inference.

I have refactored the code that infers the schema of JSON sources, specifically:

  • Simplified the type inference logic, removing special cases
  • Schema inference now consumes TapeDecoder directly, eliminating the need to first materialise rows into serde_json::Values
  • Used arena allocation for efficiency
  • Removed scalar-to-array coercion, as the actual JSON reader doesn't support it
  • Moved ValueIter into its own module
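The effect of removing the special cases can be sketched with a toy unification function. The names below are hypothetical; the real logic lives in arrow-json/src/reader/schema.rs and handles many more types:

```rust
// Minimal sketch of type unification without scalar-to-array promotion.
// Hypothetical names, not the actual arrow-json internals.
#[derive(Debug, Clone, PartialEq)]
enum InferredType {
    Null,
    Boolean,
    Float64,
    Utf8,
    List(Box<InferredType>),
}

fn unify(a: InferredType, b: InferredType) -> Result<InferredType, String> {
    use InferredType::*;
    match (a, b) {
        // Null unifies with anything.
        (Null, t) | (t, Null) => Ok(t),
        // Identical types unify trivially.
        (a, b) if a == b => Ok(a),
        // Lists unify element-wise.
        (List(a), List(b)) => Ok(List(Box::new(unify(*a, *b)?))),
        // A scalar mixed with a list is now an error rather than being
        // promoted to a list, since the JSON reader cannot read such data.
        (a, b) => Err(format!("Expected {a:?}, found {b:?}")),
    }
}

fn main() {
    use InferredType::*;
    assert_eq!(unify(Null, Float64), Ok(Float64));
    assert_eq!(
        unify(List(Box::new(Null)), List(Box::new(Utf8))),
        Ok(List(Box::new(Utf8)))
    );
    // Mixing a scalar and an array fails instead of coercing.
    assert!(unify(Float64, List(Box::new(Float64))).is_err());
}
```

A single recursive rule like this replaces the per-case promotion logic, which is also what makes the error messages collapse into one "Expected {expected}, found {got}" template.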

Rationale for this change

While working on #9482, I saw a need and opportunity to refactor the schema inference code for JSON schemas. I also discovered the bug detailed in #9484.

These changes not only make the code more readable and predictable by eliminating a lot of special-case handling, but also make it trivial to create a new inference function for "single field" JSON reading.

They have also provided a significant performance boost to the schema inference functions. I added a simple benchmark for infer_json_schema, which yielded the following results on my machine, reflecting an approximately 2.8x speed-up:

Before changes:
infer_json_schema/1000 time: [1.4443 ms 1.4616 ms 1.4793 ms]
thrpt: [85.336 MiB/s 86.366 MiB/s 87.401 MiB/s]

After changes:
infer_json_schema/1000 time: [517.79 µs 519.10 µs 520.54 µs]
thrpt: [242.51 MiB/s 243.18 MiB/s 243.80 MiB/s]
change:
time: [−64.919% −64.485% −64.043%] (p = 0.00 < 0.05)
thrpt: [+178.11% +181.57% +185.06%]

What changes are included in this PR?

At a glance:

  • An overhaul of arrow-json/src/reader/schema.rs
  • Removed mixed_arrays.json as it's no longer valid, and replaced mixed_arrays.json.gz with arrays.json.gz
  • Added a dependency on Bumpalo for arena allocation

Because this is a somewhat sizeable PR, I've done my best to break it into a logical sequence of commits to hopefully assist with the review.

Are these changes tested?

Yes, the changes pass all existing unit tests, except for one that was intentionally removed due to the change in behaviour related to #9484 (removing scalar-to-array promotion).

I have also added an additional benchmark for the schema inference performance.

Are there any user-facing changes?

There are no API changes, except for the addition of the record_count method on ValueIter.

However, the error messages returned by infer_json_schema and its cousins will change significantly, with most of them condensed to a single "Expected {expected}, found {got}" template.

Finally, some files that used to generate a valid schema will now return errors. However, this is desirable because those files would have failed to be read by the actual JSON reader anyway - due to the lack of support for scalar-to-array promotion in the JSON reader. (See #9484)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 28, 2026
@Rafferty97 Rafferty97 changed the title Refactor and improve performance of JSON schema inference Speed up JSON schema inference by ~2.8x Mar 2, 2026
Dandandan pushed a commit that referenced this pull request Mar 13, 2026
# Which issue does this PR close?

Split out from #9494 to make review easier. It simply adds a benchmark
for JSON schema inference.

# Rationale for this change

I have an open PR that significantly refactors the JSON schema inference
code, so I want confidence that not only is the new code correct, but
also has better performance than the existing code.

# What changes are included in this PR?

Adds a benchmark.

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
alamb pushed a commit that referenced this pull request Mar 18, 2026
…ion (#9557)

# Which issue does this PR close?

Another smaller PR extracted from #9494.

# Rationale for this change

I've moved `ValueIter` into its own module because it's already
self-contained, and because that will make it easier to review the
changes I have made to `arrow-json/src/reader/schema.rs`.

I've also added a public `record_count` function to `ValueIter`, which can be used to simplify consuming code in DataFusion that currently tracks the count separately.

# What changes are included in this PR?

* Moved `ValueIter` into own module
* Added `record_count` method to `ValueIter`

# Are these changes tested?

Yes.

# Are there any user-facing changes?

Addition of one new public method, `ValueIter::record_count`.
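The idea behind record_count can be sketched with a hypothetical counting wrapper; this is not the actual ValueIter implementation, which also handles IO and parse errors:

```rust
// Sketch of an iterator that tracks how many records it has yielded,
// analogous in spirit to ValueIter::record_count. Hypothetical code,
// not the actual arrow-json implementation.
struct CountingIter<I> {
    inner: I,
    count: usize,
}

impl<I> CountingIter<I> {
    fn new(inner: I) -> Self {
        Self { inner, count: 0 }
    }

    /// Number of records yielded so far, so callers don't have to
    /// track it separately.
    fn record_count(&self) -> usize {
        self.count
    }
}

impl<I: Iterator> Iterator for CountingIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.count += 1;
        }
        item
    }
}

fn main() {
    let mut it = CountingIter::new(["{}", "{}", "{}"].into_iter());
    it.next();
    it.next();
    assert_eq!(it.record_count(), 2);
}
```

Exposing the count on the iterator itself is what lets downstream consumers such as DataFusion drop their own bookkeeping.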
Contributor

alamb commented Mar 20, 2026

@Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

Contributor Author

> @Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

Done :)

Contributor

alamb commented Mar 20, 2026

run benchmark json_reader

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4098746251-480-66r56 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing json-schema (1b5d16b) to 322f9ce (merge-base) diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                        json-schema                            main
-----                                        -----------                            ----
decode_binary_hex_json                       1.00     13.9±0.08ms        ? ?/sec    1.00     13.9±0.07ms        ? ?/sec
decode_binary_view_hex_json                  1.00     13.8±0.07ms        ? ?/sec    1.02     14.1±0.07ms        ? ?/sec
decode_fixed_binary_hex_json                 1.00     13.5±0.08ms        ? ?/sec    1.02     13.8±0.07ms        ? ?/sec
decode_list_long_i64_json/131072             1.01    309.1±0.53ms   253.3 MB/sec    1.00    306.9±1.07ms   255.1 MB/sec
decode_list_long_i64_serialize               1.00    187.4±5.41ms        ? ?/sec    1.02    190.3±4.78ms        ? ?/sec
decode_list_short_i64_json/131072            1.00     20.0±0.03ms   261.4 MB/sec    1.00     19.9±0.02ms   262.5 MB/sec
decode_list_short_i64_serialize              1.02     11.1±0.20ms        ? ?/sec    1.00     10.9±0.19ms        ? ?/sec
decode_wide_object_i64_json                  1.03   485.0±14.84ms        ? ?/sec    1.00    470.5±4.64ms        ? ?/sec
decode_wide_object_i64_serialize             1.00   429.9±11.51ms        ? ?/sec    1.00   431.4±13.72ms        ? ?/sec
decode_wide_projection_full_json/131072      1.01    794.5±5.45ms   219.0 MB/sec    1.00   787.8±24.54ms   220.9 MB/sec
decode_wide_projection_narrow_json/131072    1.02    452.1±0.91ms   384.9 MB/sec    1.00    443.2±2.55ms   392.6 MB/sec
infer_json_schema/1000                       1.00    733.2±1.71µs   172.2 MB/sec    2.11  1547.1±11.79µs    81.6 MB/sec
large_bench_primitive                        1.00   1535.1±2.36µs        ? ?/sec    1.00   1530.5±2.61µs        ? ?/sec
small_bench_list                             1.00      8.0±0.02µs        ? ?/sec    1.01      8.1±0.03µs        ? ?/sec
small_bench_primitive                        1.00      4.4±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
small_bench_primitive_with_utf8view          1.00      4.4±0.01µs        ? ?/sec    1.01      4.5±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 308.1s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 289.7s
CPU sys 18.1s
Disk read 0 B
Disk write 1.5 GiB

branch

Metric Value
Wall time 311.7s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 294.3s
CPU sys 17.3s
Disk read 0 B
Disk write 1.0 MiB

Contributor

alamb commented Mar 20, 2026

> infer_json_schema/1000 1.00 733.2±1.71µs 172.2 MB/sec 2.11 1547.1±11.79µs 81.6 MB/sec

That is certainly a nice result ❤️

Contributor

@alamb alamb left a comment


Thanks again for this @Rafferty97 and for your patience

I took a look at the PR. My major comments are:

  1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
  2. Can you ensure the behavior is the same as the existing code?

If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

@@ -1,4 +0,0 @@
{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
Contributor

What is the purpose of removing this file?

Contributor Author

This file was used by tests related to an inference rule that coerces a mix of scalar and array values into an array type. I've removed this rule because the JSON reader can't actually do this coercion, so I figured it was better to error out instead.

I could reinstate these files and test that they cause schema inference to fail - but I'm unsure how useful that actually is?

Contributor

Given this file is so small (133 bytes), can you please unzip it to make the contents more explicit and easier to review and track changes

Contributor Author

The contents are identical to arrays.json. I was following the pattern set by mixed_arrays.json(.gz). I needed to create this file for tests that previously used mixed_arrays.json(.gz) which were deleted. Those files were deleted because they aren't readable by the JSON reader - they rely on coercion semantics that no longer exist.

assert_eq!(small_field.data_type(), &DataType::Float64);
}

#[test]
Contributor

why is this test removed?

Contributor Author

These tests pertain to coercion logic that I've removed because it is inconsistent with the JSON reader, which is incapable of performing these coercions.

}

/// The type of a JSON value
pub enum JsonType {
Contributor

I found it strange that the Json type and tape value are now in the infer module -- they seem more widely applicable than just for schema inference

Contributor Author

That's fair. I've moved them to a separate module within schema for better code organisation.

@Rafferty97
Contributor Author

> Thanks again for this @Rafferty97 and for your patience
>
> I took a look at the PR. My major comments are:
>
>   1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
>   2. Can you ensure the behavior is the same as the existing code?
>
> If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

Hi @alamb, thank you for taking a look over the PR and for the detailed feedback.

The LazyLocks are an optimisation to avoid allocating a bunch of identical Arcs for the primitive types. You're right that this warrants some explanatory comments.
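As a rough illustration of that pattern (assumed names, not the actual arrow-json code; `LazyLock` requires Rust 1.80+):

```rust
// Illustration of using LazyLock to share one Arc per primitive type
// instead of allocating a fresh Arc for every occurrence during
// inference. Hypothetical names, not the actual arrow-json code.
use std::sync::{Arc, LazyLock};

#[derive(Debug, PartialEq)]
enum DataType {
    Float64,
}

static FLOAT64: LazyLock<Arc<DataType>> =
    LazyLock::new(|| Arc::new(DataType::Float64));

fn float64() -> Arc<DataType> {
    // Cloning an Arc only bumps a reference count; no new allocation.
    Arc::clone(&FLOAT64)
}

fn main() {
    let a = float64();
    let b = float64();
    // Both handles point at the same shared allocation.
    assert!(Arc::ptr_eq(&a, &b));
}
```

The win is small per call but adds up when inference touches every value in every row.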

The behaviour intentionally diverges from the existing code, because the existing code would perform coercions that the actual JSON reader itself doesn't do. So, when such a JSON file is encountered, the previous code would infer successfully but the actual reading into record batches would fail. This new code would return an error at inference time, which I think is more useful and less surprising to the end user.


Labels

arrow Changes to the arrow crate


Development

Successfully merging this pull request may close these issues.

JSON reader doesn't support scalar-to-list promotion, even though schema inference does

4 participants