Speed up JSON schema inference by ~2.8x #9494

Open
Rafferty97 wants to merge 10 commits into apache:main from Rafferty97:json-schema

Conversation

Contributor

Rafferty97 commented Feb 28, 2026

Which issue does this PR close?

This PR fixes #9484 and also lays the groundwork for implementing #9482. It also delivers an approximately 2.8x speed-up in JSON schema inference.

I have refactored the code that infers the schema of JSON sources, specifically:

  • Simplified the type inference logic, removing special cases
  • Schema inference now consumes TapeDecoder directly, eliminating the need to first materialise rows into serde_json::Values
  • Used arena allocation for efficiency
  • Removed scalar-to-array coercion, as the actual JSON reader doesn't support it
  • Moved ValueIter into its own module
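The effect of removing the special cases can be sketched with a toy unification function. The names below are hypothetical; the real logic lives in arrow-json/src/reader/schema.rs and handles many more types:

```rust
// Minimal sketch of type unification without scalar-to-array promotion.
// Hypothetical names, not the actual arrow-json internals.
#[derive(Debug, Clone, PartialEq)]
enum InferredType {
    Null,
    Boolean,
    Float64,
    Utf8,
    List(Box<InferredType>),
}

fn unify(a: InferredType, b: InferredType) -> Result<InferredType, String> {
    use InferredType::*;
    match (a, b) {
        // Null unifies with anything.
        (Null, t) | (t, Null) => Ok(t),
        // Identical types unify trivially.
        (a, b) if a == b => Ok(a),
        // Lists unify element-wise.
        (List(a), List(b)) => Ok(List(Box::new(unify(*a, *b)?))),
        // A scalar mixed with a list is now an error rather than being
        // promoted to a list, since the JSON reader cannot read such data.
        (a, b) => Err(format!("Expected {a:?}, found {b:?}")),
    }
}

fn main() {
    use InferredType::*;
    assert_eq!(unify(Null, Float64), Ok(Float64));
    assert_eq!(
        unify(List(Box::new(Null)), List(Box::new(Utf8))),
        Ok(List(Box::new(Utf8)))
    );
    // Mixing a scalar and an array fails instead of coercing.
    assert!(unify(Float64, List(Box::new(Float64))).is_err());
}
```

A single recursive rule like this replaces the per-case promotion logic, which is also what makes the error messages collapse into one "Expected {expected}, found {got}" template.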

Rationale for this change

While working on #9482, I saw a need and opportunity to refactor the schema inference code for JSON schemas. I also discovered the bug detailed in #9484.

These changes not only make the code more readable and predictable by eliminating a lot of special-case handling, but also make it trivial to create a new inference function for "single field" JSON reading.

They have also provided a significant performance boost to the schema inference functions. I added a simple benchmark for infer_json_schema, which yielded the following results on my machine, reflecting an approximately 2.8x speed-up:

Before changes:
infer_json_schema/1000 time: [1.4443 ms 1.4616 ms 1.4793 ms]
thrpt: [85.336 MiB/s 86.366 MiB/s 87.401 MiB/s]

After changes:
infer_json_schema/1000 time: [517.79 µs 519.10 µs 520.54 µs]
thrpt: [242.51 MiB/s 243.18 MiB/s 243.80 MiB/s]
change:
time: [−64.919% −64.485% −64.043%] (p = 0.00 < 0.05)
thrpt: [+178.11% +181.57% +185.06%]

What changes are included in this PR?

At a glance:

  • An overhaul of arrow-json/src/reader/schema.rs
  • Removed mixed_arrays.json as it's no longer valid, and replaced mixed_arrays.json.gz with arrays.json.gz
  • Added a dependency on Bumpalo for arena allocation

Because this is a somewhat sizeable PR, I've done my best to break it into a logical sequence of commits to hopefully assist with the review.

Are these changes tested?

Yes, the changes pass all existing unit tests, except for one that was intentionally removed due to the change in behaviour related to #9484 (removing scalar-to-array promotion).

I have also added an additional benchmark for the schema inference performance.

Are there any user-facing changes?

There are no API changes, except for the addition of the record_count method on ValueIter.

However, the error messages returned by infer_json_schema and its cousins will change significantly, with most of them condensed to a single "Expected {expected}, found {got}" template.

Finally, some files that used to generate a valid schema will now return errors. However, this is desirable because those files would have failed to be read by the actual JSON reader anyway - due to the lack of support for scalar-to-array promotion in the JSON reader. (See #9484)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 28, 2026
@Rafferty97 Rafferty97 changed the title Refactor and improve performance of JSON schema inference Speed up JSON schema inference by ~2.8x Mar 2, 2026
Dandandan pushed a commit that referenced this pull request Mar 13, 2026
# Which issue does this PR close?

Split out from #9494 to make review easier. It simply adds a benchmark
for JSON schema inference.

# Rationale for this change

I have an open PR that significantly refactors the JSON schema inference
code, so I want confidence that not only is the new code correct, but
also has better performance than the existing code.

# What changes are included in this PR?

Adds a benchmark.

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
alamb pushed a commit that referenced this pull request Mar 18, 2026
…ion (#9557)

# Which issue does this PR close?

Another smaller PR extracted from #9494.

# Rationale for this change

I've moved `ValueIter` into its own module because it's already
self-contained, and because that will make it easier to review the
changes I have made to `arrow-json/src/reader/schema.rs`.

I've also added a public `record_count` function to `ValueIter`, which can be used to simplify consuming code in DataFusion that currently tracks the count separately.

# What changes are included in this PR?

* Moved `ValueIter` into own module
* Added `record_count` method to `ValueIter`

# Are these changes tested?

Yes.

# Are there any user-facing changes?

Addition of one new public method, `ValueIter::record_count`.
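The idea behind record_count can be sketched with a hypothetical counting wrapper; this is not the actual ValueIter implementation, which also handles IO and parse errors:

```rust
// Sketch of an iterator that tracks how many records it has yielded,
// analogous in spirit to ValueIter::record_count. Hypothetical code,
// not the actual arrow-json implementation.
struct CountingIter<I> {
    inner: I,
    count: usize,
}

impl<I> CountingIter<I> {
    fn new(inner: I) -> Self {
        Self { inner, count: 0 }
    }

    /// Number of records yielded so far, so callers don't have to
    /// track it separately.
    fn record_count(&self) -> usize {
        self.count
    }
}

impl<I: Iterator> Iterator for CountingIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.count += 1;
        }
        item
    }
}

fn main() {
    let mut it = CountingIter::new(["{}", "{}", "{}"].into_iter());
    it.next();
    it.next();
    assert_eq!(it.record_count(), 2);
}
```

Exposing the count on the iterator itself is what lets downstream consumers such as DataFusion drop their own bookkeeping.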
Contributor

alamb commented Mar 20, 2026

@Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

Contributor Author

> @Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

Done :)

Contributor

alamb commented Mar 20, 2026

run benchmark json_reader

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4098746251-480-66r56 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing json-schema (1b5d16b) to 322f9ce (merge-base) diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                        json-schema                            main
-----                                        -----------                            ----
decode_binary_hex_json                       1.00     13.9±0.08ms        ? ?/sec    1.00     13.9±0.07ms        ? ?/sec
decode_binary_view_hex_json                  1.00     13.8±0.07ms        ? ?/sec    1.02     14.1±0.07ms        ? ?/sec
decode_fixed_binary_hex_json                 1.00     13.5±0.08ms        ? ?/sec    1.02     13.8±0.07ms        ? ?/sec
decode_list_long_i64_json/131072             1.01    309.1±0.53ms   253.3 MB/sec    1.00    306.9±1.07ms   255.1 MB/sec
decode_list_long_i64_serialize               1.00    187.4±5.41ms        ? ?/sec    1.02    190.3±4.78ms        ? ?/sec
decode_list_short_i64_json/131072            1.00     20.0±0.03ms   261.4 MB/sec    1.00     19.9±0.02ms   262.5 MB/sec
decode_list_short_i64_serialize              1.02     11.1±0.20ms        ? ?/sec    1.00     10.9±0.19ms        ? ?/sec
decode_wide_object_i64_json                  1.03   485.0±14.84ms        ? ?/sec    1.00    470.5±4.64ms        ? ?/sec
decode_wide_object_i64_serialize             1.00   429.9±11.51ms        ? ?/sec    1.00   431.4±13.72ms        ? ?/sec
decode_wide_projection_full_json/131072      1.01    794.5±5.45ms   219.0 MB/sec    1.00   787.8±24.54ms   220.9 MB/sec
decode_wide_projection_narrow_json/131072    1.02    452.1±0.91ms   384.9 MB/sec    1.00    443.2±2.55ms   392.6 MB/sec
infer_json_schema/1000                       1.00    733.2±1.71µs   172.2 MB/sec    2.11  1547.1±11.79µs    81.6 MB/sec
large_bench_primitive                        1.00   1535.1±2.36µs        ? ?/sec    1.00   1530.5±2.61µs        ? ?/sec
small_bench_list                             1.00      8.0±0.02µs        ? ?/sec    1.01      8.1±0.03µs        ? ?/sec
small_bench_primitive                        1.00      4.4±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
small_bench_primitive_with_utf8view          1.00      4.4±0.01µs        ? ?/sec    1.01      4.5±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 308.1s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 289.7s
CPU sys 18.1s
Disk read 0 B
Disk write 1.5 GiB

branch

Metric Value
Wall time 311.7s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 294.3s
CPU sys 17.3s
Disk read 0 B
Disk write 1.0 MiB

Contributor

alamb commented Mar 20, 2026

> infer_json_schema/1000 1.00 733.2±1.71µs 172.2 MB/sec 2.11 1547.1±11.79µs 81.6 MB/sec

That is certainly a nice result ❤️

Contributor

@alamb alamb left a comment


Thanks again for this @Rafferty97 and for your patience

I took a look at the PR. My major comments are:

  1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
  2. Can you ensure the behavior is the same as the existing code?

If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

@@ -1,4 +0,0 @@
{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
Contributor

What is the purpose of removing this file?

Contributor Author

This file was used by tests related to an inference rule that coerces a mix of scalar and array values into an array type. I've removed this rule because the JSON reader can't actually do this coercion, so I figured it was better to error out instead.

I could reinstate these files and test that they cause schema inference to fail - but I'm unsure how useful that actually is?

Contributor

Given this file is so small (133 bytes), can you please unzip it to make the contents more explicit and easier to review and track changes

Contributor Author

The contents are identical to arrays.json. I was following the pattern set by mixed_arrays.json(.gz). I needed to create this file for tests that previously used mixed_arrays.json(.gz) which were deleted. Those files were deleted because they aren't readable by the JSON reader - they rely on coercion semantics that no longer exist.

assert_eq!(small_field.data_type(), &DataType::Float64);
}

#[test]
Contributor

why is this test removed?

Contributor Author

These tests pertain to coercion logic that I've removed because it is inconsistent with the JSON reader, which is incapable of performing these coercions.

}

/// The type of a JSON value
pub enum JsonType {
Contributor

I found it strange that the Json type and tape value are now in the infer module -- they seem more widely applicable than just for schema inference

Contributor Author

That's fair. I've moved them to a separate module within schema for better code organisation.

@Rafferty97
Contributor Author

> Thanks again for this @Rafferty97 and for your patience
>
> I took a look at the PR. My major comments are:
>
>   1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
>   2. Can you ensure the behavior is the same as the existing code?
>
> If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

Hi @alamb, thank you for taking a look over the PR and for the detailed feedback.

The LazyLocks are an optimisation to avoid allocating a bunch of identical Arcs for the primitive types. You're right that this warrants some explanatory comments.
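As a rough illustration of that pattern (assumed names, not the actual arrow-json code; `LazyLock` requires Rust 1.80+):

```rust
// Illustration of using LazyLock to share one Arc per primitive type
// instead of allocating a fresh Arc for every occurrence during
// inference. Hypothetical names, not the actual arrow-json code.
use std::sync::{Arc, LazyLock};

#[derive(Debug, PartialEq)]
enum DataType {
    Float64,
}

static FLOAT64: LazyLock<Arc<DataType>> =
    LazyLock::new(|| Arc::new(DataType::Float64));

fn float64() -> Arc<DataType> {
    // Cloning an Arc only bumps a reference count; no new allocation.
    Arc::clone(&FLOAT64)
}

fn main() {
    let a = float64();
    let b = float64();
    // Both handles point at the same shared allocation.
    assert!(Arc::ptr_eq(&a, &b));
}
```

The win is small per call but adds up when inference touches every value in every row.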

The behaviour intentionally diverges from the existing code, because the existing code would perform coercions that the actual JSON reader itself doesn't do. So, when such a JSON file is encountered, the previous code would infer successfully but the actual reading into record batches would fail. This new code would return an error at inference time, which I think is more useful and less surprising to the end user.


Labels

arrow Changes to the arrow crate


Development

Successfully merging this pull request may close these issues.

JSON reader doesn't support scalar-to-list promotion, even though schema inference does

4 participants