Speed up JSON schema inference by ~2.8x #9494
Rafferty97 wants to merge 10 commits into apache:main
Conversation
Force-pushed from 0dad3c7 to 0b881ac
# Which issue does this PR close?

Split out from #9494 to make review easier. It simply adds a benchmark for JSON schema inference.

# Rationale for this change

I have an open PR that significantly refactors the JSON schema inference code, so I want confidence that the new code is not only correct, but also performs better than the existing code.

# What changes are included in this PR?

Adds a benchmark.

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
…ion (#9557)

# Which issue does this PR close?

Another smaller PR extracted from #9494.

# Rationale for this change

I've moved `ValueIter` into its own module because it's already self-contained, and because that will make it easier to review the changes I have made to `arrow-json/src/reader/schema.rs`. I've also added a public `record_count` function to `ValueIter`, which can be used to simplify consuming code in DataFusion that currently tracks the count separately.

# What changes are included in this PR?

* Moved `ValueIter` into its own module
* Added `record_count` method to `ValueIter`

# Are these changes tested?

Yes.

# Are there any user-facing changes?

Addition of one new public method, `ValueIter::record_count`.
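To illustrate the intent behind `record_count`, here is a std-only toy: an iterator wrapper that counts the records it has yielded, so callers no longer need to track the count themselves. `CountingIter` is a hypothetical name for illustration, not the actual `ValueIter` implementation.

```rust
/// Toy iterator wrapper that counts yielded items, sketching the idea
/// behind `ValueIter::record_count` (not the real arrow-json code).
struct CountingIter<I> {
    inner: I,
    record_count: usize,
}

impl<I> CountingIter<I> {
    fn new(inner: I) -> Self {
        Self { inner, record_count: 0 }
    }

    /// Number of records yielded so far.
    fn record_count(&self) -> usize {
        self.record_count
    }
}

impl<I: Iterator> Iterator for CountingIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.record_count += 1;
        }
        item
    }
}

fn main() {
    let mut it = CountingIter::new(["{\"a\":1}", "{\"a\":2}"].into_iter());
    while it.next().is_some() {}
    println!("records: {}", it.record_count()); // prints "records: 2"
}
```

A consumer that previously kept its own counter alongside the iterator can instead ask the iterator after draining it.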
@Rafferty97, can you please merge up this PR to resolve the conflicts, and then we can run the benchmarks again to confirm the results.
Done :)
run benchmark json_reader
🤖 Arrow criterion benchmark running (GKE)
🤖 Arrow criterion benchmark completed (GKE)
That is certainly a nice result ❤️
alamb left a comment
Thanks again for this @Rafferty97, and for your patience.
I took a look at the PR. My major comments are:
- Can you please document the design / rationale (and why `LazyLock`s are being used for what seem to be very small enums)?
- Can you ensure the behavior is the same as the existing code?

If we want to change the inference behavior, I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.
@@ -1,4 +0,0 @@
{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
What is the purpose of removing this file?
This file was used by tests related to an inference rule that coerces a mix of scalar and array values into an array type. I've removed this rule because the JSON reader can't actually do this coercion, so I figured it was better to error out instead.
I could reinstate these files and test that they cause schema inference to fail, but I'm unsure how useful that would actually be.
Given this file is so small (133 bytes), can you please unzip it to make the contents more explicit and easier to review and track changes
The contents are identical to arrays.json. I was following the pattern set by mixed_arrays.json(.gz). I needed to create this file for tests that previously used mixed_arrays.json(.gz), which were deleted. Those files were deleted because they aren't readable by the JSON reader: they rely on coercion semantics that no longer exist.
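The removed coercion rule can be illustrated with a toy shape-unification function (all names here are illustrative, not the arrow-json API). The old code would have unified a scalar with an array by promoting the scalar to an array; the new code errors out instead, matching what the reader can actually decode.

```rust
/// Toy model of the inference change: whether two observed JSON value
/// shapes can be unified into a single column type.
#[derive(Debug, Clone, PartialEq)]
enum Shape {
    Scalar,
    Array(Box<Shape>),
}

fn unify(a: Shape, b: Shape) -> Result<Shape, String> {
    match (a, b) {
        (Shape::Scalar, Shape::Scalar) => Ok(Shape::Scalar),
        (Shape::Array(a), Shape::Array(b)) => Ok(Shape::Array(Box::new(unify(*a, *b)?))),
        // The old behaviour promoted Scalar to Array(Scalar) here; the JSON
        // reader cannot perform that coercion, so inference now fails instead.
        (a, b) => Err(format!("Expected {:?}, found {:?}", a, b)),
    }
}

fn main() {
    // A column containing both `1` and `[2.0, 1.3]` is now an inference error.
    let mixed = unify(Shape::Scalar, Shape::Array(Box::new(Shape::Scalar)));
    assert!(mixed.is_err());
    println!("mixed scalar/array rejected: {:?}", mixed);
}
```

Failing at inference time is preferable to inferring a schema the reader subsequently cannot honour.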
assert_eq!(small_field.data_type(), &DataType::Float64);
}

#[test]
These tests pertain to coercion logic that I've removed, because it was inconsistent with the JSON reader, which is incapable of performing these coercions.
}

/// The type of a JSON value
pub enum JsonType {
I found it strange that the Json type and tape value are now in the infer module -- they seem more widely applicable than just for schema inference
That's fair. I've moved them to a separate module within schema for better code organisation.
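For context, here is a minimal sketch of what a `JsonType` enum plus a type-widening rule could look like during inference. The variant names and the `widen` helper are hypothetical, not the actual arrow-json definitions.

```rust
/// Hypothetical sketch of a JSON value type observed during inference.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JsonType {
    Bool,
    Int,
    Float,
    Str,
}

/// Combine the types seen for one field across rows. Integers widen to
/// floats; any other mismatch is incompatible and should surface an error.
fn widen(a: JsonType, b: JsonType) -> Option<JsonType> {
    use JsonType::*;
    match (a, b) {
        (x, y) if x == y => Some(x),
        (Int, Float) | (Float, Int) => Some(Float),
        _ => None,
    }
}

fn main() {
    // A field seen as `1` in one row and `2.5` in another infers as Float.
    println!("{:?}", widen(JsonType::Int, JsonType::Float)); // prints "Some(Float)"
    // A field seen as `true` and `"x"` has no common type.
    println!("{:?}", widen(JsonType::Bool, JsonType::Str)); // prints "None"
}
```

Keeping such a type in a shared module makes sense if, as the review notes, it is useful beyond schema inference.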
Hi @alamb, thank you for taking a look over the PR and for the detailed feedback.

The behaviour intentionally diverges from the existing code, because the existing code would perform coercions that the actual JSON reader itself doesn't do. So, when such a JSON file is encountered, the previous code would infer a schema successfully but the actual reading into record batches would fail. The new code returns an error at inference time instead, which I think is more useful and less surprising to the end user.
# Which issue does this PR close?

This PR fixes #9484, and also lays the groundwork for implementing #9482. It also delivers an approximate 2.8x speed-up to JSON schema inference.
I have refactored the code that infers the schema of JSON sources, specifically:
* Schema inference now operates directly on the `TapeDecoder` output, eliminating the need to materialise rows into `serde_json::Value`s first
* Moved `ValueIter` into its own module

# Rationale for this change
While working on #9482, I saw a need and opportunity to refactor the schema inference code for JSON schemas. I also discovered the bug detailed in #9484.
These changes not only make the code more readable and predictable by eliminating a lot of special-case handling, but also make it trivial to create a new inference function for "single field" JSON reading.
They have also provided a significant performance boost to the schema inference functions. I added a simple benchmark for `infer_json_schema`, which yielded the following results on my machine, reflecting an approx. 2.8x speed-up.

Before changes:

    infer_json_schema/1000  time:   [1.4443 ms 1.4616 ms 1.4793 ms]
                            thrpt:  [85.336 MiB/s 86.366 MiB/s 87.401 MiB/s]

After changes:

    infer_json_schema/1000  time:   [517.79 µs 519.10 µs 520.54 µs]
                            thrpt:  [242.51 MiB/s 243.18 MiB/s 243.80 MiB/s]
                    change:
                            time:   [−64.919% −64.485% −64.043%] (p = 0.00 < 0.05)
                            thrpt:  [+178.11% +181.57% +185.06%]
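As a sanity check on the numbers above, the ~2.8x figure follows directly from the median times: a ~64.5% reduction in time is the same thing as throughput multiplying by roughly 1 / (1 − 0.645) ≈ 2.82.

```rust
fn main() {
    // Median times taken from the criterion output above.
    let before_ms = 1.4616_f64; // 1.4616 ms per iteration before
    let after_ms = 0.519_10_f64; // 519.10 µs per iteration after

    // Speed-up factor and percentage change in time.
    let speedup = before_ms / after_ms;
    let time_change_pct = (after_ms - before_ms) / before_ms * 100.0;

    println!("speedup: {:.2}x", speedup); // prints "speedup: 2.82x"
    println!("time change: {:.1}%", time_change_pct); // prints "time change: -64.5%"
}
```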
# What changes are included in this PR?

At a glance:

Because this is a somewhat sizeable PR, I've done my best to break it into a logical sequence of commits to hopefully assist with the review.
# Are these changes tested?

Yes, the changes pass all existing unit tests, except for one intentionally removed due to the change in behaviour related to #9484 (removing scalar-to-array promotion).

I have also added a benchmark for schema inference performance.
# Are there any user-facing changes?

There are no API changes, except for the addition of the `record_count` method on `ValueIter`.

However, the error messages returned by `infer_json_schema` and its cousins will change significantly, with most of them condensed to a single "Expected {expected}, found {got}" template.

Finally, some files that used to generate a valid schema will now return errors. However, this is desirable, because those files would have failed to be read by the actual JSON reader anyway, due to the reader's lack of support for scalar-to-array promotion. (See #9484)
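The condensed error template described above could be produced by a helper along these lines (the function name `type_mismatch` is hypothetical; only the "Expected {expected}, found {got}" shape comes from the PR description):

```rust
/// Hypothetical helper producing the condensed error message shape
/// described in the PR: "Expected {expected}, found {got}".
fn type_mismatch(expected: &str, got: &str) -> String {
    format!("Expected {expected}, found {got}")
}

fn main() {
    let err = type_mismatch("number", "string");
    println!("{err}"); // prints "Expected number, found string"
}
```

Funnelling most failures through one template makes the inference errors uniform, at the cost of changing messages that downstream code may have matched on.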