-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder in json-reader
#9497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
7daefe6
bfb0ee5
c9ada27
30529df
749da82
8eea661
4016aa0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,10 +17,10 @@ | |
|
|
||
| use crate::reader::tape::{Tape, TapeElement}; | ||
| use crate::reader::{ArrayDecoder, DecoderContext}; | ||
| use arrow_array::OffsetSizeTrait; | ||
| use arrow_array::builder::{BooleanBufferBuilder, BufferBuilder}; | ||
| use arrow_buffer::buffer::NullBuffer; | ||
| use arrow_data::{ArrayData, ArrayDataBuilder}; | ||
| use arrow_array::{Array, GenericListArray, OffsetSizeTrait, make_array}; | ||
| use arrow_buffer::{OffsetBuffer, ScalarBuffer, buffer::NullBuffer}; | ||
| use arrow_data::ArrayData; | ||
| use arrow_schema::{ArrowError, DataType}; | ||
| use std::marker::PhantomData; | ||
|
|
||
|
|
@@ -91,17 +91,17 @@ impl<O: OffsetSizeTrait> ArrayDecoder for ListArrayDecoder<O> { | |
| offsets.append(offset) | ||
| } | ||
|
|
||
| let child_data = self.decoder.decode(tape, &child_pos)?; | ||
| let field = match &self.data_type { | ||
| DataType::List(f) | DataType::LargeList(f) => f.clone(), | ||
| _ => unreachable!(), | ||
| }; | ||
| // SAFETY: offsets are built monotonically starting from 0 | ||
| let offsets = | ||
| unsafe { OffsetBuffer::<O>::new_unchecked(ScalarBuffer::from(offsets.finish())) }; | ||
| let values = make_array(self.decoder.decode(tape, &child_pos)?); | ||
| let nulls = nulls.as_mut().map(|x| NullBuffer::new(x.finish())); | ||
|
|
||
| let data = ArrayDataBuilder::new(self.data_type.clone()) | ||
| .len(pos.len()) | ||
| .nulls(nulls) | ||
| .add_buffer(offsets.finish()) | ||
| .child_data(vec![child_data]); | ||
|
|
||
| // Safety | ||
| // Validated lengths above | ||
| Ok(unsafe { data.build_unchecked() }) | ||
| let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?; | ||
| Ok(array.into_data()) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this call simply creates an ArrayData (which is necessary given the API) so I am not sure it actually avoids any ArrayDatas In order to avoid ArrayData we would probably need to change the signature of decode to return an ArrayRef directly (rather than ArrayData)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense. I can change the signature of all the decoders if you think it's worth a try
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes I do think it makes sense - - it would be nice to find some way to do that incrementally but if not then we may just have to do one big PR 🤔 |
||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does
try_newvalidate the offsets ? That could be a significant performance hitBasically as long as this doesn't do additional validation I think it looks good to me
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it will do validation, but I think it's cheap and the local benchmark does not cause a regression. BTW, we don't have an unchecked API for
ListArray, thenewAPI simply unwrap thetry_new.arrow-rs/arrow-array/src/array/list_array.rs
Lines 205 to 251 in e4b68e6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmarks on our runner seem to suggest that this PR is slower for some reason 🤔