Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently arrow-avro can write OCF container files and SOE (Single‑Object Encoding) streams, and it can read OCF and framed streams (SOE / Confluent / Apicurio). It cannot write or read unframed Avro binary "datum" payloads (i.e., raw Avro record bodies without an SOE/registry prefix or OCF header). This makes it difficult to interoperate with systems that exchange naked Avro bodies while providing the schema out‑of‑band (configuration, RPC contract, topic metadata, etc.).
Concretely:
- Writer: there is no `Writer` format that emits only the Avro body bytes per record. SOE always adds a 2‑byte magic + fingerprint (or ID) prefix, and OCF writes a file header/blocks.
- Reader: `ReaderBuilder::build_decoder` requires a `SchemaStore` and expects a frame prefix; when the prefix is missing it errors with "Missing magic bytes and fingerprint." This prevents decoding raw Avro bodies when the schema is known upfront.
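To make the difference concrete, here is a minimal sketch of the two byte layouts, written without `arrow-avro`. The record shape (`id: long`, `name: string`), the helper names, and the fingerprint value passed in are all hypothetical; the zigzag varint encoding and the SOE layout (2‑byte magic, 8‑byte little‑endian fingerprint, body) follow the Avro specification.

```rust
/// Zigzag + variable-length encoding that Avro uses for `int` and `long`.
fn write_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push((z as u8 & 0x7F) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Avro `string`: byte length as a long, followed by the UTF-8 bytes.
fn write_string(s: &str, out: &mut Vec<u8>) {
    write_long(s.len() as i64, out);
    out.extend_from_slice(s.as_bytes());
}

/// Raw (unframed) body for a hypothetical record {id: long, name: string}.
/// Fields are concatenated in schema order, with no header of any kind.
fn encode_body(id: i64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(id, &mut out);
    write_string(name, &mut out);
    out
}

/// SOE framing around the same body: magic, fingerprint, then the body.
/// The fingerprint is supplied by the caller; computing it is out of scope here.
fn encode_soe(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0xC3, 0x01];
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}
```

The output of `encode_body` is what this issue calls a Binary datum: a consumer cannot decode it without already knowing the schema, which is why the schema has to travel out‑of‑band.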
Describe the solution you'd like
Add first‑class Binary (unframed) format support to both the writer and the reader:
- Writer: new unframed stream format
  - In `arrow-avro/src/writer/format.rs`:
    - Implement a const‑generic `AvroStreamFormat<const PREFIXED: bool>` templated from the current `AvroSoeFormat` implementation.
    - Alias `type AvroSoeFormat = AvroStreamFormat<true>` and `type AvroBinaryFormat = AvroStreamFormat<false>`. The second alias will implement the new `AvroBinaryFormat` format without code duplication.
  - In `arrow-avro/src/writer/mod.rs`, add a public alias called `AvroRawStreamWriter` as a convenience mirroring `AvroStreamWriter`.

  Rationale: the existing `AvroFormat` abstraction already distinguishes framed vs unframed by `NEEDS_PREFIX` and `sync_marker()`; the new format simply sets `NEEDS_PREFIX = false` and writes nothing at stream start, yielding only Avro bodies from `Writer::write_stream`.
- Reader: opt‑in unframed decoding via `ReaderBuilder::with_reader_schema`
  - Enable `ReaderBuilder::build_decoder` to construct a `Decoder` for unframed raw binary when a reader schema is provided without a `SchemaStore`:
    - In `arrow-avro/src/reader/mod.rs`:
      - Builder rule: If `writer_schema_store` is `None` and `reader_schema` is `Some`, `build_decoder()` creates a decoder pre‑configured for unframed inputs. The `reader_schema` is assumed to be identical to the writer schema and no schema resolution is supported.
      - Decoder state: Add a small toggle (e.g., `unframed: bool` or `enum PrefixMode { Framed, Unframed }`). When `unframed == true`, `decode()` must skip `handle_prefix` and immediately try to decode exactly 1 row body via `active_decoder.decode(&data[..], 1)`, respecting `batch_size`, and return consumed bytes accordingly. The current hard error path "Missing magic bytes and fingerprint" should not trigger in this mode.
    - Safety / behavior:
      - If the byte stream does start with a known framing prefix (SOE/Confluent/Apicurio), return a clear `ArrowError::AvroError("Unexpected framed prefix in unframed (Binary) mode")` to avoid ambiguous behavior.
      - If neither `SchemaStore` nor `reader_schema` is provided, keep returning `InvalidArgumentError` (existing documented behavior) to guide users.
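The writer half of the proposal could look roughly like the sketch below. It is standalone and does not reproduce the real `AvroFormat` trait: `StreamFormat` and `write_record` are hypothetical stand‑ins, and only the names `AvroStreamFormat`, `AvroSoeFormat`, `AvroBinaryFormat`, and `NEEDS_PREFIX` come from the text above.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the crate's format abstraction.
trait StreamFormat {
    const NEEDS_PREFIX: bool;
}

/// A single implementation, parameterized by whether records carry a prefix.
struct AvroStreamFormat<const PREFIXED: bool>;

impl<const PREFIXED: bool> StreamFormat for AvroStreamFormat<PREFIXED> {
    const NEEDS_PREFIX: bool = PREFIXED;
}

// Both public names resolve to the same code path; only the constant differs.
type AvroSoeFormat = AvroStreamFormat<true>;
type AvroBinaryFormat = AvroStreamFormat<false>;

/// Writes one record, emitting the prefix only when the format asks for it.
fn write_record<F: StreamFormat, W: Write>(
    out: &mut W,
    prefix: &[u8],
    body: &[u8],
) -> io::Result<()> {
    if F::NEEDS_PREFIX {
        out.write_all(prefix)?;
    }
    out.write_all(body)
}
```

Because `NEEDS_PREFIX` is an associated constant, the branch in `write_record` is resolved per format type at compile time, so the unframed path should carry no runtime cost over the framed one.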
Describe alternatives you've considered
- Keep requiring a prefix and force users to add SOE/Confluent wrappers.
  This breaks compatibility with ecosystems that exchange only Avro bodies (no registry, no framing) and would force users to hand‑craft prefixes that the other side doesn’t expect. It also goes against the desire (tracked in recent issues) to reserve `AvroBinaryFormat` for exactly this unframed scenario.
- Introduce a separate low‑level "datum decoder" type.
  Functionally similar, but adds a duplicate API surface and extra complexity. The existing `Decoder` already handles row‑by‑row streaming with a clear separation between "prefix handling" and "body decode"; a small mode toggle keeps the API cohesive.
Additional context
- Spec references
  - SOE is defined by Avro as 2‑byte magic `0xC3 0x01` + fingerprint + Avro body; this is the framing Arrow supports today for streams. Binary in this issue refers to the Avro body alone (no prefix, no header). OCF remains unchanged.
  - `arrow-avro` docs list SOE and OCF as the two writer formats today and describe framed decoding (SOE/Confluent/Apicurio) for the streaming reader.
- Why this matters in practice
  Popular systems (e.g., Databricks `from_avro`/`to_avro`) work with binary Avro columns and allow supplying schemas manually (no frame needed). Adding Binary mode in `arrow-avro` eliminates glue code and improves interop for stream processors and RPC frameworks that exchange frameless Avro datums with out‑of‑band schema agreements.
- Backward compatibility
  - The change is additive. Existing OCF/SOE read/write codepaths are unaffected.
  - `build_decoder()` continues to error if neither a store nor a reader schema is provided, preserving the documented contract for framed decoding.
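The reader‑side rules can be summarized in a small standalone sketch. Everything here is illustrative: `choose_mode` and `check_unframed` are hypothetical helpers, booleans stand in for the builder's optional fields, and `String` stands in for `ArrowError`. The guard only looks for the SOE magic; Confluent and Apicurio detection is left out.

```rust
#[derive(Debug, PartialEq)]
enum PrefixMode {
    Framed,
    Unframed,
}

/// Builder rule: a schema store keeps today's framed behavior, a reader
/// schema alone selects unframed mode, and neither remains an error.
fn choose_mode(has_schema_store: bool, has_reader_schema: bool) -> Result<PrefixMode, String> {
    match (has_schema_store, has_reader_schema) {
        (true, _) => Ok(PrefixMode::Framed),
        (false, true) => Ok(PrefixMode::Unframed),
        (false, false) => Err("a SchemaStore or a reader schema is required".to_string()),
    }
}

/// Magic bytes that open every SOE message, per the Avro specification.
const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Guard for unframed mode: reject input that looks SOE-framed.
fn check_unframed(data: &[u8]) -> Result<(), String> {
    if data.starts_with(&SOE_MAGIC) {
        return Err("Unexpected framed prefix in unframed (Binary) mode".to_string());
    }
    Ok(())
}
```

One caveat worth weighing during design: a raw Avro body may legitimately begin with the same bytes as a framing prefix (for example, `0xC3 0x01` is also a valid varint‑encoded long), so any such guard is a heuristic and may need to be configurable rather than unconditional.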