[arrow-avro] Add Avro BinaryFormat (Unframed) to reader and writer modules #8701

@jecsand838

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently arrow-avro can write OCF container files and SOE (Single‑Object Encoding) streams, and it can read OCF and framed streams (SOE / Confluent / Apicurio). It cannot write or read unframed Avro binary "datum" payloads (i.e., raw Avro record bodies without an SOE/registry prefix or OCF header). This makes it difficult to interoperate with systems that exchange naked Avro bodies while providing the schema out‑of‑band (configuration, RPC contract, topic metadata, etc.).

Concretely:

  • Writer: there is no Writer format that emits only the Avro body bytes per record. SOE always adds a 2‑byte magic + fingerprint (or ID) prefix, and OCF writes a file header/blocks.
  • Reader: ReaderBuilder::build_decoder requires a SchemaStore and expects a frame prefix; when the prefix is missing it errors with "Missing magic bytes and fingerprint." This prevents decoding raw Avro bodies when the schema is known upfront.
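To make the difference concrete, here is a minimal, self-contained sketch of the two byte layouts. It does not use arrow-avro; the record shape `{id: long, name: string}`, the helper names, and the fingerprint value are all assumptions chosen for illustration. The encoding rules (zigzag varint for `long`, length-prefixed UTF-8 for `string`, and the SOE prefix of `0xC3 0x01` followed by an 8-byte little-endian fingerprint) follow the Avro specification.

```rust
/// Avro `long`: zigzag-encode, then emit as a base-128 varint.
fn encode_long(value: i64, out: &mut Vec<u8>) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80); // low 7 bits + continuation bit
        n >>= 7;
    }
    out.push(n as u8);
}

/// Avro `string`: byte length as a `long`, then the UTF-8 bytes.
fn encode_string(value: &str, out: &mut Vec<u8>) {
    encode_long(value.len() as i64, out);
    out.extend_from_slice(value.as_bytes());
}

/// Unframed ("Binary") datum for a hypothetical record {id: long, name: string}:
/// just the field values in schema order, with no prefix and no header.
fn encode_body(id: i64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    encode_long(id, &mut out);
    encode_string(name, &mut out);
    out
}

/// SOE framing around the same body: 2-byte magic, 8-byte little-endian
/// schema fingerprint, then the body. The fingerprint is passed in here;
/// computing CRC-64-AVRO is out of scope for this sketch.
fn encode_soe(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0xC3, 0x01];
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}
```

With `id = 1` and `name = "a"`, `encode_body` yields the three bytes `02 02 61`, while `encode_soe` yields those same three bytes preceded by the 10-byte prefix. A peer that expects the unframed form has no way to skip that prefix unless it is told to.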

Describe the solution you'd like

Add first‑class Binary (unframed) format support to both the writer and the reader:

  1. Writer: new unframed stream format

    • In arrow-avro/src/writer/format.rs:
      • Implement a const‑generic AvroStreamFormat<const PREFIXED: bool> templated from the current AvroSoeFormat implementation
      • Alias type AvroSoeFormat = AvroStreamFormat<true> and type AvroBinaryFormat = AvroStreamFormat<false>, so that AvroBinaryFormat provides the new unframed format without code duplication.
    • In arrow-avro/src/writer/mod.rs, add a public alias AvroRawStreamWriter as a convenience, mirroring AvroStreamWriter.

    Rationale: the existing AvroFormat abstraction already distinguishes framed vs unframed by NEEDS_PREFIX and sync_marker(); the new format simply sets NEEDS_PREFIX = false and writes nothing at stream start, yielding only Avro bodies from Writer::write_stream.
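The const-generic approach could look roughly like the sketch below. The `StreamFormat` trait and the `write_record` helper are hypothetical stand-ins, not the crate's real `AvroFormat` trait or writer, whose signatures are richer. Only the names `AvroStreamFormat`, `AvroSoeFormat`, `AvroBinaryFormat`, `NEEDS_PREFIX`, and `sync_marker` come from this proposal.

```rust
use std::io::{self, Write};

/// Hypothetical, simplified stand-in for the crate's `AvroFormat` trait.
pub trait StreamFormat {
    /// Whether each record must be preceded by a framing prefix.
    const NEEDS_PREFIX: bool;
    /// OCF-style sync marker; stream formats have none.
    fn sync_marker(&self) -> Option<&[u8; 16]>;
}

/// One implementation, parameterized by whether records are prefixed.
#[derive(Debug, Default)]
pub struct AvroStreamFormat<const PREFIXED: bool>;

impl<const PREFIXED: bool> StreamFormat for AvroStreamFormat<PREFIXED> {
    const NEEDS_PREFIX: bool = PREFIXED;

    fn sync_marker(&self) -> Option<&[u8; 16]> {
        None
    }
}

/// Framed SOE stream (existing behavior).
pub type AvroSoeFormat = AvroStreamFormat<true>;
/// Unframed stream: Avro bodies only (the new format).
pub type AvroBinaryFormat = AvroStreamFormat<false>;

/// Hypothetical per-record write step: the prefix is emitted only when the
/// format asks for it, so the unframed format yields the body bytes alone.
pub fn write_record<F: StreamFormat, W: Write>(
    _format: &F,
    prefix: &[u8],
    body: &[u8],
    out: &mut W,
) -> io::Result<()> {
    if F::NEEDS_PREFIX {
        out.write_all(prefix)?;
    }
    out.write_all(body)
}
```

Because `NEEDS_PREFIX` is an associated constant, the branch in `write_record` is resolved at compile time for each alias, and both formats share a single implementation.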

  2. Reader: opt‑in unframed decoding via ReaderBuilder::with_reader_schema

    • Enable ReaderBuilder::build_decoder to construct a Decoder for unframed raw binary when a reader schema is provided without a SchemaStore.
    • In arrow-avro/src/reader/mod.rs:
      • Builder rule: If writer_schema_store is None and reader_schema is Some, build_decoder() creates a decoder pre‑configured for unframed inputs. The reader_schema is assumed to be identical to the writer schema; schema resolution is not supported in this mode.
      • Decoder state: Add a small toggle (e.g., unframed: bool or enum PrefixMode { Framed, Unframed }). When unframed == true, decode() must skip handle_prefix and immediately try to decode exactly 1 row body via active_decoder.decode(&data[..], 1), respecting batch_size, and return consumed bytes accordingly. The current hard error path "Missing magic bytes and fingerprint" should not trigger in this mode.
    • Safety / behavior:
      • If the byte stream does start with a known framing prefix (SOE/Confluent/Apicurio), return a clear ArrowError::AvroError("Unexpected framed prefix in unframed (Binary) mode") to avoid ambiguous behavior.
      • If neither SchemaStore nor reader_schema is provided, keep returning InvalidArgumentError (existing documented behavior) to guide users.
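The builder rule and the mode toggle above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `select_mode` and `decode_step` are hypothetical names, booleans stand in for `Option<SchemaStore>` and `Option<AvroSchema>`, errors are plain `String`s rather than `ArrowError`, and the body decoder is passed in as a closure. The framed branch handles only SOE, and it treats a short or missing prefix as an error, which a real streaming decoder might instead handle by waiting for more bytes.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrefixMode {
    Framed,
    Unframed,
}

/// Builder rule: a schema store selects framed decoding; a reader schema
/// alone selects unframed decoding; neither is an invalid argument.
pub fn select_mode(has_store: bool, has_reader_schema: bool) -> Result<PrefixMode, String> {
    match (has_store, has_reader_schema) {
        (true, _) => Ok(PrefixMode::Framed),
        (false, true) => Ok(PrefixMode::Unframed),
        (false, false) => {
            Err("Invalid argument: a SchemaStore or a reader schema is required".into())
        }
    }
}

/// SOE magic bytes defined by the Avro specification.
const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];
/// SOE prefix length: 2 magic bytes + 8 fingerprint bytes.
const SOE_PREFIX_LEN: usize = 10;

/// One decode step. `decode_body` decodes a single row body from the start
/// of its input and returns how many bytes it consumed.
pub fn decode_step(
    mode: PrefixMode,
    data: &[u8],
    decode_body: impl FnOnce(&[u8]) -> Result<usize, String>,
) -> Result<usize, String> {
    match mode {
        PrefixMode::Unframed => {
            // Guard proposed in this issue; only the SOE magic is checked here.
            if data.starts_with(&SOE_MAGIC) {
                return Err("Unexpected framed prefix in unframed (Binary) mode".into());
            }
            // No prefix handling at all: go straight to the body.
            decode_body(data)
        }
        PrefixMode::Framed => {
            if data.len() < SOE_PREFIX_LEN || !data.starts_with(&SOE_MAGIC) {
                return Err("Missing magic bytes and fingerprint".into());
            }
            // Fingerprint lookup is omitted; skip the prefix and decode.
            Ok(SOE_PREFIX_LEN + decode_body(&data[SOE_PREFIX_LEN..])?)
        }
    }
}
```

One design point worth settling: the sketch checks only the SOE magic because the Confluent wire format begins with a single `0x00` byte, which is also a valid first byte of an Avro body (for example, a `long` field equal to 0). Even `0xC3 0x01` can begin a legitimate body, so the prefix guard is a heuristic and may need to be optional or limited to specific schemas.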

Describe alternatives you've considered

  • Keep requiring a prefix and force users to add SOE/Confluent wrappers.
    This breaks compatibility with ecosystems that exchange only Avro bodies (no registry, no framing) and would force users to hand‑craft prefixes that the other side doesn’t expect. It also goes against the desire (tracked in recent issues) to reserve AvroBinaryFormat for exactly this unframed scenario.
  • Introduce a separate low‑level "datum decoder" type.
    Functionally similar, but adds a duplicate API surface and extra complexity. The existing Decoder already handles row‑by‑row streaming with a clear separation between "prefix handling" and "body decode"; a small mode toggle keeps the API cohesive.

Additional context

  • Spec references
    • SOE is defined by Avro as 2‑byte magic 0xC3 0x01 + fingerprint + Avro body; this is the framing Arrow supports today for streams. Binary in this issue refers to the Avro body alone (no prefix, no header). OCF remains unchanged.
    • arrow-avro docs list SOE and OCF as the two writer formats today and describe framed decoding (SOE/Confluent/Apicurio) for the streaming reader.
  • Why this matters in practice
    Popular systems (e.g., Databricks from_avro/to_avro) work with binary Avro columns and allow supplying schemas manually (no frame needed). Adding Binary mode in arrow-avro eliminates glue code and improves interop for stream processors and RPC frameworks that exchange frameless Avro datums with out‑of‑band schema agreements.
  • Backward compatibility
    • The change is additive. Existing OCF/SOE read/write codepaths are unaffected.
    • build_decoder() continues to error if neither a store nor a reader schema is provided, preserving the documented contract for framed decoding.

Metadata

    Labels: enhancement (any new improvement worthy of an entry in the changelog)