[arrow-avro] Add Avro BinaryFormat (Unframed) to reader and writer modules

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently `arrow-avro` can write **OCF** container files and **SOE** (Single‑Object Encoding) streams, and it can read **OCF** and framed streams (SOE / Confluent / Apicurio). It **cannot** write or read *unframed* Avro **binary "datum"** payloads (i.e., raw Avro record bodies without an SOE/registry prefix or OCF header). This makes it difficult to interoperate with systems that exchange naked Avro bodies while providing the schema out‑of‑band (configuration, RPC contract, topic metadata, etc.).

Concretely:
* **Writer**: there is no `Writer` format that emits *only* the Avro body bytes per record. SOE always adds a 2‑byte magic + fingerprint (or ID) prefix, and OCF writes a file header/blocks.
* **Reader**: `ReaderBuilder::build_decoder` **requires** a `SchemaStore` and expects a frame prefix; when the prefix is missing it errors with "Missing magic bytes and fingerprint." This prevents decoding raw Avro bodies when the schema is known upfront.
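
To make the difference concrete, here is a minimal, dependency-free sketch of the two byte layouts. It does not use `arrow-avro`; the record shape `{ id: long, name: string }`, the helper names, and the caller-supplied fingerprint are all assumptions made for illustration.

```rust
/// Avro `long`: zigzag-encode, then emit 7 bits per byte, low bits first.
fn write_long(value: i64, out: &mut Vec<u8>) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80); // continuation bit set
        n >>= 7;
    }
    out.push(n as u8);
}

/// Avro `string`: byte length as a `long`, followed by the UTF-8 bytes.
fn write_string(value: &str, out: &mut Vec<u8>) {
    write_long(value.len() as i64, out);
    out.extend_from_slice(value.as_bytes());
}

/// Unframed ("binary") body of the hypothetical record `{ id: long, name: string }`.
fn encode_body(id: i64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(id, &mut out);
    write_string(name, &mut out);
    out
}

/// SOE framing of the same body: 2-byte magic, 8-byte fingerprint, then the body.
/// The fingerprint is passed in by the caller; computing it is out of scope here.
fn frame_soe(fingerprint: [u8; 8], body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint);
    out.extend_from_slice(body);
    out
}
```

A peer that expects naked bodies wants only the output of `encode_body`; the ten extra bytes added by `frame_soe` are what the current writer cannot omit and what the current decoder insists on finding.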

**Describe the solution you'd like**

Add first‑class **Binary (unframed) format** support to both the writer and the reader:

1. **Writer**: new unframed stream format
    * In `arrow-avro/src/writer/format.rs`:
        * Implement a const‑generic `AvroStreamFormat<const PREFIXED: bool>` templated from the current `AvroSoeFormat` implementation
        * Alias `type AvroSoeFormat = AvroStreamFormat<true>` and `type AvroBinaryFormat = AvroStreamFormat<false>`. The second alias provides the new unframed format without code duplication.
    * In `arrow-avro/src/writer/mod.rs`, add a public alias called `AvroRawStreamWriter` as a convenience, mirroring `AvroStreamWriter`.

    > Rationale: the existing `AvroFormat` abstraction already distinguishes framed vs unframed by `NEEDS_PREFIX` and `sync_marker()`; the new format simply sets `NEEDS_PREFIX = false` and writes nothing at stream start, yielding only Avro bodies from `Writer::write_stream`.

2. **Reader**: opt‑in unframed decoding via `ReaderBuilder::with_reader_schema`
    * Enable `ReaderBuilder::build_decoder` to construct a `Decoder` for **unframed raw binary** when a reader schema is provided **without** a `SchemaStore`.
    * In `arrow-avro/src/reader/mod.rs`:
      * **Builder rule**: If `writer_schema_store` is `None` **and** `reader_schema` is `Some`, `build_decoder()` creates a decoder pre‑configured for **unframed** inputs. The `reader_schema` is assumed to be **identical** to the writer schema and *no schema resolution* is supported.
      * **Decoder state**: Add a small toggle (e.g., `unframed: bool` or `enum PrefixMode { Framed, Unframed }`). When `unframed == true`, `decode()` must **skip** `handle_prefix` and immediately try to decode exactly one row body via `active_decoder.decode(&data[..], 1)`, respecting `batch_size`, and return consumed bytes accordingly. The current hard error path "Missing magic bytes and fingerprint" should not trigger in this mode.
    * **Safety / behavior**:
        - If the byte stream *does* start with a known framing prefix (SOE/Confluent/Apicurio), return a clear `ArrowError::AvroError("Unexpected framed prefix in unframed (Binary) mode")` to avoid ambiguous behavior.
        - If neither `SchemaStore` **nor** `reader_schema` is provided, keep returning `InvalidArgumentError` (existing documented behavior) to guide users.
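
The writer half of the proposal (item 1) could be sketched roughly as follows. This is a standalone illustration of the const-generic idea, not the actual `AvroFormat` trait: `StreamFormat`, `write_record`, and the `fingerprint` field are hypothetical names, and only `NEEDS_PREFIX` is borrowed from the rationale above.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a stream format; `PREFIXED` selects framing at compile time.
struct StreamFormat<const PREFIXED: bool> {
    /// Only consulted when `PREFIXED` is true.
    fingerprint: [u8; 8],
}

/// Framed variant (SOE-like): magic + fingerprint before every body.
type SoeFormat = StreamFormat<true>;
/// Unframed variant: body bytes only.
type BinaryFormat = StreamFormat<false>;

impl<const PREFIXED: bool> StreamFormat<PREFIXED> {
    /// Mirrors the idea of a `NEEDS_PREFIX` associated constant.
    const NEEDS_PREFIX: bool = PREFIXED;

    /// Writes one already-encoded record body, adding the prefix only if required.
    fn write_record<W: Write>(&self, out: &mut W, body: &[u8]) -> io::Result<()> {
        if Self::NEEDS_PREFIX {
            out.write_all(&[0xC3, 0x01])?;
            out.write_all(&self.fingerprint)?;
        }
        out.write_all(body)
    }
}
```

Because the branch depends on a constant, both aliases share one implementation, which is the "no code duplication" property the proposal is after.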

**Describe alternatives you've considered**

* **Keep requiring a prefix and force users to add SOE/Confluent wrappers.**
  This breaks compatibility with ecosystems that exchange *only* Avro bodies (no registry, no framing) and would force users to hand‑craft prefixes that the other side doesn’t expect. It also goes against the desire (tracked in recent issues) to reserve `AvroBinaryFormat` for exactly this unframed scenario.
* **Introduce a separate low‑level "datum decoder" type.**
  Functionally similar, but it adds a duplicate API surface and extra complexity. The existing `Decoder` already handles row‑by‑row streaming with a clear separation between "prefix handling" and "body decode"; a small mode toggle keeps the API cohesive.

**Additional context**

* **Spec references**
  * **SOE** is defined by [Avro](https://avro.apache.org/docs/1.11.1/specification/) as a 2‑byte magic `0xC3 0x01` + an 8‑byte little‑endian CRC‑64‑AVRO fingerprint of the writer schema + the Avro body; this is the framing Arrow supports today for streams. **Binary** in this issue refers to the Avro body alone (no prefix, no header). OCF remains unchanged.
  * `arrow-avro` docs list SOE and OCF as the two writer formats today and describe framed decoding (SOE/Confluent/Apicurio) for the streaming reader.
* **Why this matters in practice**
  Popular systems (e.g., Databricks `from_avro`/`to_avro`) work with *binary Avro* columns and [allow supplying schemas manually](https://docs.databricks.com/aws/en/structured-streaming/avro-dataframe#manually-specified-schema-example) (no frame needed). Adding Binary mode in `arrow-avro` eliminates glue code and improves interop for stream processors and RPC frameworks that exchange frameless Avro datums with out‑of‑band schema agreements.
* **Backward compatibility**
  * The change is additive. Existing OCF/SOE read/write codepaths are unaffected.
  * `build_decoder()` continues to error if neither a store nor a reader schema is provided, preserving the documented contract for framed decoding.
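
The reader-side contract described in this issue can be summarized in a small standalone sketch. Nothing here is the real `ReaderBuilder` API: `select_mode`, `BuildError`, and `starts_with_soe_magic` are hypothetical helpers, and the two booleans stand in for "a `SchemaStore` was supplied" and "a reader schema was supplied".

```rust
/// Decoder mode, named after the toggle suggested in the proposal.
#[derive(Debug, PartialEq)]
enum PrefixMode {
    Framed,
    Unframed,
}

/// Hypothetical stand-in for the invalid-argument error returned by the builder.
#[derive(Debug, PartialEq)]
enum BuildError {
    MissingSchema,
}

/// Builder rule: a store keeps today's framed behavior; a reader schema alone
/// opts in to unframed decoding; supplying neither stays an error.
fn select_mode(has_schema_store: bool, has_reader_schema: bool) -> Result<PrefixMode, BuildError> {
    match (has_schema_store, has_reader_schema) {
        (true, _) => Ok(PrefixMode::Framed),
        (false, true) => Ok(PrefixMode::Unframed),
        (false, false) => Err(BuildError::MissingSchema),
    }
}

/// Checks for the 2-byte SOE magic only. This is a heuristic: a valid unframed
/// body can begin with the same bytes (the long -98 encodes as 0xC3 0x01).
fn starts_with_soe_magic(data: &[u8]) -> bool {
    data.starts_with(&[0xC3, 0x01])
}
```

The first match arm is what keeps the change additive: any caller that already passes a store gets exactly the framed path it has today. The heuristic nature of the magic check is worth keeping in mind when deciding how strictly to apply the "unexpected framed prefix" error.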

[arrow-avro] Add Avro BinaryFormat (Unframed) to reader and writer modules #8701