Skip to content

Server-side batching (SSB)#537

Open
bevzzz wants to merge 100 commits intomainfrom
feat/ssb
Open

Server-side batching (SSB)#537
bevzzz wants to merge 100 commits intomainfrom
feat/ssb

Conversation

@bevzzz
Copy link
Collaborator

@bevzzz bevzzz commented Feb 17, 2026

This PR adds BatchContext which implements Server-Side Batching protocol.

List<WeaviateObject<Map<String, Object>>> objects = List.of(...);
List<ObjectReference> references = List.of();
List<CompletableFuture<TaskHandle>> submitted = new ArrayList<>();

try (BatchContext<?> batch = collection.batch.start()) {
  
  // Optional: Define a retry policy that retries each task up to three times.
  Function<TaskHandle, CompletableFuture<TaskHandle>> retryMax3Times = task -> task
    .result().thenCompose(result -> {
      if (result.error().isEmpty() {
        return result;
      }
      if (task.timesRetried == 3) {
        throw new RuntimeException(result.error().get());
      }
      return retryMax3Times.apply(batch.retry(task));
    });

  // Submit all objects to the batch
  objects.map(batch::add).map(retryMax3Times).forEach(submitted::add);
  
  // Submit all references to the batch
  references.map(batch::add).map(retryMax3Times).forEach(submitted::add;
}

// Await until all tasks complete
CompletableFuture.allOf(submitted).get();

Description

See BatchContext Javadoc for a detailed architecture discussion.

Testing

BatchContext does all the heavy lifting: state management, thread coordination, etc.
To validate different scenario combinations (server shutdown, backoff, batch overflow, stream hangup, OOM) BatchContextTest uses a simple harness that lets it emit server-side events and await client-side messages.

BatchITest integration test helps verify that the client produces valid messages and adheres to to SSB protocol.

Miscellaneous

  • ConsistencyLevel and Tenant are not Optional type in collection handle defaults to make their "optionality" more explicit.
  • Merged VectorizersITest with SearchITest as both share the same setup, which takes CI time to spin up twice
  • Deleted the now-redundant file_replication.proto and the associated generated protobuf stubs.
  • Separated unit tests into their own CI job. There's no reason to run them for each Weaviate version, as they never hit the server. Plus it lets us catch smaller errors before starting the heavy work of spinning up containers.

Extended GrpcTransport interface to implements StreamFactory.
Added documentation to batch's primitives.
Renamed Message -> Batch, StreamMessage -> Message, MessageProducer -> Messeger, and EventHandler -> Eventer.
Extracted message size calculation into MessageSizeUtil.
Added sketch implementation for the Send routine.
Still raw and riddled with comments, but it's a good start.
Copy link

@orca-security-eu orca-security-eu bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orca Security Scan Summary

Status Check Issues by priority
Passed Passed Infrastructure as Code high 0   medium 0   low 0   info 0 View in Orca
Passed Passed SAST high 0   medium 0   low 0   info 0 View in Orca
Passed Passed Secrets high 0   medium 0   low 0   info 0 View in Orca
Passed Passed Vulnerabilities high 0   medium 0   low 0   info 0 View in Orca

LinkedHashMap::reversed was first introduced in JDK 21.
The ListIterator approach requires allocating a list,
but yields a much simpler code in return.
BatchContext can deal with happy path, i.e. no oom, no shutdowns, etc.
Calculating the size precisely is possible, but requires
writing code that understands the internal structure of
the gRPC messages. A much simpler approach is to estimate
the size approximately and keep a tiny (compared to the
total message size) margin for error.
bevzzz added 9 commits March 2, 2026 12:40
- change condition for reconnecting -- send will always wait for recv
to complete, so checking send.isDone is meaningless. Instead, we rely
on the context's state and the data we've observer from the server.
- Emit OOM's 'shutdown sequence' via the Recv handle, not via BatchContext::onEvent.
This ensures the recv.done is completed correctly on EOF to unblock send's exit.
- add a protoc guard to Reconnecting -- an EOF event should not arrive in this
state.
- fixed BatchContextTest so that stream is not automatically closed when
client closes it's half
@bevzzz bevzzz requested a review from a team as a code owner March 2, 2026 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant