
[Repository Request] New repository: neural-sparse-cpp, a C++ library for neural sparse search #462

@chishui


Are you requesting a new GitHub Repository within opensearch-project GitHub Organization?

Yes

GitHub Repository Proposal

opensearch-project/neural-search#1778

GitHub Repository Additional Information

  1. What is the new GitHub repository name?

neural-sparse-cpp

  1. Project description and community value?

neural-sparse-cpp is a C++ library providing high-performance sparse vector indexing and approximate nearest neighbor (ANN) search. It offers a pluggable architecture supporting multiple sparse ANN algorithms, with SEISMIC as the first implementation. The library includes SIMD-optimized distance computation (AVX2, AVX-512, NEON, SVE), scalar quantization (8-bit/16-bit) for memory reduction, serialization/deserialization for index persistence, and Python bindings for standalone usage. It provides community value as a reusable, open-source sparse vector engine — analogous to what Faiss provides for dense vectors — that can be integrated into search systems or used independently for sparse retrieval research and applications.
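As a rough illustration of two building blocks named above (sparse distance computation and 8-bit scalar quantization), here is a minimal scalar sketch. It is not taken from the proposed library: `SparseVector`, `sparse_dot`, `quantize_u8`, and `dequantize_u8` are hypothetical names, the dot product is a plain sorted-index merge rather than a SIMD kernel, and the quantizer assumes non-negative weights bounded by a caller-supplied maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only; not the library's actual API.
struct SparseVector {
    std::vector<std::uint32_t> indices;  // term ids, strictly ascending
    std::vector<float> values;           // weight for each term id
};

// Dot product of two sparse vectors by merging their sorted index lists.
// Only term ids present in both vectors contribute to the sum.
inline float sparse_dot(const SparseVector& a, const SparseVector& b) {
    assert(a.indices.size() == a.values.size());
    assert(b.indices.size() == b.values.size());
    float sum = 0.0f;
    std::size_t i = 0, j = 0;
    while (i < a.indices.size() && j < b.indices.size()) {
        if (a.indices[i] == b.indices[j]) {
            sum += a.values[i] * b.values[j];
            ++i;
            ++j;
        } else if (a.indices[i] < b.indices[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    return sum;
}

// 8-bit scalar quantization: weights in [0, max_value] map to codes 0..255.
struct QuantizedValues {
    std::vector<std::uint8_t> codes;
    float scale;  // code * scale approximates the original weight
};

inline QuantizedValues quantize_u8(const std::vector<float>& values,
                                   float max_value) {
    assert(max_value > 0.0f);
    QuantizedValues q;
    q.scale = max_value / 255.0f;
    q.codes.reserve(values.size());
    for (float v : values) {
        // Clamp out-of-range weights (assumption: weights are non-negative).
        float clamped = std::min(std::max(v, 0.0f), max_value);
        q.codes.push_back(
            static_cast<std::uint8_t>(std::lround(clamped / q.scale)));
    }
    return q;
}

inline float dequantize_u8(const QuantizedValues& q, std::size_t i) {
    return static_cast<float>(q.codes[i]) * q.scale;
}
```

Storing one byte per weight instead of a 4-byte float is where the memory reduction comes from; the cost is a rounding error of at most one quantization step per weight.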

  1. What user problem are you trying to solve with this new repository?

The OpenSearch Neural Search plugin currently runs sparse ANN search (SEISMIC) in pure Java on the JVM. This creates GC pressure from large in-memory index structures, cannot fully exploit platform-specific SIMD instructions for the dot product computations that dominate search time, and forces all index data onto the JVM heap, where it competes with other OpenSearch operations. This library solves these problems by providing a native C++ implementation that can be called via JNI, moving compute-intensive operations and index data off-heap for better performance, lower memory overhead, and more predictable latency.
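One common way to keep index data off the JVM heap is to expose the native index through an opaque 64-bit handle, which a JNI layer could hold as a Java `long`. The sketch below illustrates that boundary only, under stated assumptions: the `ns_index_*` functions and `NativeIndex` are hypothetical names, not the library's API, the JNI glue itself is omitted, and scoring is an exact dot product against one stored document rather than SEISMIC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

namespace {
// Hypothetical stand-in for a native index. Documents are stored as
// term id -> weight maps in native memory, outside any managed heap.
struct NativeIndex {
    std::unordered_map<std::uint32_t,
                       std::unordered_map<std::uint32_t, float>> docs;
};

NativeIndex* from_handle(std::int64_t handle) {
    assert(handle != 0);
    return reinterpret_cast<NativeIndex*>(handle);
}
}  // namespace

extern "C" {

// Allocates the index natively and returns its address as an opaque handle.
std::int64_t ns_index_create() {
    return reinterpret_cast<std::int64_t>(new NativeIndex());
}

// Copies one sparse document (n index/value pairs) into the native index.
void ns_index_add(std::int64_t handle, std::uint32_t doc_id,
                  const std::uint32_t* indices, const float* values,
                  std::size_t n) {
    auto& doc = from_handle(handle)->docs[doc_id];
    for (std::size_t i = 0; i < n; ++i) {
        doc[indices[i]] = values[i];
    }
}

// Exact dot product between a sparse query and one stored document.
// Returns 0 when the document id is unknown.
float ns_index_score(std::int64_t handle, std::uint32_t doc_id,
                     const std::uint32_t* indices, const float* values,
                     std::size_t n) {
    const auto& docs = from_handle(handle)->docs;
    auto it = docs.find(doc_id);
    if (it == docs.end()) {
        return 0.0f;
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        auto term = it->second.find(indices[i]);
        if (term != it->second.end()) {
            sum += term->second * values[i];
        }
    }
    return sum;
}

// Releases the native memory; the caller must not reuse the handle.
void ns_index_free(std::int64_t handle) {
    delete from_handle(handle);
}

}  // extern "C"
```

With this shape, the managed side only ever stores a single integer per index, so the garbage collector never scans or moves the index data; the trade-off is that the caller becomes responsible for calling the free function exactly once.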

  1. Why do we create a new repo at this time?

The SEISMIC sparse ANN algorithm has been integrated into OpenSearch and is available to users. Now is the right time to provide a native engine alternative because: (1) the Java implementation is functional and validates the algorithm's value, but users need better performance and memory efficiency at scale; (2) the k-NN plugin has proven that calling a native library via JNI works well for OpenSearch; (3) the C++ library is a standalone component with its own build system, test suite, and release lifecycle, so it does not belong inside the Java-based neural-search plugin repo.

  1. Are there any existing projects that are similar to your proposal?

Faiss (by Meta) is the closest analog but focuses exclusively on dense vector search. There is no widely-adopted open-source C++ library specifically designed for sparse vector ANN search. The SEISMIC reference implementation exists as a research artifact but is not designed as a reusable library with a pluggable architecture, serialization, quantization, or production-grade build/test infrastructure.

  1. Should this project be in OpenSearch Core/OpenSearch Dashboards Core? If no, why not? Or should this project be combined into an existing repo in the opensearch-project GitHub Org?

No. This is a standalone C++ library with its own CMake build system, C++ test suite, and Python bindings. It has a fundamentally different language, toolchain, and release lifecycle from the Java-based neural-search plugin or OpenSearch core. Combining it into an existing Java repo would create unnecessary build complexity. The k-NN plugin follows the same pattern — Faiss and nmslib are separate native libraries linked via JNI, not embedded in the k-NN Java repo.

  1. Is this project an OpenSearch/OpenSearch Dashboards plugin to be included as part of the OpenSearch release?

No. This is a native C++ library that will be consumed as a dependency by the Neural Search plugin (via JNI), similar to how the k-NN plugin consumes Faiss. It is not itself a plugin. It will be packaged as a shared library (.so/.dylib) and bundled with the Neural Search plugin's release artifacts.

GitHub Repository Owners

  1. Who will be supporting this repo going forward?
  • @chishui: maintainer of the neural-search plugin and a main contributor to the SEISMIC algorithm in neural-search
  • @yuye-aws: maintainer of the neural-search plugin and a main contributor to the SEISMIC algorithm in neural-search
  • @zirui-song-18: a main contributor to the SEISMIC algorithm in neural-search
  • @model-collapse: maintainer of the neural-search plugin
  1. What is your plan (including staffing) to be responsive to the community? (At a minimum, this should include reviewing PRs, responding to issues, and answering forum questions.)
  • Actively monitoring GitHub issues and PRs in this repository.
  • Reviewing contributions and providing timely feedback.
  • Triaging bugs and enhancement requests regularly.
  • Keeping documentation up to date to help contributors and users understand how to use and contribute to the library.
  1. Initial Maintainers List (max 3 users, provide GitHub aliases):

GitHub Repository Source Code / License / Libraries

  1. Please provide the URL to the source code.
    https://github.com/chishui/amaiss
  2. What is the license for the source code?
    Apache License 2.0
  3. Does the source code include any third-party code that is not compliant with the Apache License 2.0?
    No

What is the publication target(s)?

You can choose multiple targets from the list.

No response

Notes (DO NOT CHANGE)

Next Steps:

  • If this is about creating a new GitHub Repository

    • Build Interest Group (BIG) and its members will review your proposal and provide feedback
      • Review of Proposal, asking questions, adding comments
      • If there is any concern regarding the naming / IP, additional IP review will be requested
      • Involve Subject Matter Experts from other repositories on the proposed topics
      • Ensure new repositories align with the foundation’s charter
      • Review the provided source code if any
      • Send final feedback and recommendations to the Technical Steering Committee
    • Technical Steering Committee (TSC) will have a vote based on BIG feedback, and reply back the vote as a comment in this issue by a TSC member
    • At least three positive (+1) TSC members' votes are necessary, and no vetoes (-1) after a one week period, then Admin Team will open a repo creation ticket with Linux Foundation
    • Linux Foundation verify the votes and create repo
    • Admin Team setup automations on repo settings, secrets, scanning, add initial maintainers, and more
    • Repository delivered to the original requester
  • If you already have a GitHub repo and just want to add new publication target(s)

    • Admin Team will review your request and follow up

Track the progress of your request here: Engineering Effectiveness Board (view).
Member of @opensearch-project/admin will take a look at the request soon.
Thanks!

Projects

Status

✅ Done
