
[Feature]: Decoupling Ingestion from Indexing & Summarization #350

@Jay-ju

Description


PR Discussion: Decoupling Ingestion from Indexing & Summarization

1. Background & Motivation

Currently, the add_resource interface operates as a tightly coupled "black box" process. When a user uploads a file, the system forces sequential execution of Parse -> Ingest -> Vectorize -> Summarize.

While convenient for small-scale operations, this design has significant limitations in the following scenarios:

  • ETL & Big Data Scenarios : Users may want to rapidly ingest large volumes of raw data via engines like Spark or Daft first (Ingestion phase), deferring the computationally expensive vectorization and LLM summarization to a later time or a separate compute cluster.
  • Cost Control : Semantic summarization consumes LLM tokens. Users should have the option to store and index files without incurring the cost of generating summaries.
  • Architectural Flexibility : To enable OpenViking to function as a library within distributed data pipelines, we need to decouple "writing data" from "processing data".

2. Design Proposal

The core of this refactor is to separate Ingestion and Post-processing into distinct lifecycle stages, controlled via a unified orchestration layer (ResourceProcessor).

2.1 Core Decoupling

We have decomposed the original ResourceProcessor into three independent components:

  1. TreeBuilder (Ingestion): Handles file system operations. Moves parsed temporary files to their target location in VikingFS (L0/L1/L2 structure). This is the foundational step.
  2. IndexBuilder (Vectorization): Scans specified resource directories, extracts text chunks, and builds the vector index.
  3. Summarizer (Semantic Analysis): Invokes the VLM/LLM to generate .abstract.md and .overview.md for resources.
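The three-way split can be sketched roughly as follows. This is an illustrative toy, not the actual OpenViking code: the class names come from the proposal, but the method names, signatures, and return values are assumptions.

```python
import asyncio

class TreeBuilder:
    """Ingestion: move parsed temporary files into the VikingFS layout."""
    async def build(self, resource_uri: str) -> str:
        # The real implementation performs the L0/L1/L2 file-system moves.
        return f"stored {resource_uri}"

class IndexBuilder:
    """Vectorization: chunk text under the resource and build the vector index."""
    async def build(self, resource_uri: str) -> str:
        return f"indexed {resource_uri}"

class Summarizer:
    """Semantic analysis: ask the VLM/LLM to produce .abstract.md / .overview.md."""
    async def summarize(self, resource_uri: str) -> str:
        return f"summarized {resource_uri}"

async def demo(uri: str) -> list[str]:
    # The full pipeline, run stage by stage rather than as one black box.
    return [
        await TreeBuilder().build(uri),
        await IndexBuilder().build(uri),
        await Summarizer().summarize(uri),
    ]

print(asyncio.run(demo("viking://resources/doc_v1")))
```

Because each component takes only a resource URI, any stage can be skipped, retried, or re-run later without re-running the others.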

2.2 Interface Changes

A. Enhanced Control in add_resource. Backward compatibility is maintained while adding fine-grained control parameters:

await client.add_resource(
    path="doc.pdf",
    build_index=True,  # Default: True. Set to False to store without indexing.
    summarize=False    # Default: False. Controls whether to consume tokens for summary generation.
)

B. New Standalone Trigger Interfaces. Atomic operation interfaces are exposed, allowing users to trigger processing manually at any time after ingestion:

# Manually trigger index building
await client.build_index(resource_uris=["viking://resources/doc_v1"])

# Manually trigger semantic summarization
await client.summarize(resource_uris=["viking://resources/doc_v1"])
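Putting A and B together, the "ingest now, process later" flow looks like this. The stub client below only mimics the flag semantics described above so the example is runnable; it is not the real client implementation.

```python
import asyncio

class StubClient:
    """Toy stand-in for the real client, mimicking the control flags above."""
    def __init__(self) -> None:
        self.indexed: set[str] = set()
        self.summarized: set[str] = set()

    async def add_resource(self, path: str, build_index: bool = True,
                           summarize: bool = False) -> str:
        uri = f"viking://resources/{path}"  # hypothetical URI scheme from the examples
        if build_index:
            self.indexed.add(uri)
        if summarize:
            self.summarized.add(uri)
        return uri

    async def build_index(self, resource_uris: list[str]) -> None:
        self.indexed.update(resource_uris)

    async def summarize(self, resource_uris: list[str]) -> None:
        self.summarized.update(resource_uris)

async def demo() -> StubClient:
    client = StubClient()
    # Phase 1: ingest only -- no index, no summary, no token spend.
    uri = await client.add_resource("doc.pdf", build_index=False)
    assert uri not in client.indexed
    # Phase 2: trigger post-processing manually, possibly much later.
    await client.build_index(resource_uris=[uri])
    await client.summarize(resource_uris=[uri])
    return client

client = asyncio.run(demo())
```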

3. Implementation Details

  • ResourceProcessor Refactor:
    • Removed hardcoded execution flows.
    • The process_resource method now acts as an orchestrator. It first calls TreeBuilder to finalize physical file storage (mandatory), then conditionally calls IndexBuilder and Summarizer based on the build_index and summarize flags.
  • Async Compatibility:
    • To support future distributed scheduling, IndexBuilder and Summarizer are designed to be stateless, depending only on resource_uri. This means tasks can be serialized and distributed to Ray or Spark workers (though the current implementation runs them in local async queues).
  • Test Coverage:
    • Added tests/client/test_index_control.py, covering combinations such as "store only (no index)", "store without summary", and "manual index triggering" to verify the control-flag logic.
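The orchestration described in the first bullet reduces to a small amount of control flow. A minimal sketch (stage names and the function body are illustrative, not the actual process_resource implementation):

```python
import asyncio

async def process_resource(resource_uri: str, *, build_index: bool = True,
                           summarize: bool = False) -> list[str]:
    """Orchestrator sketch: storage is mandatory, post-processing is flag-gated."""
    stages = ["tree_build"]            # TreeBuilder always runs first
    if build_index:
        stages.append("index_build")   # IndexBuilder only when requested
    if summarize:
        stages.append("summarize")     # Summarizer only when requested
    return stages

print(asyncio.run(process_resource("viking://resources/doc_v1")))
```

The defaults reproduce the old behavior minus summarization (build_index=True, summarize=False), which is how backward compatibility is preserved.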

4. Benefits

  1. Flexibility: Enables an "Ingest Now, Index Later" pattern.
  2. Observability: With indexing and summarization separated, latency and errors can be monitored per stage.
  3. Scalability: Paves the way for integration with distributed compute engines (e.g., Daft/Spark). External engines can call add_resource(build_index=False) for rapid ingestion and parallelize build_index tasks separately.
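Because the index tasks are stateless and keyed only by resource_uri, the "parallelize build_index separately" pattern in point 3 can be sketched with plain asyncio fan-out (the task body is a placeholder; on Ray or Spark the same function would simply run on remote workers):

```python
import asyncio

async def build_index_task(resource_uri: str) -> str:
    # Stateless: everything the task needs is derivable from the URI alone,
    # so it could equally be serialized and shipped to a Ray/Spark worker.
    return f"indexed {resource_uri}"

async def index_later(uris: list[str]) -> list[str]:
    # Phase 1 (elsewhere): add_resource(..., build_index=False) for fast ingestion.
    # Phase 2: fan out the deferred index tasks in parallel.
    return await asyncio.gather(*(build_index_task(u) for u in uris))

print(asyncio.run(index_later([f"viking://resources/doc_{i}" for i in range(3)])))
```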

Proposed Solution

Same as the description above.

Alternatives Considered

No response

Feature Area

Core (Client/Engine)

Use Case

As above

Example API (Optional)

Additional Context

No response

Contribution

  • I am willing to contribute to implementing this feature

Labels: enhancement (New feature or request)