PR Discussion: Decoupling Ingestion from Indexing & Summarization
1. Background & Motivation
Currently, the add_resource interface operates as a tightly coupled "black box" process. When a user uploads a file, the system forces a sequential execution of Parse -> Ingest -> Vectorize -> Summarize.
While convenient for small-scale operations, this design has significant limitations in the following scenarios:
- ETL & Big Data Scenarios: Users may want to rapidly ingest large volumes of raw data via engines like Spark or Daft first (the ingestion phase), deferring the computationally expensive vectorization and LLM summarization to a later time or a separate compute cluster.
- Cost Control: Semantic summarization consumes LLM tokens. Users should have the option to store and index files without incurring the cost of generating summaries.
- Architectural Flexibility: To enable OpenViking to function as a library within distributed data pipelines, we need to decouple "writing data" from "processing data".
2. Design Proposal
The core idea of this refactor is to separate ingestion and post-processing into distinct lifecycle stages, controlled via a unified orchestration layer (ResourceProcessor).
2.1 Core Decoupling
We have decomposed the original ResourceProcessor into three independent components:
- TreeBuilder (Ingestion): Handles file system operations, moving parsed temporary files to their target location in VikingFS (the L0/L1/L2 structure). This is the foundational step.
- IndexBuilder (Vectorization): Scans specified resource directories, extracts text chunks, and builds the vector index.
- Summarizer (Semantic Analysis): Invokes the VLM/LLM to generate .abstract.md and .overview.md for resources.
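As a rough illustration, the three components could each expose an independent, URI-driven entry point. The class and method names below follow this description, but the bodies are illustrative stubs, not the actual OpenViking implementation:

```python
import asyncio


class TreeBuilder:
    """Ingestion: place parsed files at their VikingFS target location."""

    async def build(self, resource_uri: str) -> str:
        # Stand-in for the real L0/L1/L2 file-system placement logic.
        return f"{resource_uri}: stored"


class IndexBuilder:
    """Vectorization: chunk text under a resource and build the vector index."""

    async def build(self, resource_uri: str) -> str:
        # Stand-in for chunk extraction and embedding.
        return f"{resource_uri}: indexed"


class Summarizer:
    """Semantic analysis: generate .abstract.md / .overview.md via the VLM/LLM."""

    async def summarize(self, resource_uri: str) -> str:
        # Stand-in for the LLM call that writes the summary files.
        return f"{resource_uri}: summarized"
```

Note that each component depends only on the resource URI, which is what makes them independently schedulable later.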
2.2 Interface Changes
A. Enhanced Control in add_resource
Backward compatibility is maintained while adding fine-grained control parameters:
```python
await client.add_resource(
    path="doc.pdf",
    build_index=True,   # Default: True. Set to False to store without indexing.
    summarize=False,    # Default: False. Controls whether to consume tokens for summary generation.
)
```
B. New Standalone Trigger Interfaces
Atomic operation interfaces are exposed, allowing users to trigger processing manually at any time after ingestion:
```python
# Manually trigger index building
await client.build_index(resource_uris=["viking://resources/doc_v1"])

# Manually trigger semantic summarization
await client.summarize(resource_uris=["viking://resources/doc_v1"])
```
3. Implementation Details
- ResourceProcessor Refactor:
  - Removed hardcoded execution flows.
  - The process_resource method now acts as an orchestrator. It first calls TreeBuilder to finalize physical file storage (mandatory), then conditionally calls IndexBuilder and Summarizer based on the build_index and summarize flags.
- Async Compatibility:
  - To support future distributed scheduling, IndexBuilder and Summarizer are designed to be stateless, depending only on resource_uri. This means tasks can be serialized and distributed to Ray or Spark workers (though the current implementation runs them in local async queues).
- Test Coverage:
  - Added tests/client/test_index_control.py, covering combinations such as "store only (no index)", "store without summary", and "manual index triggering" to ensure the control flags behave correctly.
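The orchestration described above can be sketched as follows. The flag names match this proposal, while the component classes are inlined stubs for illustration, not the real implementation:

```python
import asyncio


class _StubComponent:
    """Illustrative stand-in for TreeBuilder / IndexBuilder / Summarizer."""

    def __init__(self, label: str):
        self.label = label

    async def run(self, resource_uri: str) -> str:
        return f"{resource_uri}:{self.label}"


class ResourceProcessor:
    """Orchestrates mandatory ingestion and optional post-processing."""

    def __init__(self):
        self.tree_builder = _StubComponent("stored")
        self.index_builder = _StubComponent("indexed")
        self.summarizer = _StubComponent("summarized")

    async def process_resource(self, resource_uri: str, *,
                               build_index: bool = True,
                               summarize: bool = False) -> list:
        # Ingestion is always performed: the file must land in VikingFS first.
        done = [await self.tree_builder.run(resource_uri)]
        # Post-processing runs only when the corresponding flag is set.
        if build_index:
            done.append(await self.index_builder.run(resource_uri))
        if summarize:
            done.append(await self.summarizer.run(resource_uri))
        return done
```

For example, process_resource(uri, build_index=False) performs ingestion only, leaving indexing and summarization to be triggered later via the standalone interfaces.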
4. Benefits
- Flexibility: Enables an "Ingest Now, Index Later" pattern.
- Observability: With indexing and summarization separated, latency and errors can be monitored per stage.
- Scalability: Paves the way for integration with distributed compute engines (e.g., Daft/Spark). External engines can simply call add_resource(build_index=False) for rapid ingestion and parallelize build_index tasks separately.
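From the caller's side, the "Ingest Now, Index Later" pattern might look like the following. MockClient is a hypothetical stand-in for the real client, used only to show the two-phase call sequence:

```python
import asyncio


class MockClient:
    """Hypothetical stand-in recording which resources were stored/indexed."""

    def __init__(self):
        self.stored = []
        self.indexed = []

    async def add_resource(self, path, *, build_index=True, summarize=False):
        self.stored.append(path)
        if build_index:
            self.indexed.append(path)

    async def build_index(self, resource_uris):
        self.indexed.extend(resource_uris)


async def two_phase_ingest(paths):
    client = MockClient()
    # Phase 1: rapid ingestion, skipping indexing on the write path.
    await asyncio.gather(*(client.add_resource(p, build_index=False) for p in paths))
    # Phase 2: trigger index building later, possibly on separate workers.
    await client.build_index(resource_uris=client.stored)
    return client
```

The same split is what would let an external engine run phase 1 in bulk and fan phase 2 out across a compute cluster.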
Proposed Solution
As above
Alternatives Considered
No response
Feature Area
Core (Client/Engine)
Use Case
As above
Example API (Optional)
Additional Context
No response
Contribution
- I am willing to contribute to implementing this feature