
[Feature]: Decoupling Ingestion from Indexing & Summarization #350

@Jay-ju

Description


PR Discussion: Decoupling Ingestion from Indexing & Summarization

1. Background & Motivation

Currently, the add_resource interface operates as a tightly coupled "black box" process. When a user uploads a file, the system forces sequential execution of Parse -> Ingest -> Vectorize -> Summarize.

While convenient for small-scale operations, this design has significant limitations in the following scenarios:

  • ETL & Big Data Scenarios : Users may want to rapidly ingest large volumes of raw data via engines like Spark or Daft first (Ingestion phase), deferring the computationally expensive vectorization and LLM summarization to a later time or a separate compute cluster.
  • Cost Control : Semantic summarization consumes LLM tokens. Users should have the option to store and index files without incurring the cost of generating summaries.
  • Architectural Flexibility : To enable OpenViking to function as a library within distributed data pipelines, we need to decouple "writing data" from "processing data".

2. Design Proposal

The core of this refactor is to separate Ingestion and Post-processing into distinct lifecycle stages, controlled via a unified orchestration layer (ResourceProcessor).

2.1 Core Decoupling

We have decomposed the original ResourceProcessor into three independent components:

  1. TreeBuilder (Ingestion): Handles file system operations. Moves parsed temporary files to their target location in VikingFS (L0/L1/L2 structure). This is the foundational step.
  2. IndexBuilder (Vectorization): Scans specified resource directories, extracts text chunks, and builds the vector index.
  3. Summarizer (Semantic Analysis): Invokes the VLM/LLM to generate .abstract.md and .overview.md for resources.
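The three-way split can be sketched roughly as follows. This is an illustrative toy, not the actual OpenViking code: the class names come from the proposal, but the method names, signatures, and return values are assumptions.

```python
import asyncio

class TreeBuilder:
    """Ingestion: move parsed temporary files into the VikingFS layout."""
    async def build(self, resource_uri: str) -> str:
        # The real implementation performs the L0/L1/L2 file-system moves.
        return f"stored {resource_uri}"

class IndexBuilder:
    """Vectorization: chunk text under the resource and build the vector index."""
    async def build(self, resource_uri: str) -> str:
        return f"indexed {resource_uri}"

class Summarizer:
    """Semantic analysis: ask the VLM/LLM to produce .abstract.md / .overview.md."""
    async def summarize(self, resource_uri: str) -> str:
        return f"summarized {resource_uri}"

async def demo(uri: str) -> list[str]:
    # The full pipeline, run stage by stage rather than as one black box.
    return [
        await TreeBuilder().build(uri),
        await IndexBuilder().build(uri),
        await Summarizer().summarize(uri),
    ]

print(asyncio.run(demo("viking://resources/doc_v1")))
```

Because each component takes only a resource URI, any stage can be skipped, retried, or re-run later without re-running the others.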

2.2 Interface Changes

A. Enhanced Control in add_resource. Backward compatibility is maintained while adding fine-grained control parameters:

await client.add_resource(
    path="doc.pdf",
    build_index=True,  # Default: True. Set to False to store without indexing.
    summarize=False    # Default: False. Controls whether to consume tokens for summary generation.
)

B. New Standalone Trigger Interfaces. Atomic operation interfaces are exposed, allowing users to trigger processing manually at any time after ingestion:

# Manually trigger index building
await client.build_index(resource_uris=["viking://resources/doc_v1"])

# Manually trigger semantic summarization
await client.summarize(resource_uris=["viking://resources/doc_v1"])
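Putting A and B together, the "ingest now, process later" flow looks like this. The stub client below only mimics the flag semantics described above so the example is runnable; it is not the real client implementation.

```python
import asyncio

class StubClient:
    """Toy stand-in for the real client, mimicking the control flags above."""
    def __init__(self) -> None:
        self.indexed: set[str] = set()
        self.summarized: set[str] = set()

    async def add_resource(self, path: str, build_index: bool = True,
                           summarize: bool = False) -> str:
        uri = f"viking://resources/{path}"  # hypothetical URI scheme from the examples
        if build_index:
            self.indexed.add(uri)
        if summarize:
            self.summarized.add(uri)
        return uri

    async def build_index(self, resource_uris: list[str]) -> None:
        self.indexed.update(resource_uris)

    async def summarize(self, resource_uris: list[str]) -> None:
        self.summarized.update(resource_uris)

async def demo() -> StubClient:
    client = StubClient()
    # Phase 1: ingest only -- no index, no summary, no token spend.
    uri = await client.add_resource("doc.pdf", build_index=False)
    assert uri not in client.indexed
    # Phase 2: trigger post-processing manually, possibly much later.
    await client.build_index(resource_uris=[uri])
    await client.summarize(resource_uris=[uri])
    return client

client = asyncio.run(demo())
```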

3. Implementation Details

  • ResourceProcessor Refactor:
    • Removed hardcoded execution flows.
    • The process_resource method now acts as an orchestrator. It first calls TreeBuilder to finalize physical file storage (mandatory), then conditionally calls IndexBuilder and Summarizer based on the build_index and summarize flags.
  • Async Compatibility:
    • To support future distributed scheduling, IndexBuilder and Summarizer are designed to be stateless, depending only on resource_uri. This means tasks can be serialized and distributed to Ray or Spark workers (though the current implementation runs them in local async queues).
  • Test Coverage:
    • Added tests/client/test_index_control.py, covering combinations such as "store only (no index)", "store without summary", and "manual index triggering" to verify the control-flag logic.
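The orchestration described in the first bullet reduces to a small amount of control flow. A minimal sketch (stage names and the function body are illustrative, not the actual process_resource implementation):

```python
import asyncio

async def process_resource(resource_uri: str, *, build_index: bool = True,
                           summarize: bool = False) -> list[str]:
    """Orchestrator sketch: storage is mandatory, post-processing is flag-gated."""
    stages = ["tree_build"]            # TreeBuilder always runs first
    if build_index:
        stages.append("index_build")   # IndexBuilder only when requested
    if summarize:
        stages.append("summarize")     # Summarizer only when requested
    return stages

print(asyncio.run(process_resource("viking://resources/doc_v1")))
```

The defaults reproduce the old behavior minus summarization (build_index=True, summarize=False), which is how backward compatibility is preserved.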

4. Benefits

  1. Flexibility: Enables an "Ingest Now, Index Later" pattern.
  2. Observability: With indexing and summarization separated, latency and errors can be monitored per stage.
  3. Scalability: Paves the way for integration with distributed compute engines (e.g., Daft/Spark). External engines can call add_resource(build_index=False) for rapid ingestion and parallelize build_index tasks separately.
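Because the index tasks are stateless and keyed only by resource_uri, the "parallelize build_index separately" pattern in point 3 can be sketched with plain asyncio fan-out (the task body is a placeholder; on Ray or Spark the same function would simply run on remote workers):

```python
import asyncio

async def build_index_task(resource_uri: str) -> str:
    # Stateless: everything the task needs is derivable from the URI alone,
    # so it could equally be serialized and shipped to a Ray/Spark worker.
    return f"indexed {resource_uri}"

async def index_later(uris: list[str]) -> list[str]:
    # Phase 1 (elsewhere): add_resource(..., build_index=False) for fast ingestion.
    # Phase 2: fan out the deferred index tasks in parallel.
    return await asyncio.gather(*(build_index_task(u) for u in uris))

print(asyncio.run(index_later([f"viking://resources/doc_{i}" for i in range(3)])))
```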

Proposed Solution

Same as the description above.

Alternatives Considered

No response

Feature Area

Core (Client/Engine)

Use Case

As above

Example API (Optional)

Additional Context

No response

Contribution

  • I am willing to contribute to implementing this feature

Labels: enhancement (New feature or request)