CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Chunkaroo is a TypeScript-first text chunking library designed for RAG (Retrieval-Augmented Generation) applications. It provides multiple chunking strategies with a focus on performance, flexibility, and type safety.

Key characteristics:

  • Monorepo structure using Turborepo and pnpm
  • Main package: packages/chunkaroo/
  • Strategy-based architecture with 7 chunking strategies
  • Heavy use of TypeScript generics for type-safe metadata handling
  • Test coverage target: 80%+ (enforced via vitest config)

Development Commands

Running Tests

# Run all tests
pnpm test

# Watch mode (useful during development)
pnpm test:watch

# Single test file
pnpm --filter chunkaroo test src/chunk/strategies/__tests__/markdown.test.ts

# Coverage report
pnpm test:coverage

Build & Lint

# Build (TypeScript compilation)
pnpm build

# Lint all files
pnpm lint

# Lint with auto-fix
pnpm lint:fix

Package-Specific Commands

# Work in specific package
pnpm --filter chunkaroo <command>

# Example: run tests in chunkaroo package only
pnpm --filter chunkaroo test

Architecture

Core Concepts

Strategy Pattern: The library uses a strategy pattern where each chunking method is a separate strategy (sentence, recursive, token, markdown, etc.). All strategies:

  1. Accept text and options
  2. Return Chunk<Metadata>[] where Metadata extends BaseChunkMetadata
  3. Use the shared postProcessChunks() function from chunk-processor.ts

Post-Processing Pipeline: After chunking, the output of every strategy passes through a unified post-processing phase (in chunk-processor.ts) that handles:

  1. Chunk ID generation
  2. Previous/next chunk references
  3. Overlap (with smart word-boundary detection)
  4. Custom post-processors (functions that transform chunks)
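The ID and reference steps of this pipeline can be illustrated with a minimal sketch. The `SketchChunk` type, the `linkChunks` name, and the `chunk-N` ID format are assumptions made for illustration; the real logic lives in chunk-processor.ts and may differ.

```typescript
// Simplified chunk shape for illustration only; the real Chunk type carries more metadata.
interface SketchChunk {
  text: string;
  metadata: { id: string; previousChunkId?: string; nextChunkId?: string };
}

// Assigns sequential IDs and links each chunk to its neighbours.
// The `${prefix}-${index}` ID scheme is an assumption, not the library's format.
function linkChunks(texts: string[], idPrefix = 'chunk'): SketchChunk[] {
  return texts.map((text, i) => ({
    text,
    metadata: {
      id: `${idPrefix}-${i}`,
      // The first chunk has no predecessor and the last has no successor.
      previousChunkId: i > 0 ? `${idPrefix}-${i - 1}` : undefined,
      nextChunkId: i < texts.length - 1 ? `${idPrefix}-${i + 1}` : undefined,
    },
  }));
}
```

Generating IDs before linking means every reference points at an ID that is guaranteed to exist in the same result array.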

Length Functions: Chunks are measured with lengthFunction, which defaults to character count but can be an async token counter (for example, via tiktoken). The same function drives both chunkSize and overlap calculations.
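A minimal sketch of what sync and async length functions might look like. The `LengthFunction` shape shown here is an assumption (the real type is in src/types.ts), and `approxTokenLength` is a hypothetical word counter standing in for a real tokenizer.

```typescript
// Assumed shape; the real LengthFunction type lives in src/types.ts and may differ.
type LengthFunction = (text: string) => number | Promise<number>;

// Default behaviour described above: length is the character count.
const characterLength: LengthFunction = (text) => text.length;

// Hypothetical stand-in for a tokenizer: counts whitespace-separated words.
const approxTokenLength: LengthFunction = async (text) =>
  text.trim() === '' ? 0 : text.trim().split(/\s+/).length;

// Checks a piece of text against a size limit. Awaiting the result lets
// callers pass either a sync or an async length function.
async function fitsInChunk(
  text: string,
  chunkSize: number,
  lengthFunction: LengthFunction = characterLength,
): Promise<boolean> {
  return (await lengthFunction(text)) <= chunkSize;
}
```

Awaiting unconditionally keeps the calling code identical for both variants, at the cost of making every size check asynchronous.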

Key Files & Their Roles

  • src/chunk/chunk.ts: Main entry point; the chunk() function dispatches to the appropriate strategy
  • src/chunk/chunk-processor.ts: Post-processing logic shared by all strategies (overlap, references, post-processors)
  • src/types.ts: Core types - Chunk, BaseChunkMetadata, BaseChunkingOptions, LengthFunction
  • src/chunk/strategies/: Individual chunking strategies (each has its own options and metadata types)

Chunking Strategies

  1. sentence: Splits on sentence boundaries
  2. recursive: Character-based recursive splitting with configurable separators
  3. token: Token-aware chunking (uses length function for token counting)
  4. json: Preserves JSON structure while chunking
  5. markdown: Header-aware chunking with context preservation (500 LOC, recently simplified)
  6. semantic: Embedding-based semantic chunking
  7. semantic-double-pass: Two-pass semantic chunking for better quality

Each strategy exports:

  • chunkBy<Strategy>() function
  • <Strategy>ChunkingOptions type (extends BaseChunkingOptions)
  • <Strategy>ChunkMetadata type (extends BaseChunkMetadata)

Markdown Strategy (Recently Refactored)

The markdown chunker is header-based with token-aware merging:

  1. Split by headers (h1-h6) using regex
  2. Protect code blocks and tables from splitting
  3. Merge chunks bottom-up by depth to reach target chunkSize
  4. Track heading hierarchy for context
  5. Support context headers (breadcrumb, full, parent-only modes)
  6. Parse front matter

See packages/chunkaroo/src/chunk/strategies/markdown/markdown.ts for implementation (reduced from 1200 → 500 lines).
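Steps 1 and 2 (header splitting with code-block protection) can be sketched as follows. This is an illustration under assumptions, not the code in markdown.ts: the `Section` type and `splitByHeaders` name are hypothetical, only fenced code blocks are protected, and tables, merging, and front matter are left out.

```typescript
interface Section {
  depth: number; // 1-6 for h1-h6, 0 for text before the first header
  heading: string;
  body: string;
}

// Splits markdown into sections at ATX headers, ignoring '#' lines inside fenced code blocks.
function splitByHeaders(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { depth: 0, heading: '', body: '' };
  let inFence = false;

  for (const line of markdown.split('\n')) {
    // Toggle fence state on lines that open or close a fenced code block.
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence;

    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      // A new header closes the section collected so far (if it has any content).
      if (current.heading || current.body.trim()) sections.push(current);
      current = { depth: (match[1] ?? '').length, heading: match[2] ?? '', body: '' };
    } else {
      current.body += line + '\n';
    }
  }
  if (current.heading || current.body.trim()) sections.push(current);
  return sections;
}
```

Recording `depth` on each section is what makes a later bottom-up merge possible: the deepest sections can be folded into their parents first until the target chunkSize is reached.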

Type Safety Pattern

All strategies use TypeScript generics to ensure type-safe metadata:

export async function chunkByMarkdown(
  text: string,
  options: MarkdownChunkingOptions,
): Promise<Chunk<MarkdownChunkMetadata>[]>

The chunk() wrapper function preserves these types through the StrategyRegistry type.
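One way a registry type can carry per-strategy metadata types through a wrapper is sketched below. All names here (`SketchRegistry`, `chunkSketch`, the metadata fields) are hypothetical, and the sketch dispatches through a lookup table for brevity, which is not necessarily how chunk.ts does it.

```typescript
interface BaseMeta { id: string }
interface SentenceMeta extends BaseMeta { sentenceCount: number }
interface MarkdownMeta extends BaseMeta { headings: string[] }
interface SketchChunk<M extends BaseMeta> { text: string; metadata: M }

// Maps each strategy name to the metadata type it produces.
interface SketchRegistry {
  sentence: SentenceMeta;
  markdown: MarkdownMeta;
}

// One function per registry key, each returning chunks with the matching metadata type.
type StrategyTable = {
  [K in keyof SketchRegistry]: (text: string) => SketchChunk<SketchRegistry[K]>[];
};

const strategies: StrategyTable = {
  sentence: (text) => [{
    text,
    metadata: { id: 's-0', sentenceCount: text.split(/[.!?]+\s*/).filter(Boolean).length },
  }],
  markdown: (text) => [{
    text,
    metadata: { id: 'm-0', headings: text.split('\n').filter((l) => l.startsWith('#')) },
  }],
};

// The generic K ties the strategy name to the return type, so callers get
// MarkdownMeta for 'markdown' and SentenceMeta for 'sentence' without casts.
function chunkSketch<K extends keyof SketchRegistry>(
  text: string,
  strategy: K,
): SketchChunk<SketchRegistry[K]>[] {
  return strategies[strategy](text);
}
```

With this shape, `chunkSketch(text, 'markdown')[0].metadata.headings` type-checks, while accessing `sentenceCount` on the same result is a compile-time error.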

Utilities

  • utils/regex-cache.ts: Centralized regex caching with concurrency safety (global regexes are cloned to prevent race conditions)
  • utils/semantic-helpers.ts: Similarity calculations (cosine similarity) for semantic chunking
  • utils/split-into-segments.ts: Text splitting utilities
  • utils/config.ts: Global configuration management
  • utils/logger.ts: Logging utilities
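For reference, cosine similarity (the measure named for utils/semantic-helpers.ts) can be computed as below. This is a generic sketch of the formula; the function name, signature, and zero-vector handling are assumptions rather than the helper's actual API.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Assumption: treat a zero-length vector as having no similarity instead of returning NaN.
  return denominator === 0 ? 0 : dot / denominator;
}
```

In semantic chunking, a drop in similarity between adjacent sentence embeddings is typically what marks a candidate chunk boundary.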

Known Issues & Patterns

Overlap Behavior

IMPORTANT: The overlap implementation has known issues (tracked in TODO.md):

  • Token strategy overlap may be incorrect
  • Overlap is added in post-processing (not strategy-aware)
  • Chunks become approximately chunkSize + overlap in size
  • Smart word-boundary detection is used to avoid breaking words

When overlap > 0, every chunk except the first has trailing text from the previous chunk prepended.
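The prepending behaviour can be sketched roughly as follows, assuming character-based overlap and a single space as the joiner. The `overlapTail` and `applyOverlap` names are hypothetical, and the real word-boundary logic in chunk-processor.ts may behave differently.

```typescript
// Returns roughly the last `overlap` characters of `previous`,
// moved forward to the next word boundary so no partial word is emitted.
function overlapTail(previous: string, overlap: number): string {
  if (overlap <= 0) return '';
  if (overlap >= previous.length) return previous;

  const start = previous.length - overlap;
  const midWord =
    /\S/.test(previous[start - 1] ?? '') && /\S/.test(previous[start] ?? '');
  if (!midWord) return previous.slice(start).trimStart();

  // The cut landed inside a word: skip ahead to the next whitespace.
  const nextSpace = previous.slice(start).search(/\s/);
  return nextSpace === -1 ? '' : previous.slice(start + nextSpace).trimStart();
}

// Prepends the tail of each previous chunk to the chunk that follows it.
// Joining with a single space is an assumption made for this sketch.
function applyOverlap(chunks: string[], overlap: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const tail = overlapTail(chunks[i - 1] ?? '', overlap);
    return tail ? `${tail} ${chunk}` : chunk;
  });
}
```

Because the tail is added after the strategy has already sized the chunk, the result can exceed chunkSize by up to the overlap amount, which matches the size growth noted in the list above.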

Length Function Usage

The lengthFunction should primarily be used for checking chunk size limits, NOT for calculating start/end indices. There's a known TODO to audit and fix this usage pattern across strategies.
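A minimal sketch of the intended separation: the length function decides when a chunk is full, while startIndex and endIndex always come from character offsets in the source text. The `packWords` name, the `Span` type, the sync-only length function, and the exclusive endIndex are all assumptions made for this illustration.

```typescript
type SyncLength = (text: string) => number;

interface Span {
  text: string;
  startIndex: number; // character offset, inclusive
  endIndex: number;   // character offset, exclusive (assumed convention)
}

// Greedily packs words into spans. Size is checked with lengthFunction;
// indices are taken from regex match positions, never from lengthFunction.
function packWords(
  text: string,
  chunkSize: number,
  lengthFunction: SyncLength = (t) => t.length,
): Span[] {
  const spans: Span[] = [];
  const wordPattern = /\S+/g;
  let start = -1;
  let end = -1;
  let match: RegExpExecArray | null;

  while ((match = wordPattern.exec(text)) !== null) {
    const wordEnd = match.index + match[0].length;
    if (start === -1) {
      start = match.index;
    } else if (lengthFunction(text.slice(start, wordEnd)) > chunkSize) {
      // Adding this word would exceed the limit: close the current span.
      spans.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
      start = match.index;
    }
    end = wordEnd;
  }
  if (start !== -1) {
    spans.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
  }
  return spans;
}
```

With a token-based length function, a chunk of 2 tokens may span 7 characters; deriving indices from the token count would point at the wrong position, which is the failure mode this separation avoids.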

Testing Patterns

Tests are co-located in __tests__/ directories:

  • src/chunk/strategies/__tests__/ for strategy tests
  • src/utils/__tests__/ for utility tests
  • Use __snapshots__/ for snapshot testing where appropriate

Coverage thresholds are enforced at 80% for branches, functions, lines, and statements.

Regex Safety

When working with regexes:

  • Use regex-cache.ts for frequently-used patterns
  • Global regexes must be cloned before returning (see getGlobalRegex())
  • This prevents race conditions in concurrent chunking scenarios
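The reason cloning matters is that a global (`g`) RegExp carries mutable `lastIndex` state, so two callers sharing one instance can interfere with each other. The sketch below shows the cloning idea; `getGlobalRegexSketch` and its cache key format are hypothetical and are not the actual getGlobalRegex() implementation.

```typescript
const regexCache = new Map<string, RegExp>();

// Caches the compiled pattern but hands every caller a fresh clone,
// so no two callers share the same lastIndex state.
function getGlobalRegexSketch(source: string, flags = 'g'): RegExp {
  const key = `${flags}:${source}`;
  let cached = regexCache.get(key);
  if (!cached) {
    cached = new RegExp(source, flags);
    regexCache.set(key, cached);
  }
  // new RegExp(regex) copies the source and flags; the clone starts at lastIndex 0.
  return new RegExp(cached);
}
```

For example, after one caller runs `exec()` on its clone and advances `lastIndex`, a second caller requesting the same pattern still receives an instance that starts matching from position 0.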

Common Tasks

Adding a New Chunking Strategy

  1. Create src/chunk/strategies/my-strategy.ts
  2. Define MyStrategyChunkMetadata extending BaseChunkMetadata
  3. Define MyStrategyChunkingOptions extending BaseChunkingOptions
  4. Implement chunkByMyStrategy() function
  5. Call postProcessChunks() before returning
  6. Add to StrategyRegistry in chunk.ts
  7. Add case to switch statement in chunk() function
  8. Create tests in __tests__/my-strategy.test.ts
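Steps 2 through 5 might look roughly like the skeleton below for a hypothetical paragraph strategy. Every type here is a simplified stand-in for the real ones in src/types.ts, and `postProcessChunksStub` only imitates ID assignment; the actual postProcessChunks() signature is not shown in this document and may differ.

```typescript
// Simplified stand-ins for the real types in src/types.ts (assumed shapes).
interface BaseChunkMetadata { id: string; startIndex: number; endIndex: number }
interface BaseChunkingOptions { chunkSize: number }
interface Chunk<M extends BaseChunkMetadata = BaseChunkMetadata> { text: string; metadata: M }

// Steps 2 and 3: strategy-specific metadata and options.
interface ParagraphChunkMetadata extends BaseChunkMetadata { paragraphIndex: number }
interface ParagraphChunkingOptions extends BaseChunkingOptions { separator?: string }

// Hypothetical stand-in for postProcessChunks(); here it only assigns IDs.
async function postProcessChunksStub<M extends BaseChunkMetadata>(
  chunks: Chunk<M>[],
): Promise<Chunk<M>[]> {
  return chunks.map((chunk, i) => ({
    ...chunk,
    metadata: { ...chunk.metadata, id: `chunk-${i}` },
  }));
}

// Step 4: the strategy function. For brevity it ignores chunkSize and
// emits one chunk per non-empty paragraph.
async function chunkByParagraph(
  text: string,
  options: ParagraphChunkingOptions,
): Promise<Chunk<ParagraphChunkMetadata>[]> {
  const separator = options.separator ?? '\n\n';
  const chunks: Chunk<ParagraphChunkMetadata>[] = [];
  let cursor = 0;

  text.split(separator).forEach((paragraph, paragraphIndex) => {
    const startIndex = cursor;
    cursor += paragraph.length + separator.length;
    if (paragraph.trim() === '') return;
    chunks.push({
      text: paragraph,
      metadata: { id: '', startIndex, endIndex: startIndex + paragraph.length, paragraphIndex },
    });
  });

  // Step 5: hand the raw chunks to post-processing before returning.
  return postProcessChunksStub(chunks);
}
```

Strategy-specific fields such as `paragraphIndex` are populated before post-processing, and the stub spreads existing metadata so those fields survive it.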

Adding a Post-Processor

Post-processors transform chunks after main processing:

export function myProcessor(
  chunk: Chunk,
  index: number,
  chunks: Chunk[]
): Chunk {
  return {
    ...chunk,
    metadata: {
      ...chunk.metadata,
      customField: 'value'
    }
  };
}

See src/chunk/post-processors/add-context-headers.ts for reference implementation.

Working with Metadata

To add custom metadata fields:

  1. Extend the strategy's metadata interface
  2. Populate fields before calling postProcessChunks()
  3. Ensure metadata is preserved through post-processing

Base metadata always includes: id, startIndex, endIndex, lines, previousChunkId, nextChunkId