CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Chunkaroo is a TypeScript-first text chunking library designed for RAG (Retrieval-Augmented Generation) applications. It provides multiple chunking strategies with a focus on performance, flexibility, and type safety.

Key characteristics:

Monorepo structure using Turborepo and pnpm
Main package: packages/chunkaroo/
Strategy-based architecture with 7 chunking strategies
Heavy use of TypeScript generics for type-safe metadata handling
Test coverage target: 80%+ (enforced via vitest config)

Development Commands

Running Tests

# Run all tests
pnpm test

# Watch mode (useful during development)
pnpm test:watch

# Single test file
pnpm --filter chunkaroo test src/chunk/strategies/__tests__/markdown.test.ts

# Coverage report
pnpm test:coverage

Build & Lint

# Build (TypeScript compilation)
pnpm build

# Lint all files
pnpm lint

# Lint with auto-fix
pnpm lint:fix

Package-Specific Commands

# Work in specific package
pnpm --filter chunkaroo <command>

# Example: run tests in chunkaroo package only
pnpm --filter chunkaroo test

Architecture

Core Concepts

Strategy Pattern: The library uses a strategy pattern where each chunking method is a separate strategy (sentence, recursive, token, markdown, etc.). All strategies:

Accept text and options
Return Chunk<Metadata>[] where Metadata extends BaseChunkMetadata
Use the shared postProcessChunks() function from chunk-processor.ts

Post-Processing Pipeline: After chunking, every strategy passes its chunks through a unified post-processing phase (in chunk-processor.ts) that handles:

Chunk ID generation
Previous/next chunk references
Overlap (with smart word-boundary detection)
Custom post-processors (functions that transform chunks)
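
As a rough illustration of the ID and reference steps listed above, here is a minimal sketch. The names linkChunks, LinkedChunk, and the `chunk-N` ID format are hypothetical; the real logic lives in postProcessChunks() and may differ.

```typescript
// Hypothetical stand-ins for the library's types; the real Chunk and
// BaseChunkMetadata in src/types.ts may be shaped differently.
interface LinkedMetadata {
  id: string;
  previousChunkId: string | null;
  nextChunkId: string | null;
}

interface LinkedChunk {
  text: string;
  metadata: LinkedMetadata;
}

// Assign sequential IDs and wire up previous/next references.
function linkChunks(texts: string[], idPrefix = 'chunk'): LinkedChunk[] {
  return texts.map((text, i) => ({
    text,
    metadata: {
      id: `${idPrefix}-${i}`,
      previousChunkId: i > 0 ? `${idPrefix}-${i - 1}` : null,
      nextChunkId: i < texts.length - 1 ? `${idPrefix}-${i + 1}` : null,
    },
  }));
}
```

The first chunk has no previous reference and the last has no next reference; whether the library represents that as null or as a missing field is an assumption here.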

Length Functions: Chunk size is measured with lengthFunction, which defaults to character count but can be an async token counter (for example, one backed by tiktoken). The same function drives both chunkSize and overlap calculations.
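
A minimal sketch of what a sync-or-async length function could look like. The LengthFunction shape shown here is an assumption (check src/types.ts for the real one), fitsChunkSize is a hypothetical helper, and the word counter is only a stand-in for a real tokenizer such as tiktoken.

```typescript
// Assumed shape of LengthFunction: sync or async, returning a size.
type LengthFunction = (text: string) => number | Promise<number>;

// Default behaviour described above: count characters.
const charLength: LengthFunction = (text) => text.length;

// Stand-in for a real tokenizer: counts whitespace-separated words.
// This is NOT tiktoken; swap in a real encoder where token counts matter.
const approxTokenLength: LengthFunction = async (text) => {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
};

// Awaiting the result works for both sync and async length functions.
async function fitsChunkSize(
  text: string,
  chunkSize: number,
  lengthFunction: LengthFunction = charLength,
): Promise<boolean> {
  return (await lengthFunction(text)) <= chunkSize;
}
```

The same text can pass or fail a size check depending on the unit: 'one two three' is 13 characters but 3 words, so a chunkSize of 3 only fits under the word-based counter.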

Key Files & Their Roles

src/chunk/chunk.ts: Main entry point - the chunk() function dispatches to appropriate strategy
src/chunk/chunk-processor.ts: Post-processing logic shared by all strategies (overlap, references, post-processors)
src/types.ts: Core types - Chunk, BaseChunkMetadata, BaseChunkingOptions, LengthFunction
src/chunk/strategies/: Individual chunking strategies (each has its own options and metadata types)

Chunking Strategies

sentence: Splits on sentence boundaries
recursive: Character-based recursive splitting with configurable separators
token: Token-aware chunking (uses length function for token counting)
json: Preserves JSON structure while chunking
markdown: Header-aware chunking with context preservation (500 LOC, recently simplified)
semantic: Embedding-based semantic chunking
semantic-double-pass: Two-pass semantic chunking for better quality

Each strategy exports:

chunkBy<Strategy>() function
<Strategy>ChunkingOptions type (extends BaseChunkingOptions)
<Strategy>ChunkMetadata type (extends BaseChunkMetadata)

Markdown Strategy (Recently Refactored)

The markdown chunker is header-based with token-aware merging:

Split by headers (h1-h6) using regex
Protect code blocks and tables from splitting
Merge chunks bottom-up by depth to reach target chunkSize
Track heading hierarchy for context
Support context headers (breadcrumb, full, parent-only modes)
Parse front matter

See packages/chunkaroo/src/chunk/strategies/markdown/markdown.ts for implementation (reduced from 1200 → 500 lines).
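
To make the first two steps concrete (header splitting and code-block protection), here is a simplified sketch. It is not the code in markdown.ts: splitByHeaders and Section are hypothetical names, and merging, tables, context headers, and front matter are left out.

```typescript
interface Section {
  depth: number;   // 0 for text that appears before the first header
  heading: string;
  body: string;
}

// The fence marker is built indirectly so this example does not contain a
// literal code fence of its own.
const FENCE = '`'.repeat(3);
const HEADER = /^(#{1,6})\s+(.+)$/;

function splitByHeaders(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { depth: 0, heading: '', body: '' };
  let bodyLines: string[] = [];
  let inFence = false;

  // Close the section being built and keep it if it has any content.
  const flush = (): void => {
    current.body = bodyLines.join('\n').trim();
    if (current.heading !== '' || current.body !== '') sections.push(current);
  };

  for (const line of markdown.split('\n')) {
    // Toggle fence state on opening and closing fence lines.
    if (line.trim().indexOf(FENCE) === 0) inFence = !inFence;

    // Lines inside a fenced code block are never treated as headers.
    const match = inFence ? null : HEADER.exec(line);
    if (match) {
      flush();
      current = {
        depth: (match[1] ?? '').length,
        heading: (match[2] ?? '').trim(),
        body: '',
      };
      bodyLines = [];
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return sections;
}
```

The depth field is what a bottom-up merge would key on: deeper sections can be folded into their parents until the target chunkSize is reached.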

Type Safety Pattern

All strategies use TypeScript generics to ensure type-safe metadata:

export async function chunkByMarkdown(
  text: string,
  options: MarkdownChunkingOptions,
): Promise<Chunk<MarkdownChunkMetadata>[]>

The chunk() wrapper function preserves these types through the StrategyRegistry type.
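
One way such a registry can carry types through a dispatcher is sketched below. Everything here is an assumption for illustration: the real StrategyRegistry, option fields, and metadata fields may look different, and this sketch uses a lookup table and synchronous functions where the library uses a switch statement and async strategies.

```typescript
interface BaseChunkMetadata { id: string }
interface Chunk<M extends BaseChunkMetadata = BaseChunkMetadata> {
  text: string;
  metadata: M;
}

// Hypothetical per-strategy metadata types.
interface SentenceMeta extends BaseChunkMetadata { sentenceCount: number }
interface MarkdownMeta extends BaseChunkMetadata { heading: string }

// Maps each strategy name to its options and metadata types.
interface StrategyRegistry {
  sentence: { options: { chunkSize: number }; metadata: SentenceMeta };
  markdown: { options: { chunkSize: number; maxDepth?: number }; metadata: MarkdownMeta };
}

type StrategyName = keyof StrategyRegistry;

type Implementations = {
  [K in StrategyName]: (
    text: string,
    options: StrategyRegistry[K]['options'],
  ) => Chunk<StrategyRegistry[K]['metadata']>[];
};

// Toy implementations: each returns a single chunk with typed metadata.
const implementations: Implementations = {
  sentence: (text) => [
    {
      text,
      metadata: {
        id: 's-0',
        sentenceCount: text.split(/[.!?]+\s*/).filter(Boolean).length,
      },
    },
  ],
  markdown: (text) => [
    { text, metadata: { id: 'm-0', heading: text.split('\n')[0] ?? '' } },
  ],
};

// The strategy name selects both the accepted options and the returned metadata.
function chunk<K extends StrategyName>(
  strategy: K,
  text: string,
  options: StrategyRegistry[K]['options'],
): Chunk<StrategyRegistry[K]['metadata']>[] {
  return implementations[strategy](text, options);
}
```

With this shape, chunk('markdown', ...) yields chunks whose metadata.heading is known to the compiler, while accessing metadata.heading on a 'sentence' result is a type error.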

Utilities

utils/regex-cache.ts: Centralized regex caching with concurrency safety (global regexes are cloned to prevent race conditions)
utils/semantic-helpers.ts: Similarity calculations (cosine similarity) for semantic chunking
utils/split-into-segments.ts: Text splitting utilities
utils/config.ts: Global configuration management
utils/logger.ts: Logging utilities
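
For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. This is the standard formula, not a copy of utils/semantic-helpers.ts; the function name and the zero-vector handling are assumptions.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Semantic chunking typically compares adjacent segment embeddings this way and starts a new chunk where similarity drops below a threshold.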

Known Issues & Patterns

Overlap Behavior

IMPORTANT: The overlap implementation has known issues (tracked in TODO.md):

Token strategy overlap may be incorrect
Overlap is added in post-processing, so it is not strategy-aware
Chunks grow to approximately chunkSize + overlap in size

When overlap > 0, each chunk except the first gets text from the end of the previous chunk prepended. Smart word-boundary detection is used to avoid breaking words.
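
The behaviour described above can be sketched roughly as follows. This is an illustration only: overlapTail and applyOverlap are hypothetical names, sizes are measured in characters rather than through lengthFunction, and the real word-boundary logic in chunk-processor.ts may make different choices.

```typescript
// Take roughly `overlap` trailing characters from `previous`, moving the cut
// forward to the next whitespace when it would land in the middle of a word.
function overlapTail(previous: string, overlap: number): string {
  if (overlap <= 0) return '';
  if (overlap >= previous.length) return previous;

  let start = previous.length - overlap;
  const midWord =
    /\S/.test(previous.charAt(start - 1)) && /\S/.test(previous.charAt(start));
  if (midWord) {
    const offset = previous.slice(start).search(/\s/);
    // The tail is one partial word: skip the overlap rather than split it.
    if (offset === -1) return '';
    start += offset;
  }
  return previous.slice(start).trim();
}

// Prepend the tail of each chunk to the chunk that follows it.
function applyOverlap(chunks: string[], overlap: number): string[] {
  return chunks.map((text, i) => {
    if (i === 0) return text;
    const tail = overlapTail(chunks[i - 1] ?? '', overlap);
    return tail === '' ? text : `${tail} ${text}`;
  });
}
```

With chunks 'the quick brown fox' and 'jumps over' and an overlap of 7, the raw cut would start at 'own fox', so the sketch moves forward and prepends only 'fox'. The second chunk also grows past its original size, which is the chunkSize + overlap effect noted above.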

Length Function Usage

The lengthFunction should primarily be used for checking chunk size limits, NOT for calculating start/end indices. There's a known TODO to audit and fix this usage pattern across strategies.
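
A small sketch of the intended separation: indices come from character offsets in the source string, and lengthFunction is consulted only for size. locatePieces and Span are hypothetical names, the length function is synchronous for brevity, and endIndex is assumed to be exclusive.

```typescript
type SyncLengthFunction = (text: string) => number;

interface Span {
  text: string;
  startIndex: number; // character offset into the source
  endIndex: number;   // exclusive character offset (an assumption)
  size: number;       // measured by lengthFunction, e.g. tokens
}

function locatePieces(
  source: string,
  pieces: string[],
  lengthFunction: SyncLengthFunction,
): Span[] {
  const spans: Span[] = [];
  let cursor = 0;
  for (const piece of pieces) {
    const startIndex = source.indexOf(piece, cursor);
    if (startIndex === -1) throw new Error('Piece not found in source');
    // Offsets use piece.length, never lengthFunction(piece): a token count
    // is not a position in the string.
    const endIndex = startIndex + piece.length;
    spans.push({ text: piece, startIndex, endIndex, size: lengthFunction(piece) });
    cursor = endIndex;
  }
  return spans;
}
```

If lengthFunction counted words, computing endIndex as startIndex + lengthFunction(piece) would place the first piece of 'alpha beta gamma' at 0..2 instead of 0..10.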

Testing Patterns

Tests are co-located in __tests__/ directories:

src/chunk/strategies/__tests__/ for strategy tests
src/utils/__tests__/ for utility tests
Use __snapshots__/ for snapshot testing where appropriate

Coverage thresholds are enforced at 80% for branches, functions, lines, and statements.

Regex Safety

When working with regexes:

Use regex-cache.ts for frequently-used patterns
Global regexes must be cloned before returning (see getGlobalRegex())
This prevents race conditions in concurrent chunking scenarios
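
The reason cloning matters is that a global (g flag) RegExp carries mutable lastIndex state between calls. The sketch below shows the idea; the getGlobalRegex signature and cache layout here are assumptions, not the actual code in regex-cache.ts.

```typescript
// Cache keyed by pattern source and flags.
const regexCache: { [key: string]: RegExp } = {};

function getGlobalRegex(source: string, flags: string = 'g'): RegExp {
  const key = `${source}/${flags}`;
  let cached = regexCache[key];
  if (cached === undefined) {
    cached = new RegExp(source, flags);
    regexCache[key] = cached;
  }
  // Hand out a fresh copy so no two callers share lastIndex.
  return new RegExp(cached.source, cached.flags);
}
```

Two callers that each obtain a regex this way can run exec() loops over different strings without one caller's lastIndex advancing the other's search position.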

Common Tasks

Adding a New Chunking Strategy

Create src/chunk/strategies/my-strategy.ts
Define MyStrategyChunkMetadata extending BaseChunkMetadata
Define MyStrategyChunkingOptions extending BaseChunkingOptions
Implement chunkByMyStrategy() function
Call postProcessChunks() before returning
Add to StrategyRegistry in chunk.ts
Add case to switch statement in chunk() function
Create tests in __tests__/my-strategy.test.ts
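
A skeleton following the steps above might look like the sketch below, using a toy paragraph strategy. All types are local stand-ins so the example is self-contained; in the repository they would come from src/types.ts, and the real postProcessChunks() takes more inputs and does more work. For brevity the sketch ignores chunkSize and omits the registry and switch steps.

```typescript
// Local stand-ins for the library's base types (assumed shapes).
interface BaseChunkMetadata { id: string; startIndex: number; endIndex: number }
interface BaseChunkingOptions { chunkSize: number }
interface Chunk<M extends BaseChunkMetadata = BaseChunkMetadata> {
  text: string;
  metadata: M;
}

// Strategy-specific metadata and options.
interface ParagraphChunkMetadata extends BaseChunkMetadata { paragraphIndex: number }
interface ParagraphChunkingOptions extends BaseChunkingOptions { separator?: string }

// Stand-in for the shared post-processing step: here it only assigns IDs.
function postProcessChunks<M extends BaseChunkMetadata>(chunks: Chunk<M>[]): Chunk<M>[] {
  return chunks.map((c, i) => ({ ...c, metadata: { ...c.metadata, id: `chunk-${i}` } }));
}

async function chunkByParagraph(
  text: string,
  options: ParagraphChunkingOptions,
): Promise<Chunk<ParagraphChunkMetadata>[]> {
  const separator = options.separator ?? '\n\n';
  const chunks: Chunk<ParagraphChunkMetadata>[] = [];
  let cursor = 0;
  let paragraphIndex = 0;

  for (const part of text.split(separator)) {
    const startIndex = cursor;
    cursor += part.length + separator.length;
    if (part.trim() === '') continue; // skip empty paragraphs
    chunks.push({
      text: part,
      metadata: {
        id: '', // filled in during post-processing
        startIndex,
        endIndex: startIndex + part.length,
        paragraphIndex,
      },
    });
    paragraphIndex += 1;
  }

  // Always finish with the shared post-processing step.
  return postProcessChunks(chunks);
}
```

Strategy-specific fields such as paragraphIndex are populated before post-processing, and the stand-in preserves them by spreading the existing metadata.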

Adding a Post-Processor

Post-processors transform chunks after main processing:

export function myProcessor(
  chunk: Chunk,
  index: number,
  chunks: Chunk[]
): Chunk {
  return {
    ...chunk,
    metadata: {
      ...chunk.metadata,
      customField: 'value'
    }
  };
}

See src/chunk/post-processors/add-context-headers.ts for reference implementation.
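
As a sketch of how several post-processors with the (chunk, index, chunks) signature could be applied in order: runPostProcessors, addWordCount, and addPosition are hypothetical, the Chunk type is a simplified stand-in, and how the library actually registers and runs post-processors is not shown here.

```typescript
// Simplified stand-in for the library's Chunk type.
interface Chunk {
  text: string;
  metadata: { id: string; [key: string]: unknown };
}

type PostProcessor = (chunk: Chunk, index: number, chunks: Chunk[]) => Chunk;

// Adds a word count without mutating the input chunk.
const addWordCount: PostProcessor = (chunk) => ({
  ...chunk,
  metadata: {
    ...chunk.metadata,
    wordCount: chunk.text.split(/\s+/).filter(Boolean).length,
  },
});

// Uses the index and the full array to record the chunk's position.
const addPosition: PostProcessor = (chunk, index, chunks) => ({
  ...chunk,
  metadata: { ...chunk.metadata, position: `${index + 1}/${chunks.length}` },
});

// Apply each processor to every chunk, feeding the output of one processor
// into the next.
function runPostProcessors(chunks: Chunk[], processors: PostProcessor[]): Chunk[] {
  return processors.reduce(
    (current, processor) =>
      current.map((chunk, index, all) => processor(chunk, index, all)),
    chunks,
  );
}
```

Because each processor returns a new object, earlier results stay intact and later processors see the fields added before them.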

Working with Metadata

To add custom metadata fields:

Extend the strategy's metadata interface
Populate fields before calling postProcessChunks()
Ensure metadata is preserved through post-processing

Base metadata always includes: id, startIndex, endIndex, lines, previousChunkId, nextChunkId
