This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Chunkaroo is a TypeScript-first text chunking library designed for RAG (Retrieval-Augmented Generation) applications. It provides multiple chunking strategies with a focus on performance, flexibility, and type safety.
Key characteristics:
- Monorepo structure using Turborepo and pnpm
- Main package: packages/chunkaroo/ - Strategy-based architecture with 7 chunking strategies
- Heavy use of TypeScript generics for type-safe metadata handling
- Test coverage target: 80%+ (enforced via vitest config)
# Run all tests
pnpm test
# Watch mode (useful during development)
pnpm test:watch
# Single test file
pnpm --filter chunkaroo test src/chunk/strategies/__tests__/markdown.test.ts
# Coverage report
pnpm test:coverage
# Build (TypeScript compilation)
pnpm build
# Lint all files
pnpm lint
# Lint with auto-fix
pnpm lint:fix
# Work in specific package
pnpm --filter chunkaroo <command>
# Example: run tests in chunkaroo package only
pnpm --filter chunkaroo test
Strategy Pattern: The library uses a strategy pattern where each chunking method is a separate strategy (sentence, recursive, token, markdown, etc.). All strategies:
- Accept text and options
- Return Chunk<Metadata>[] where Metadata extends BaseChunkMetadata
- Use the shared postProcessChunks() function from chunk-processor.ts
Post-Processing Pipeline: After chunking, all strategies go through a unified post-processing phase (in chunk-processor.ts) that handles:
- Chunk ID generation
- Previous/next chunk references
- Overlap (with smart word-boundary detection)
- Custom post-processors (functions that transform chunks)
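The ID, reference, and custom post-processor steps listed above can be pictured with a small sketch. Everything in it is a stand-in: the type names, the index-based ID scheme, and postProcessSketch itself are assumptions made for illustration, not the contents of chunk-processor.ts. Overlap is left out to keep the example short.

```typescript
// Hypothetical stand-in types; the real ones live in src/types.ts.
interface BaseMeta {
  id: string;
  previousChunkId: string | null;
  nextChunkId: string | null;
}

interface SketchChunk {
  text: string;
  metadata: BaseMeta;
}

type PostProcessor = (
  chunk: SketchChunk,
  index: number,
  chunks: SketchChunk[],
) => SketchChunk;

// Assumed ID scheme: the real ID format is not described in this file.
const idFor = (i: number): string => `chunk-${i}`;

function postProcessSketch(
  texts: string[],
  processors: PostProcessor[] = [],
): SketchChunk[] {
  // Steps 1 and 2: assign IDs and link each chunk to its neighbours.
  let chunks: SketchChunk[] = texts.map((text, i) => ({
    text,
    metadata: {
      id: idFor(i),
      previousChunkId: i > 0 ? idFor(i - 1) : null,
      nextChunkId: i < texts.length - 1 ? idFor(i + 1) : null,
    },
  }));

  // Step 3: run custom post-processors in the order they were given.
  for (const processor of processors) {
    chunks = chunks.map((chunk, i, all) => processor(chunk, i, all));
  }
  return chunks;
}
```

Running the processors after linking means each one sees chunks that already carry their IDs and neighbour references.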
Length Functions: Chunks are measured using lengthFunction (defaults to character count, but can be async token counting via tiktoken). This is used for chunkSize and overlap calculations.
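As a rough illustration of a sync-or-async length function, the sketch below checks a size limit with either a character counter or an async counter. The LengthFn type and fitsChunkSize helper are assumed names, and the word counter is only a placeholder for a real tokenizer such as tiktoken, which is not called here.

```typescript
// Assumed shape: a length function may answer synchronously or asynchronously.
type LengthFn = (text: string) => number | Promise<number>;

// Default behaviour described above: plain character count.
const charLength: LengthFn = (text) => text.length;

// Placeholder for a tokenizer: counts whitespace-separated words.
const approxTokenLength: LengthFn = async (text) =>
  text.split(/\s+/).filter(Boolean).length;

// Awaiting the result handles both the sync and the async case.
async function fitsChunkSize(
  text: string,
  chunkSize: number,
  lengthFunction: LengthFn = charLength,
): Promise<boolean> {
  return (await lengthFunction(text)) <= chunkSize;
}
```

With chunkSize set to 3, the text "one two three" fits when measured by approxTokenLength (3 words) but not by charLength (13 characters), which is why chunkSize is only meaningful together with the length function in use.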
- src/chunk/chunk.ts: Main entry point - the chunk() function dispatches to the appropriate strategy
- src/chunk/chunk-processor.ts: Post-processing logic shared by all strategies (overlap, references, post-processors)
- src/types.ts: Core types - Chunk, BaseChunkMetadata, BaseChunkingOptions, LengthFunction
- src/chunk/strategies/: Individual chunking strategies (each has its own options and metadata types)
- sentence: Splits on sentence boundaries
- recursive: Character-based recursive splitting with configurable separators
- token: Token-aware chunking (uses length function for token counting)
- json: Preserves JSON structure while chunking
- markdown: Header-aware chunking with context preservation (500 LOC, recently simplified)
- semantic: Embedding-based semantic chunking
- semantic-double-pass: Two-pass semantic chunking for better quality
Each strategy exports:
- chunkBy<Strategy>() function
- <Strategy>ChunkingOptions type (extends BaseChunkingOptions)
- <Strategy>ChunkMetadata type (extends BaseChunkMetadata)
The markdown chunker is header-based with token-aware merging:
- Split by headers (h1-h6) using regex
- Protect code blocks and tables from splitting
- Merge chunks bottom-up by depth to reach target chunkSize
- Track heading hierarchy for context
- Support context headers (breadcrumb, full, parent-only modes)
- Parse front matter
See packages/chunkaroo/src/chunk/strategies/markdown/markdown.ts for implementation (reduced from 1200 → 500 lines).
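The first two steps (splitting on headers while leaving code blocks intact) might look something like the sketch below. It is not taken from markdown.ts: the Section type and splitByHeaders function are hypothetical, and merging, tables, context headers, and front matter are all omitted.

```typescript
interface Section {
  heading: string | null; // null for text that precedes the first header
  depth: number; // 1-6 for h1-h6, 0 for the preamble
  body: string;
}

function splitByHeaders(markdown: string): Section[] {
  // Fence marker, built from single backticks so this sample can sit inside a fenced block.
  const fence = "`".repeat(3);
  const sections: Section[] = [];
  let current: Section = { heading: null, depth: 0, body: "" };
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Toggle on every fence line so header-like lines inside code are ignored.
    if (line.trim().indexOf(fence) === 0) inFence = !inFence;

    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      sections.push(current);
      current = {
        heading: match[2] ?? "",
        depth: (match[1] ?? "").length,
        body: "",
      };
    } else {
      current.body += line + "\n";
    }
  }
  sections.push(current);

  // Drop the preamble when nothing precedes the first header.
  return sections.filter((s) => s.heading !== null || s.body.trim() !== "");
}
```

Tracking depth per section is what would later allow a bottom-up merge, since the deepest sections can be folded into their parents first.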
All strategies use TypeScript generics to ensure type-safe metadata:
export async function chunkByMarkdown(
  text: string,
  options: MarkdownChunkingOptions,
): Promise<Chunk<MarkdownChunkMetadata>[]>

The chunk() wrapper function preserves these types through the StrategyRegistry type.
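One way a registry type can carry per-strategy types through a single entry point is sketched below. The Registry shape, the metadata fields, and chunkSketch are assumptions made for illustration; the actual StrategyRegistry in chunk.ts may be structured differently, and this sketch dispatches through a lookup table rather than a switch statement.

```typescript
interface BaseMeta {
  id: string;
}

interface Chunk<M extends BaseMeta> {
  text: string;
  metadata: M;
}

// Hypothetical per-strategy metadata.
interface SentenceMeta extends BaseMeta {
  sentenceCount: number;
}
interface MarkdownMeta extends BaseMeta {
  headings: string[];
}

// Hypothetical registry: maps each strategy name to its options and metadata types.
interface Registry {
  sentence: { options: { chunkSize: number }; metadata: SentenceMeta };
  markdown: { options: { chunkSize: number }; metadata: MarkdownMeta };
}

type StrategyName = keyof Registry;

// Stub implementations, typed per key through a mapped type.
const implementations: {
  [K in StrategyName]: (
    text: string,
    options: Registry[K]["options"],
  ) => Promise<Chunk<Registry[K]["metadata"]>[]>;
} = {
  sentence: async (text) => [{ text, metadata: { id: "s-0", sentenceCount: 1 } }],
  markdown: async (text) => [{ text, metadata: { id: "m-0", headings: [] } }],
};

// The generic K ties the strategy name to its option and metadata types.
async function chunkSketch<K extends StrategyName>(
  strategy: K,
  text: string,
  options: Registry[K]["options"],
): Promise<Chunk<Registry[K]["metadata"]>[]> {
  return implementations[strategy](text, options);
}
```

A caller of chunkSketch("markdown", ...) gets chunks whose metadata is typed as MarkdownMeta, so metadata.headings is available without a cast.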
- utils/regex-cache.ts: Centralized regex caching with concurrency safety (global regexes are cloned to prevent race conditions)
- utils/semantic-helpers.ts: Similarity calculations (cosine similarity) for semantic chunking
- utils/split-into-segments.ts: Text splitting utilities
- utils/config.ts: Global configuration management
- utils/logger.ts: Logging utilities
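For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their magnitudes. The function below is a generic sketch of that formula, not the code in semantic-helpers.ts; returning 0 for zero-length vectors is a choice made here.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }

  // A zero vector has no direction; treat it as having no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Values near 1 mean two segments point in nearly the same direction in embedding space, which is the signal a semantic chunker can use to decide whether adjacent segments belong together.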
IMPORTANT: The overlap implementation has known issues (tracked in TODO.md):
- Token strategy overlap may be incorrect
- Overlap is added in post-processing (not strategy-aware)
- Chunks become approximately chunkSize + overlap in size
- Smart word-boundary detection is used to avoid breaking words
When overlap > 0, each chunk (except first) gets text from the previous chunk prepended.
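The prepending behaviour can be sketched as follows. This is an approximation under stated assumptions: overlap is measured in characters, overlapTail and applyOverlapSketch are hypothetical names, and the boundary rule (skip forward to the next whitespace when the cut lands mid-word) is one plausible reading of "smart word-boundary detection", not the library's confirmed logic.

```typescript
// Returns up to `overlap` trailing characters of `previous`, never starting mid-word.
function overlapTail(previous: string, overlap: number): string {
  if (overlap <= 0) return "";
  if (previous.length <= overlap) return previous;

  let start = previous.length - overlap;
  const midWord =
    /\S/.test(previous.charAt(start - 1)) && /\S/.test(previous.charAt(start));
  if (midWord) {
    // Skip the partial word by moving forward to the next whitespace.
    const nextSpace = previous.slice(start).search(/\s/);
    if (nextSpace === -1) return ""; // the tail is a fragment of a single word
    start += nextSpace;
  }
  return previous.slice(start).trim();
}

// Prepends the tail of each original chunk to the chunk that follows it.
function applyOverlapSketch(chunks: string[], overlap: number): string[] {
  return chunks.map((text, i) => {
    if (i === 0) return text;
    const tail = overlapTail(chunks[i - 1] ?? "", overlap);
    return tail === "" ? text : `${tail} ${text}`;
  });
}

// applyOverlapSketch(["the quick brown fox", "jumps over"], 7)
// returns ["the quick brown fox", "fox jumps over"]
```

Because the tail is added on top of an already sized chunk, the result can exceed chunkSize by up to the overlap amount, which matches the size growth noted in the list above.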
The lengthFunction should primarily be used for checking chunk size limits, NOT for calculating start/end indices. There's a known TODO to audit and fix this usage pattern across strategies.
Tests are co-located in __tests__/ directories:
- src/chunk/strategies/__tests__/ for strategy tests
- src/utils/__tests__/ for utility tests
- Use __snapshots__/ for snapshot testing where appropriate
Coverage thresholds are enforced at 80% for branches, functions, lines, and statements.
When working with regexes:
- Use regex-cache.ts for frequently-used patterns
- Global regexes must be cloned before returning (see getGlobalRegex())
- This prevents race conditions in concurrent chunking scenarios
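The reason cloning matters is that a RegExp with the g (or y) flag stores its scan position in lastIndex, so two callers sharing one instance can disturb each other's matching. The sketch below shows the general caching-plus-cloning idea; getCachedRegexSketch is a hypothetical name and is not the getGlobalRegex() implementation.

```typescript
const regexCacheSketch = new Map<string, RegExp>();

function getCachedRegexSketch(source: string, flags = ""): RegExp {
  // Flags contain only letters, so "flags/source" is an unambiguous key.
  const key = `${flags}/${source}`;
  let regex = regexCacheSketch.get(key);
  if (!regex) {
    regex = new RegExp(source, flags);
    regexCacheSketch.set(key, regex);
  }

  // Global and sticky regexes carry mutable lastIndex state,
  // so each caller gets a fresh copy that starts at position 0.
  return regex.global || regex.sticky
    ? new RegExp(regex.source, regex.flags)
    : regex;
}
```

Stateless regexes are returned as the shared cached instance, while stateful ones are copied, so one caller's progress through a string never leaks into another's.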
To add a new strategy:
- Create src/chunk/strategies/my-strategy.ts
- Define MyStrategyChunkMetadata extending BaseChunkMetadata
- Define MyStrategyChunkingOptions extending BaseChunkingOptions
- Implement the chunkByMyStrategy() function
- Call postProcessChunks() before returning
- Add it to StrategyRegistry in chunk.ts
- Add a case to the switch statement in the chunk() function
- Create tests in __tests__/my-strategy.test.ts
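As a rough picture of the strategy-file steps above, here is a self-contained skeleton for a hypothetical paragraph strategy. The base types and postProcessChunksSketch are local stand-ins written for this example; a real strategy would import BaseChunkMetadata, BaseChunkingOptions, and postProcessChunks() from the library, whose exact signatures are not shown in this file.

```typescript
// Local stand-ins for the library's base types (assumed shapes).
interface BaseChunkMetadataSketch {
  id: string;
  startIndex: number;
  endIndex: number;
}
interface BaseChunkingOptionsSketch {
  chunkSize: number;
}
interface ChunkSketch<M extends BaseChunkMetadataSketch> {
  text: string;
  metadata: M;
}

// Strategy-specific metadata and options extend the base types.
interface ParagraphChunkMetadata extends BaseChunkMetadataSketch {
  paragraphIndex: number;
}
interface ParagraphChunkingOptions extends BaseChunkingOptionsSketch {
  separator?: string;
}

// Placeholder for the shared post-processing step; it returns chunks unchanged.
async function postProcessChunksSketch<M extends BaseChunkMetadataSketch>(
  chunks: ChunkSketch<M>[],
): Promise<ChunkSketch<M>[]> {
  return chunks;
}

// Splits on a separator and records character offsets; chunkSize is ignored for brevity.
async function chunkByParagraph(
  text: string,
  options: ParagraphChunkingOptions,
): Promise<ChunkSketch<ParagraphChunkMetadata>[]> {
  const separator = options.separator ?? "\n\n";
  const chunks: ChunkSketch<ParagraphChunkMetadata>[] = [];
  let cursor = 0;
  let paragraphIndex = 0;

  for (const part of text.split(separator)) {
    const startIndex = cursor;
    const endIndex = startIndex + part.length;
    cursor = endIndex + separator.length;
    if (part.trim() === "") continue; // skip empty paragraphs

    chunks.push({
      text: part,
      metadata: { id: `paragraph-${paragraphIndex}`, startIndex, endIndex, paragraphIndex },
    });
    paragraphIndex++;
  }

  // Always hand off to post-processing before returning.
  return postProcessChunksSketch(chunks);
}
```

Note that startIndex and endIndex are derived from character positions, not from a length function, which keeps them valid even when sizes are measured in tokens.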
Post-processors transform chunks after main processing:
export function myProcessor(
  chunk: Chunk,
  index: number,
  chunks: Chunk[]
): Chunk {
  return {
    ...chunk,
    metadata: {
      ...chunk.metadata,
      customField: 'value'
    }
  };
}

See src/chunk/post-processors/add-context-headers.ts for a reference implementation.
To add custom metadata fields:
- Extend the strategy's metadata interface
- Populate fields before calling postProcessChunks()
- Ensure metadata is preserved through post-processing
Base metadata always includes: id, startIndex, endIndex, lines, previousChunkId, nextChunkId