🧩 Chunklet-py

“One library to split them all: Sentence, Code, Docs”

Warning

Quick heads up! Version 2 has some breaking changes. No worries though - check our Migration Guide for a smooth upgrade!

Hey! Welcome. Let's make some text chunking magic happen.

-- documentation site --

Why Smart Chunking? (Or: Why Not Just Split on Character Count?)

You could split your text by character count or random line breaks. But that's like trying to cut a wedding cake with a chainsaw. 🎂

Dumb splitting causes problems:

Mid-sentence surprises: Your thoughts get chopped mid-way, losing all meaning
Language confusion: Non-English text and code structures get treated the same
Lost context: Each chunk forgets what came before

Smart chunking solves this by:

Smart limits — Respects both natural boundaries (sentences, paragraphs, sections) AND configurable limits (tokens, lines, functions)
Language-aware — Detects language automatically and applies the right rules (50+ languages supported)
Context preservation — Overlap between chunks, rich metadata (source, span, document structure)
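To make the difference concrete, here is a minimal sketch comparing a naive character split with a sentence-aware one. The two helper functions are hypothetical and written only for this illustration; the regex is a simplistic English-only rule, not the language-aware logic Chunklet-py actually uses.

```python
import re


def split_by_chars(text: str, size: int) -> list:
    """Naive splitting: cut every `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str, max_sentences: int) -> list:
    """Boundary-aware splitting: group whole sentences, never cut inside one."""
    # Simplistic rule (an assumption for this sketch): a sentence ends at
    # '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


text = "Chunking matters. Bad splits lose meaning. Good splits keep it."
naive = split_by_chars(text, 25)
smart = split_by_sentences(text, 2)

print(naive[0])  # Chunking matters. Bad spl
print(smart[0])  # Chunking matters. Bad splits lose meaning.
```

The naive version chops "splits" in half, while the sentence-aware version always ends a chunk on a full sentence.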

🤔 So What's Chunklet-py Anyway? (And Why Should You Care?)

Chunklet-py is a developer-friendly text splitting library designed to be the most versatile chunking solution — for devs, researchers, and AI engineers. It goes way beyond basic character counting. I built this because I was tired of terrible chunking options. Chunklet-py intelligently chunks text, documents, and code into meaningful, context-aware pieces — perfect for RAG pipelines and LLM applications.

Key features:

Composable constraints — Mix and match limits (sentences, tokens, sections) to get exactly the chunks you need
Pluggable architecture — Swap in custom tokenizers, sentence splitters, or processors
Rich metadata — Every chunk comes with source references, spans, and structural info
Multi-format support — PDF, DOCX, EPUB, Markdown, HTML, LaTeX, ODT, CSV, Excel, and plain text

Available tools:

SentenceSplitter — Lightweight sentence tokenization
DocumentChunker — Natural language with semantic boundaries
CodeChunker — Language-aware code chunking
ChunkVisualizer — Interactive web-based exploration

Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.

| Feature | Why it's awesome |
| --- | --- |
| 🚀 Blazingly Fast | Leverages efficient parallel processing to chunk large volumes of content with remarkable speed. |
| 🪶 Featherlight Footprint | Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead. |
| 🗂️ Rich Metadata for RAG | Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications. |
| 🔧 Infinitely Customizable | Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors. |
| 🌐 Multilingual Mastery | Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms. |
| 🧑‍💻 Code-Aware Intelligence | Language-agnostic code chunking that understands and preserves the structural integrity of your source code. |
| 🎯 Precision Chunking | Flexible chunking with configurable limits based on sentences, tokens, sections, lines, and functions. |
| 📄 Document Format Mastery | Processes a wide array of document formats including `.pdf`, `.docx`, `.epub`, `.txt`, `.tex`, `.html`, `.hml`, `.md`, `.rst`, `.rtf`, `.odt`, `.csv`, and `.xlsx`. |
| 💻 Triple Interface: CLI, Library & Web | Use it as a command-line tool, import as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning. |

And that's just the start - there's plenty more to explore!

Note

For the full documentation experience, check out our documentation site.

📦 Installation

Ready to get Chunklet-py running? Awesome! Let's get you set up quickly and painlessly.

Note

chunklet-py was previously published as chunklet. The old chunklet package is no longer maintained, so install chunklet-py to get the latest version.

The Quick & Easy Way

The simplest way to get started is with pip:

```
# Install and check it's working
pip install chunklet-py
chunklet --version
```

That's it! You're all set to start chunking.
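If you'd rather confirm the install from Python than from the shell, a small sketch like this should do it. It only uses the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical name made up for this example, and the distribution name `chunklet-py` is the one from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "chunklet-py") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found or "chunklet-py is not installed in this environment")
```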

Extra Features (Optional)

Want to unlock more Chunklet-py superpowers? Add these optional dependencies based on what you need:

Document Processing: For handling .pdf, .docx, .epub, and other document formats:
```
pip install "chunklet-py[structured-document]"
```
Code Chunking: For advanced code analysis and chunking features:
```
pip install "chunklet-py[code]"
```
Visualization: For the interactive web-based chunk visualizer:
```
pip install "chunklet-py[visualization]"
```
All Extras: To install all optional dependencies:
```
pip install "chunklet-py[all]"
```

The From-Source Way

Prefer building from source? You can clone and install manually for full control:

```
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
# Quote the extras so shells like zsh don't try to expand the brackets
pip install ".[all]"
```

(But honestly, the pip way is usually way easier!)

Want to Help Make Chunklet-py Even Better?

That's awesome! We'd love to have you contribute. Check out our Contributing Guide first, then set up your development environment:

```
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
# For basic development (testing, linting)
pip install -e ".[dev]"
# For documentation development
pip install -e ".[docs]"
# For comprehensive development (including all optional features like document and code chunking + docs dependencies)
pip install -e ".[dev-all]"
```

These install Chunklet-py in "editable" mode so your code changes take effect immediately. The different options give you just the dependencies you need.

Go forth and code! (And remember, good developers write tests. We appreciate excellent code examples!)

Quick Reference 🛠️

Note

For the exhaustive details that I know you're probably avoiding, check the official docs.

The Constraint-Based Logic

Chunklet-py is basically a "choose your own adventure" for data. It's constraint-based, meaning you can swap, combine, or ignore the limits below as you see fit.

The Golden Rule: You must provide at least one constraint, or the chunker has no idea when to stop.
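To picture what "constraint-based" means in practice, here is a rough sketch under stated assumptions. The `pack` function is hypothetical and is not Chunklet-py's implementation; it counts whitespace-separated words as "tokens" purely for illustration. It shows the two ideas from this section: at least one limit is required, and a chunk closes as soon as any active limit would be exceeded.

```python
from typing import List, Optional, Sequence


def pack(
    sentences: Sequence[str],
    max_sentences: Optional[int] = None,
    max_tokens: Optional[int] = None,
) -> List[str]:
    """Group sentences into chunks, closing a chunk when any limit would be exceeded."""
    if max_sentences is None and max_tokens is None:
        # The Golden Rule: without a limit there is no stopping condition.
        raise ValueError("Provide at least one constraint")

    chunks: List[str] = []
    current: List[str] = []
    tokens = 0
    for sentence in sentences:
        n = len(sentence.split())  # stand-in token counter (assumption)
        over_sentences = max_sentences is not None and len(current) + 1 > max_sentences
        over_tokens = max_tokens is not None and tokens + n > max_tokens
        if current and (over_sentences or over_tokens):
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sentence)
        tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six.", "Seven eight nine ten."]
by_sentences = pack(sentences, max_sentences=2)
by_tokens = pack(sentences, max_tokens=4)
```

Here `by_sentences` pairs the sentences two by two, while `by_tokens` closes a chunk whenever the next sentence would push the word count past four.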

Core Imports

Pick your weapon based on whatever data mess you're currently cleaning up.

```
from chunklet import DocumentChunker   # For PDFs, DOCX, and general text chaos
from chunklet import CodeChunker       # For source code (it actually respects brackets)
from chunklet import SentenceSplitter  # For when you just need to split sentences
from chunklet import visualizer        # Web-based chunk visualizer
```

Configuration & Limits

These tools don't share arguments, so don't try to use max_functions on a PDF unless you want to see a very confused Python interpreter.

DocumentChunker (Text & Docs)

Perfect for natural language where you don't want to cut someone off mid-sentence.

```
chunker = DocumentChunker()

# Feel free to mix and match these
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after X sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0               # Skip the first N sentences if you're feeling adventurous
)
```
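If `overlap_percent` and `offset` feel abstract, this sketch shows one plausible way they could behave. It is an assumption-heavy illustration, not the library's algorithm: the `chunk_with_overlap` function is hypothetical, and it measures overlap in whole sentences, whereas the real chunker may measure it differently.

```python
from typing import List, Sequence


def chunk_with_overlap(
    sentences: Sequence[str],
    max_sentences: int,
    overlap_percent: int = 0,
    offset: int = 0,
) -> List[List[str]]:
    """Sliding window over sentences; each chunk repeats the tail of the previous one."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be in [0, 100)")

    remaining = list(sentences[offset:])             # offset: skip the first N sentences
    overlap = max_sentences * overlap_percent // 100  # sentences shared with the previous chunk
    step = max_sentences - overlap                    # always >= 1 because overlap_percent < 100

    chunks: List[List[str]] = []
    for start in range(0, len(remaining), step):
        chunks.append(remaining[start:start + max_sentences])
        if start + max_sentences >= len(remaining):
            break  # the last window already reached the end
    return chunks


sentences = [f"S{i}." for i in range(1, 8)]  # S1. through S7.
chunks = chunk_with_overlap(sentences, max_sentences=5, overlap_percent=20)
```

With five sentences per chunk and 20% overlap, one sentence is carried over, so the second chunk starts again at `S5.`.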

CodeChunker (Source Code)

Logic-aware. It doesn't do "overlap" because duplicate code is a hallucination waiting to happen.

```
chunker = CodeChunker()

# Again, use whichever constraints make sense for your file
chunks = chunker.chunk_text(
    text,
    max_lines=50,          # Height limit
    max_tokens=512,        # Width limit
    max_functions=1,       # One function per chunk (keeps things tidy)
    strict=True            # True: Crash on big blocks; False: Slice 'em up anyway
)
```
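For a feel of what `max_functions` and `strict` imply, here is a simplified sketch that only handles Python source and uses the standard `ast` module. The `chunk_functions` helper is hypothetical and is not how CodeChunker works internally (the library is language-agnostic); in this sketch `max_lines` is checked per function, which is also an assumption.

```python
import ast
from typing import List


def chunk_functions(
    source: str,
    max_functions: int = 1,
    max_lines: int = 50,
    strict: bool = True,
) -> List[str]:
    """Group top-level functions into chunks of at most `max_functions` each."""
    tree = ast.parse(source)
    blocks = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

    chunks: List[str] = []
    group: List[str] = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) > max_lines:
            if strict:
                # strict=True: refuse to break a block that is too big
                raise ValueError(f"Block of {len(lines)} lines exceeds max_lines={max_lines}")
            # strict=False: flush what we have, then slice the block by line count
            if group:
                chunks.append("\n\n".join(group))
                group = []
            for i in range(0, len(lines), max_lines):
                chunks.append("\n".join(lines[i:i + max_lines]))
            continue
        group.append(block)
        if len(group) == max_functions:
            chunks.append("\n\n".join(group))
            group = []
    if group:
        chunks.append("\n\n".join(group))
    return chunks


source = "def a():\n    return 1\n\ndef b():\n    x = 1\n    y = 2\n    return x + y\n"
one_per_chunk = chunk_functions(source, max_functions=1)
sliced = chunk_functions(source, max_functions=1, max_lines=3, strict=False)
```

With `strict=False` and `max_lines=3`, the four-line function `b` is cut into two pieces instead of raising an error.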

The Output Object

The chunkers return a list (or generator) of Chunk objects. These are Box instances, so you can use dot notation like a civilized developer.

```
for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata
    print()                # Because whitespace is free
```
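In case "Box instances" is new to you, the idea is a dictionary whose keys can also be read as attributes. The `DotDict` class below is a tiny stand-in written for this example so it runs without extra installs; it is not the real Box class, and the metadata keys shown (`source`, `span`) are illustrative values rather than a guaranteed schema.

```python
class DotDict(dict):
    """Tiny stand-in for a Box: dict keys are also readable as attributes."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so dot access keeps working one level down
        return DotDict(value) if isinstance(value, dict) else value


chunk = DotDict(
    content="def hello(): ...",
    metadata={"source": "example.py", "span": (0, 16)},  # illustrative keys only
)

print(chunk.content)          # attribute-style access
print(chunk["content"])       # plain dict access still works
print(chunk.metadata.source)  # nested access
```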

Input Methods (Chunkers Only)

These helper methods are for the DocumentChunker and CodeChunker. The SentenceSplitter is a simple soul and only takes strings.

| Method | Input | Return Type |
| --- | --- | --- |
| `chunk_text(text)` | str | List[Chunk] |
| `chunk_file(path)` | Path or str | List[Chunk] |
| `chunk_texts(list)` | List[str] | Generator[Chunk] |
| `chunk_files(list)` | List[Path] | Generator[Chunk] |
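The list versus generator split in the table matters when you consume the results. The sketch below uses two hypothetical stand-in functions (`chunk_one` and `chunk_many`, not Chunklet-py methods) to show the general Python behavior: a list can be indexed and reused, while a generator is lazy and can only be walked once.

```python
from typing import Iterable, Iterator, List


def chunk_one(text: str) -> List[str]:
    """Stand-in for a single-input method: returns a list right away."""
    return [part.strip() for part in text.split(".") if part.strip()]


def chunk_many(texts: Iterable[str]) -> Iterator[str]:
    """Stand-in for a batch method: yields chunks lazily, one input at a time."""
    for text in texts:
        yield from chunk_one(text)


single = chunk_one("A. B.")           # a list: len(), indexing, re-iteration all work
stream = chunk_many(["A. B.", "C."])  # a generator: nothing has been processed yet

first = next(stream)    # pulls just one chunk
rest = list(stream)     # drains whatever is left
leftover = list(stream) # already exhausted, so this is empty
```

If you need `len()` or random access on batch results, wrapping the generator in `list(...)` is the usual trade of memory for convenience.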

Specialized Tools

SentenceSplitter

The "lite" version for when you just need sentences and no fancy metadata.

```
splitter = SentenceSplitter()

# 'auto' usually guesses right, but you can specify 'en', 'es', etc.
sentences = splitter.split_text(text, lang="auto")
```

CLI (Command Line Interface)

If you prefer the terminal to an IDE, the CLI is packed with features. Just ask for help.

```
chunklet --help
chunklet split --help
chunklet chunk --help
chunklet visualize --help
chunklet [COMMAND] [OPTIONS*]
```

🗺 Features & Roadmap

How Chunklet-py Compares

While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:

| Library | Key Differentiator | Focus |
| --- | --- | --- |
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs. | Full Stack |
| Chonkie | All-in-one pipeline (chunking + embeddings + vector DB). Uses `tree-sitter` for code. Multilingual. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives. | Text |
| CintraAI Code Chunker | Code-specific, uses `tree-sitter`. Initially supports Python, JS, CSS only. | Code |

Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. It handles text, documents, and code without heavy dependencies, while keeping your project lightweight.

🙌 Contributors & Thanks

A huge thank you to the awesome people who helped shape Chunklet-py:

@jmbernabotto — for helping mostly on the CLI part, suggesting fixes, features, and design improvements.
@arnoldfranz — for reporting the CLI Path Validation Bug (#6) that helped improve error handling.

📜 License

Check out the LICENSE file for all the details.

MIT License. Use freely, modify boldly, and credit appropriately! (We're not that legendary... yet 😉)
