Transform PDFs into clean Markdown and chunk them right: inspect, edit, and export for RAG.
If you like this project, a star would mean a lot and keep you updated on new features :)
Chunky is a local, open-source tool that makes chunk validation the first-class citizen it should be in any RAG pipeline. Before you index a single vector, Chunky lets you see exactly what your chunks look like, and fix what's wrong.
The core workflow is simple: bring your document as a PDF or an existing Markdown file, pick a conversion strategy and a chunking strategy, and inspect every chunk side-by-side with the source. If something looks off, edit it directly in the UI. Only when the chunks are clean do you export them for indexing.
Chunky is in early development and actively evolving: new features and improvements are on the way!
Chunking is one of the most underestimated steps in a RAG pipeline. As NVIDIA's research shows in Finding the Best Chunking Strategy for Accurate AI Responses, no single strategy wins universally: the right choice depends on content type and query characteristics, and poor chunking directly degrades retrieval quality and answer coherence. Chunking is not a set-and-forget parameter, yet most tools give you zero visibility into what your chunks actually look like. That's the gap Chunky fills.
New to this space? Check out Agentic RAG for Dummies, a hands-on implementation of Agentic RAG, the natural evolution of a basic RAG system.
- Bring your document: upload a PDF and let Chunky convert it to Markdown, or upload a Markdown file directly if you already have one. Your existing conversion is never overwritten.
- Choose your converter: pick the PDF-to-Markdown engine that best fits your document type. Not happy with the result? Switch converters and re-run without losing your work.
- Validate the Markdown: review the converted text side-by-side with the original PDF before chunking. Catch conversion artifacts early.
- Chunk and inspect: choose a splitting library and strategy, and see every chunk color-coded and numbered. Spot boundaries that are too aggressive or too loose at a glance.
- Edit and fix: click any chunk to edit its content directly. No need to re-run the whole pipeline to fix one bad split.
- Export: save clean, validated chunks as timestamped JSON, ready to feed into your vector store.
- Side-by-side PDF + Markdown viewer with synchronized scrolling
- Four PDF-to-Markdown converters (PyMuPDF, Docling, MarkItDown, VLM), skipped if you upload an existing Markdown file
- Re-convert on the fly: switch converters and regenerate Markdown without restarting the pipeline
- Two splitting libraries: LangChain (4 strategies) and Chonkie (8 strategies)
- Color-coded chunk visualization with per-chunk editing
- Pluggable, decorator-based architecture: add a new converter or splitter in minutes with zero frontend changes
- Export chunks as timestamped JSON, ready for indexing
- Dynamic `/api/capabilities` endpoint: the frontend discovers all available converters and strategies automatically at startup
Chunky ships with four converters out of the box. You can switch between them in the UI at any time and re-convert the document without losing your chunking settings.
| Converter | Library | Best for |
|---|---|---|
| PyMuPDF (default) | `pymupdf4llm` | Fast conversion of standard digital PDFs with selectable text |
| Docling | `docling` | Complex layouts: multi-column documents, tables, and figures |
| MarkItDown | `markitdown[all]` | Broad-format documents, simple and deterministic output |
| VLM | `openai` + any vision model | Scanned PDFs, handwriting, diagrams; anything a human can read |
The VLM converter rasterises each page at 300 DPI and sends it to any OpenAI-compatible vision model. By default it targets a locally running Ollama instance: no API key, no internet access required.
```python
# Default: Ollama (local, no API key needed)
VLMConverter()

# Different local model
VLMConverter(model="minicpm-v")

# OpenAI
VLMConverter(model="gpt-4o", base_url="https://api.openai.com/v1", api_key="sk-...")

# Google Gemini
VLMConverter(
    model="gemini-2.5-flash",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="AIza...",
)
```

VLM conversions report per-page progress, which the frontend polls via `GET /api/convert-progress/(unknown)`.
Chunky supports two splitting libraries, each exposing multiple strategies. The library and strategy are selected independently in the UI.
Install: `pip install langchain-text-splitters tiktoken`
| Strategy | Description |
|---|---|
| Token | Splits on token boundaries via `tiktoken`. Ideal for LLM context-window management. |
| Recursive | Tries paragraph → sentence → word boundaries in order. |
| Character | Splits on `\n\n` paragraphs, falls back to `chunk_size` characters. |
| Markdown | Two-phase split: H1/H2/H3 headers first, then optional size cap via `RecursiveCharacterTextSplitter` (activate with `enable_markdown_sizing`). |
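To make the Markdown strategy's two-phase idea concrete, here is a simplified pure-Python sketch. It is illustrative only; the actual strategy delegates to LangChain's header and recursive splitters rather than this naive logic.

```python
import re

def split_markdown_two_phase(text: str, max_chars: int = 200) -> list[str]:
    # Phase 1: split at H1/H2/H3 headers, keeping each header with its section.
    sections = [s for s in re.split(r"(?m)^(?=#{1,3} )", text) if s.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Phase 2 (size cap): naively fall back to paragraph boundaries.
            chunks.extend(p for p in section.split("\n\n") if p.strip())
    return chunks
```

Headers stay attached to the text they introduce, which is the main reason to prefer a Markdown-aware split over a blind character split for structured documents.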
Install: `pip install chonkie[all]`
| Strategy | Description |
|---|---|
| Token | Splits on token boundaries. Fast, no external tokeniser needed. |
| Fast | SIMD-accelerated byte-based chunking at 100+ GB/s. Best for high-throughput pipelines. |
| Sentence | Splits at sentence boundaries. Preserves semantic completeness. |
| Recursive | Recursively splits using structural delimiters (paragraphs → sentences → words). Note: `chunk_overlap` is not supported. |
| Table | Splits large Markdown tables by row while preserving headers. Ideal for tabular data. |
| Code | Splits source code using AST-based structural analysis. Supports multiple languages. |
| Semantic | Groups content by embedding similarity. Best for preserving topical coherence. |
| Neural | Uses a fine-tuned BERT model to detect semantic shifts. Great for topic-coherent chunks. |
Note: The Semantic and Neural strategies download ML models on first use and may be slow to initialise.
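Across both libraries, the `chunk_size` / `chunk_overlap` parameters follow the same sliding-window mechanics. The following is a minimal illustration over pre-tokenised input, a sketch rather than either library's implementation:

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Fixed-size windows that advance by (chunk_size - chunk_overlap) tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

Each window shares its first `chunk_overlap` tokens with the tail of the previous window, which is why overly large overlaps inflate index size without improving recall.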
The converter and splitter layers are designed to be extended with minimal boilerplate. Both use a decorator-based registry: adding a new converter or splitter strategy automatically exposes it through the `/api/capabilities` endpoint and the UI, with no frontend changes needed.
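The mechanism behind such a registry can be sketched in a few lines. This is an illustrative toy, not Chunky's actual code (the real `register_converter` lives in `backend/registry.py` and may differ):

```python
_CONVERTERS: dict = {}

def register_converter(name, label, description):
    """Class decorator that records converter metadata at import time."""
    def decorator(cls):
        _CONVERTERS[name] = {"label": label, "description": description, "cls": cls}
        return cls
    return decorator

@register_converter(name="demo", label="Demo", description="Example entry.")
class DemoConverter:
    pass

def capabilities():
    """What an endpoint like /api/capabilities could serve (metadata only)."""
    return {n: {k: v for k, v in m.items() if k != "cls"} for n, m in _CONVERTERS.items()}
```

Because registration happens as a side effect of importing the module, simply importing a converter file is enough to make it discoverable.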
Every converter inherits from `PDFConverter` (`backend/converters/base.py`):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class PDFConverter(ABC):
    @abstractmethod
    def convert(self, pdf_path: Path) -> str:
        """Convert a PDF to a Markdown string."""

    def validate_path(self, pdf_path: Path) -> None:
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF file not found: {pdf_path}")
```

To add a new converter, you only need to do two things:
1. Create a new file in `backend/converters/` and decorate the class:

```python
# backend/converters/my_converter.py
from pathlib import Path

from backend.registry import register_converter
from .base import PDFConverter


@register_converter(
    name="my_converter",
    label="My Converter",
    description="Short description shown in the UI.",
)
class MyConverter(PDFConverter):
    def __init__(self) -> None:
        # Lazy imports are encouraged to avoid slowing down startup
        from my_library import MyParser
        self._parser = MyParser()

    def convert(self, pdf_path: Path) -> str:
        self.validate_path(pdf_path)
        return self._parser.to_markdown(str(pdf_path))
```

2. Import it in `capabilities_router.py`:

```python
import backend.converters.my_converter  # noqa: F401 - side-effect import
```

Done. The new converter appears automatically in `/api/capabilities` and the UI.
Every splitter inherits from `TextSplitter` (`backend/splitters/base.py`). Strategies are individual methods decorated with `@register_splitter`:

```python
# Inside an existing or new TextSplitter subclass
from backend.registry import register_splitter

@register_splitter(
    library="my_lib",
    library_label="My Library",
    strategy="my_strategy",
    label="My Strategy",
    description="Short description shown in the UI.",
)
def _split_my_strategy(self, request: ChunkRequest) -> List[ChunkItem]:
    # your splitting logic here
    splits = my_splitter.split(request.content, request.chunk_size)
    return self.build_chunks(request.content, splits, request.chunk_overlap)
```

Import the module in `capabilities_router.py` and add the strategy to the splitter's `_DISPATCH` table. The strategy will appear in the UI automatically.
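A helper like `build_chunks` presumably maps each split back to positions in the source text so the UI can highlight chunks. Here is one hypothetical way such a mapping could work; the function body and field names are illustrative guesses, not Chunky's actual schema:

```python
def build_chunks(content, splits):
    """Attach start/end character offsets to each split (illustrative only)."""
    chunks, cursor = [], 0
    for text in splits:
        start = content.find(text, cursor)
        if start == -1:  # normalisation may prevent an exact match
            start = cursor
        chunks.append({"text": text, "start": start, "end": start + len(text)})
        cursor = start + 1  # advance just past the start so overlapping splits still match
    return chunks
```

Searching from just past the previous start (rather than past its end) lets overlapping chunks resolve to the correct positions.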
```
docs/
  pdfs/                          # uploaded PDF files
  mds/                           # converted / uploaded Markdown files
  chunks/
    <stem>/                      # one directory per document
      <stem>_<UTC-ISO8601>.json  # timestamped chunk exports
```
Chunk files are stored in a normalised enriched format with placeholder fields for CleanedChunk, Title, Context, Summary, Keywords, and Questions, ready for a downstream enrichment pipeline.
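Assuming each exported chunk is a flat JSON object, a record might look like the sketch below. The placeholder field names come from the list above; the raw-text key and the empty-value types are guesses:

```python
import json

def make_chunk_record(chunk_text):
    """One exported chunk: raw text plus empty enrichment placeholders."""
    return {
        "Chunk": chunk_text,
        "CleanedChunk": "",
        "Title": "",
        "Context": "",
        "Summary": "",
        "Keywords": [],
        "Questions": [],
    }

print(json.dumps(make_chunk_record("Example chunk text."), indent=2))
```

Keeping the placeholders in the export means a downstream enrichment job only has to fill fields in, never reshape the file.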
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/capabilities` | All registered converters and splitter strategies |
| GET | `/api/documents` | List all available documents |
| GET | `/api/document/(unknown)` | Metadata + Markdown content for a document |
| GET | `/api/pdf/(unknown)` | Serve a PDF for inline viewing |
| POST | `/api/upload` | Upload one or more PDF / Markdown files |
| POST | `/api/convert/(unknown)` | Convert a PDF to Markdown |
| GET | `/api/convert-progress/(unknown)` | Poll VLM conversion progress |
| POST | `/api/md-to-pdf/(unknown)` | Convert Markdown back to PDF |
| DELETE | `/api/documents` | Delete one or more documents and derived files |
| POST | `/api/chunk` | Split text into chunks |
| POST | `/api/chunks/save` | Persist a chunk set to disk |
| GET | `/api/chunks/load/(unknown)` | Load the latest chunk set for a document |
Full interactive documentation is available at http://localhost:8000/docs.
There are two ways to run Chunky: locally or with Docker.
Locally:

```shell
git clone https://github.com/GiovanniPasq/chunky.git
cd chunky
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./start_all.sh
```

With Docker:

```shell
git clone https://github.com/GiovanniPasq/chunky.git
cd chunky
docker compose up --build
```

| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend | http://localhost:8000 |
| Swagger | http://localhost:8000/docs |
Contributions are welcome. Feel free to open an issue or submit a PR!