# Vlense

Vision-language OCR and multimodal document QA for images and PDFs.
Vlense helps you do two things well:
- extract structured or free-form content from images and PDFs with vision models
- build a page-level retrieval index over documents and ask grounded questions with citations
It is designed for workflows where plain OCR is not enough and the model needs to reason over full document pages, scans, tables, forms, and mixed visual layouts.
## Features

- OCR for images and PDFs with Markdown, HTML, or JSON output
- Pydantic schema support for structured extraction
- Page-image indexing for PDFs and image collections
- Multimodal retrieval with `colpali-engine`
- Grounded question answering over retrieved document pages
- Async Python API with a small surface area
## Installation

Install the package:

```shell
uv add vlense
```

Or install from source in this repository:

```shell
uv sync
```

PDF rendering uses `pdf2image`, so Poppler must be available on your system.
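Because Poppler is an external system dependency, it can be worth verifying before the first OCR run. The helper below is hypothetical (not part of Vlense); it checks for the `pdftoppm` binary that `pdf2image` shells out to:

```python
import shutil

def poppler_available() -> bool:
    # pdf2image invokes Poppler command-line tools such as pdftoppm;
    # if they are not on PATH, PDF rendering will fail.
    return shutil.which("pdftoppm") is not None

print(poppler_available())
```

On Debian/Ubuntu, Poppler is typically installed with `apt install poppler-utils`; on macOS, with `brew install poppler`.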
## Quick start: OCR

```python
import asyncio
import os

from vlense import Vlense


async def main():
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

    vlense = Vlense()
    result = await vlense.ocr(
        file_path=["./invoice.png", "./report.pdf"],
        model="openai/gpt-5-mini",
        format="markdown",
    )
    print(result["invoice.png"].content)


if __name__ == "__main__":
    asyncio.run(main())
```

## Document QA

```python
import asyncio
import os

from vlense import Vlense


async def main():
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

    vlense = Vlense()
    await vlense.index(
        data_dir="./handbook.pdf",
        collection_name="company-docs",
        index_dir="./.vlense",
        retriever_model="vidore/colSmol-500M",
    )

    answer = await vlense.ask(
        query="What are the eligibility requirements?",
        collection_name="company-docs",
        index_dir="./.vlense",
        model="openai/gpt-5-mini",
        top_k=3,
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
```

`Vlense.ask()` returns a grounded answer based on the retrieved page images, with cited page references.
## Retrieval

Vlense uses `colpali-engine` for page-image retrieval and defaults to `vidore/colSmol-500M`. This gives you:
- document-aware visual retrieval instead of plain text-only chunking
- a smaller default retriever than the heavier ColQwen variants
- a local collection format that stores rendered pages plus embeddings for reuse
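The retrieval step can be pictured with the late-interaction (MaxSim) scoring that ColPali-style retrievers use: each query-token embedding takes its maximum similarity over a page's patch embeddings, and those per-token maxima are summed. This is a conceptual sketch in NumPy, not the `colpali-engine` API:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score: query_emb is (q_tokens, dim), page_emb is
    (n_patches, dim); rows are assumed unit-normalized."""
    sim = query_emb @ page_emb.T         # (q_tokens, n_patches) cosine sims
    return float(sim.max(axis=1).sum())  # best-matching patch per query token

def unit(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
query = unit(rng.normal(size=(4, 8)))                       # 4 query tokens
pages = [unit(rng.normal(size=(16, 8))) for _ in range(3)]  # 3 pages, 16 patches each
scores = [maxsim_score(query, page) for page in pages]
best_page = int(np.argmax(scores))  # index of the highest-scoring page
```

As a sanity check, scoring a set of embeddings against itself yields the number of query tokens, since each token's best match is itself.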
## Example

The repository includes a runnable example for PDF question answering:

```shell
uv run python examples/pdf_qa.py ./document.pdf \
  --collection my-docs \
  --question "What does the report say about pricing?" \
  --vision-model openai/gpt-5-mini
```

## API

### `Vlense.ocr()`

Runs OCR over one or more images or PDFs and returns generated content in Markdown, HTML, or JSON.
Key options:

- `file_path`: single path or list of paths
- `model`: vision-capable model name
- `format`: `markdown`, `html`, or `json`
- `json_schema`: optional Pydantic schema for structured extraction
- `output_dir`: optional directory for persisted outputs
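For the `json_schema` option, a Pydantic model describes the fields to extract. The model below is illustrative (the field names are assumptions, not a schema Vlense ships); it would be passed as `json_schema=InvoiceFields` together with `format="json"`:

```python
from pydantic import BaseModel

class InvoiceFields(BaseModel):
    # Illustrative fields for structured invoice extraction
    vendor: str
    invoice_number: str
    total: float

# e.g. await vlense.ocr(file_path="./invoice.png", model="openai/gpt-5-mini",
#                       format="json", json_schema=InvoiceFields)
print(sorted(InvoiceFields.model_json_schema()["properties"]))
```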
### `Vlense.index()`

Builds a local multimodal retrieval collection from PDFs or images.

Key options:

- `data_dir`: file path, list of paths, or directory
- `collection_name`: logical name for the collection
- `index_dir`: storage root for page renders and embeddings
- `retriever_model`: `colpali-engine` checkpoint name
### `Vlense.ask()`

Searches an indexed collection, retrieves the most relevant pages, and asks a vision model to answer using those pages as evidence.

Key options:

- `query`: user question
- `collection_name`: existing indexed collection
- `model`: answer model such as `openai/gpt-5-mini`
- `top_k`: number of retrieved pages to ground the answer
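The `top_k` step can be sketched as a simple rank-and-keep over scored pages (a conceptual illustration, not Vlense internals — page ids and scores here are made up):

```python
import heapq

def top_k_pages(scored_pages: list[tuple[float, str]], top_k: int) -> list[str]:
    """scored_pages: (retrieval score, page id) pairs; returns the best page ids."""
    best = heapq.nlargest(top_k, scored_pages)  # highest scores first
    return [page_id for _, page_id in best]

pages = [(0.42, "handbook.pdf:3"), (0.91, "handbook.pdf:12"), (0.77, "handbook.pdf:14")]
print(top_k_pages(pages, top_k=2))  # → ['handbook.pdf:12', 'handbook.pdf:14']
```

The selected pages are then handed to the vision model as image evidence for the answer.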
## CI and releases

GitHub Actions runs CI on pushes and pull requests. Tagged releases publish to PyPI and create a GitHub Release.

Repository setup:

- add a repository secret named `PYPI_API_TOKEN`

Release flow:

```shell
git tag v0.2.4
git push origin v0.2.4
```

## Development

This repository uses uv, not pip.
Useful commands:

```shell
uv sync
uv run python -m unittest vlense.tests.test_vlense
uv build
```

## Contributing

Issues and pull requests are welcome.

## License

MIT