Do you have lots of PDFs that you want to search through visually? Do you have an NVIDIA GPU? If yes, this is for you!
- on the web gui with admin mode enabled, or via cli
- Install Docker.
- Clone the repo: `git clone https://github.com/pdfiles/pdfiles.git`, then `cd pdfiles`.
- Configure: `cp .env.example .env`, then edit `.env` and set `DATA_PATH` to your documents folder.
- Start: `./pdfiles.sh up` (on Windows: `pdfiles.bat up`).
- Open http://localhost
To stop: ./pdfiles.sh down (or pdfiles.bat down on Windows)
First startup downloads the search model (~4 GB) and takes 2-3 minutes.
| Command | Description |
|---|---|
| `./pdfiles.sh up` | Start services |
| `./pdfiles.sh update` | Pull latest images and restart |
| `./pdfiles.sh down` | Stop services |
| `./pdfiles.sh logs` | View logs |
| `./pdfiles.sh status` | Health check |
| `./pdfiles.sh backup` | Backup index to sqlite |
| `./pdfiles.sh restore DIR` | Restore from a backup |
| `./pdfiles.sh --help` | All commands |
Qdrant data is stored in a Docker volume that persists across `down` and `up`, reboots, etc.
Updating with `./pdfiles.sh update` should not affect this data, but if you have spent a long time indexing files, it is always safest to run `./pdfiles.sh backup` first.
You do not need to reindex files to update.
- Docker and Docker Compose
- NVIDIA GPU (12+ GB VRAM) for indexing, or a prebuilt search index for CPU-only mode: `docker compose -f docker-compose.cpu.yml up -d`
pdfiles scans all of the PDFs in the folder you mount as DATA_PATH and saves each page's essence as a set of vectors. When you type something into search, your words are turned into vectors too, and the two sets of vectors are compared. The result is a ranked list of page images that are closest to what you searched for. The same comparison is what enables "find similar photos".
flowchart TD
PDF[PDF Pages] --> Bouncer --> Indexer
Indexer --> ColQwen2.5 --> Qdrant[(Qdrant)]
Query[Text / Image / Similar] --> ColQwen2.5
Qdrant -->|MaxSim| Results[Ranked Results]
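The MaxSim step in the flowchart can be illustrated with a small sketch. This is not pdfiles' actual code (the real comparison runs inside Qdrant on ColQwen2.5 embeddings). It assumes each query and each page is a list of vectors, and the names `maxsim` and `rank_pages` are hypothetical, chosen for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """For every query vector, take its best match among the page
    vectors, then sum those best matches into one score."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page against the query and sort best-first."""
    scored = [(name, maxsim(query_vecs, vecs)) for name, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings are much larger.
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "a.pdf#1": [[1.0, 0.0], [0.0, 1.0]],
        "b.pdf#3": [[0.5, 0.5]],
    }
    for name, score in rank_pages(query, pages):
        print(name, score)
```

"Find similar" fits the same pattern: the vectors of an already-indexed page would take the place of the query vectors.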

