PDfiles

Do you have lots of PDFs that you want to search through visually? Do you have an NVIDIA GPU? If so, this is for you!

Features

search through pdfs by text description (not OCR)

search for all photos similar to a particular photo

reverse image search — upload a photo to find similar pages

can use OPT files to speed up indexing

can export your index files to back up or share with others, from the web GUI with admin mode enabled or via the CLI

Quick Start

Install Docker

Clone the repo:

```
git clone https://github.com/pdfiles/pdfiles.git
cd pdfiles
```

Configure:

```
cp .env.example .env
# Edit .env and set DATA_PATH to your documents folder
```

Start:
```
./pdfiles.sh up
```
On Windows: `pdfiles.bat up`

Open http://localhost

To stop: `./pdfiles.sh down` (or `pdfiles.bat down` on Windows)

First startup downloads the search model (~4 GB) and takes 2-3 minutes.
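
Because the first startup can take a few minutes, it can be handy to wait for the web UI instead of refreshing the browser. The following is a minimal sketch, not part of PDfiles: `http_ok` and `wait_until_up` are hypothetical helper names, and it assumes that the page at http://localhost only starts answering once the services are ready, which the project does not state.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_until_up(
    url: str,
    attempts: int = 60,
    delay: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the URL up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_up("http://localhost")` after `./pdfiles.sh up` polls for up to about five minutes with the defaults. If it returns `False`, `./pdfiles.sh logs` is the place to look next.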

Usage

| Command | Description |
|---|---|
| `./pdfiles.sh up` | Start services |
| `./pdfiles.sh update` | Pull latest images and restart |
| `./pdfiles.sh down` | Stop services |
| `./pdfiles.sh logs` | View logs |
| `./pdfiles.sh status` | Health check |
| `./pdfiles.sh backup` | Backup index to sqlite |
| `./pdfiles.sh restore DIR` | Restore from a backup |
| `./pdfiles.sh --help` | All commands |

A note on updates

Qdrant data is stored in a Docker volume that persists across `down` and `up`, reboots, and so on. Updating with `./pdfiles.sh update` should not affect it, but if you have spent a long time indexing files, it is always safest to run `./pdfiles.sh backup` first.

You do not need to reindex files to update.

Requirements

Docker and Docker Compose
NVIDIA GPU (12+ GB VRAM) for indexing, or a prebuilt search index for CPU-only mode:
```
docker compose -f docker-compose.cpu.yml up -d
```

How it works

PDfiles scans all of the PDFs that you mount in DATA_PATH and saves each page's essence as a set of vectors. When you type something in search, your words are turned into vectors too, and the two sets of vectors are compared. The result is an ordered list of page images that are closest to what you searched for. The same comparison is what enables "find similar photos": the vectors of the page you picked stand in for the query.
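
To make the comparison step concrete, here is a small sketch of MaxSim, the scoring rule named in the Architecture diagram. It is an illustration under assumptions, not PDfiles' code: the function names (`maxsim`, `rank_pages`), the page labels, and the tiny 2-D vectors are all made up, and in the real system the vectors come from ColQwen2.5 and the scoring most likely happens inside Qdrant.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """For each query vector, keep its best match on the page, then sum those bests."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: dict[str, list[Vector]]
) -> list[tuple[str, float]]:
    """Score every page against the query and sort from best to worst."""
    scored = [(name, maxsim(query_vecs, vecs)) for name, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy index: two pages, two vectors each.
pages = {
    "report.pdf#1": [[1.0, 0.0], [0.8, 0.6]],
    "manual.pdf#3": [[0.0, 1.0], [0.6, 0.8]],
}

# A text query, already turned into vectors.
query = [[1.0, 0.0], [0.6, 0.8]]
ranking = rank_pages(query, pages)

# "Find similar": reuse one page's own vectors as the query.
similar = rank_pages(pages["report.pdf#1"], pages)
```

Summing per-vector best matches lets a page score well when different parts of it match different parts of the query, which is why the same function covers text search, image search, and "find similar".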

Architecture

```mermaid
flowchart TD
    PDF[PDF Pages] --> Bouncer --> Indexer
    Indexer --> ColQwen2.5 --> Qdrant[(Qdrant)]

    Query[Text / Image / Similar] --> ColQwen2.5
    Qdrant -->|MaxSim| Results[Ranked Results]
```
