Stop searching, start finding stuff
Find is a super-minimal search engine based on SQLite Full Text Search capabilities and Python. It is composed of two commands:
- A simple web crawler that uses asyncio to maximize index ingestion speed.
- A Flask app to enable end-users to find things.
- Find supports caching of web pages (a lost feature of Google) and de-duplication of pages with identical content. Back-link ranking tuning is in progress.
- Respects robots.txt
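Of the features above, robots.txt compliance is the easiest to illustrate. This is a minimal sketch using only the standard library's urllib.robotparser; the function and user-agent names are illustrative, not necessarily what Find's crawler uses internally.

```python
# Hypothetical sketch: parse a robots.txt body and check URLs against it.
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt: str, user_agent: str = "find-crawler"):
    """Return a can_fetch(url) predicate for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

# Example: a robots.txt that disallows /private for all agents.
can_fetch = make_robots_checker("User-agent: *\nDisallow: /private\n")
print(can_fetch("https://myhost.com/blog/post"))      # True
print(can_fetch("https://myhost.com/private/secret")) # False
```

In a real crawler the robots.txt body would be downloaded once per host and the predicate consulted before every fetch.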
Create a virtualenv and install the project:
python3 -m venv .venv
. .venv/bin/activate
pip install -e .

Run your first crawl:
crawl --seed https://myhost.com --same-host
Run the web interface with:
findgui
I needed a small search engine for my static web site. I asked ChatGPT 5.2 to design it, then refined the code. The initial prompt was:
Design a small python web application to implement a search engine.
The search must be performed on a SQLite database using
the SQLite Full Text Search (FTS5) extension.
Design the database model to be able to store simple html web pages.
Find is a compact, zero-conf solution for adding a search engine to a pre-existing blog site. It just works out of the box.
As a basic rule I will try to keep it below 2000 lines of code.
The project accepts pull requests: please open one with a comment describing the change, and ensure it passes the pylint checks.
SQLite has a full-text search capability called FTS5, which also offers English stemming out of the box.
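As a minimal sketch of what this looks like from Python (table and column names here are illustrative, not necessarily the ones Find uses):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The porter tokenizer gives English stemming out of the box.
con.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts5(url, title, body, tokenize='porter')"
)
con.execute(
    "INSERT INTO pages_fts VALUES (?, ?, ?)",
    ("https://myhost.com/post", "Crawling basics", "The crawler downloads pages"),
)
# Thanks to stemming, searching for 'download' matches 'downloads'.
rows = con.execute(
    "SELECT url FROM pages_fts WHERE pages_fts MATCH 'download'"
).fetchall()
print(rows)  # [('https://myhost.com/post',)]
```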
For the crawler, ChatGPT proposed async I/O (the aiohttp and aiosqlite libraries), which is a very good way to scale it: downloading web pages is a heavily I/O-bound activity and benefits from a non-blocking library.
The initial implementation had a locking problem, which we solved with a single-writer database task. SQLite is so fast that you will have a hard time tuning the writer queue: it is very difficult to saturate. To avoid data loss, I opted for a queue 4x the concurrency level.
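The single-writer pattern can be sketched as follows: fetcher tasks put rows on a queue and one dedicated task owns all database writes, so SQLite never sees concurrent writers. This is a simplified standalone sketch (the names CONCURRENCY, writer, fetcher are mine, and a list stands in for the aiosqlite inserts of the real crawler):

```python
import asyncio

CONCURRENCY = 8

async def writer(queue: asyncio.Queue, saved: list) -> None:
    """The only task allowed to 'write'; drains the queue until it sees None."""
    while (item := await queue.get()) is not None:
        saved.append(item)  # in the real crawler: an aiosqlite INSERT
        queue.task_done()

async def fetcher(n: int, queue: asyncio.Queue) -> None:
    # In the real crawler this would be an aiohttp download.
    await queue.put((f"https://myhost.com/page{n}", f"<html>{n}</html>"))

async def main() -> list:
    saved: list = []
    # The 4x rule: queue capacity is four times the concurrency level.
    queue: asyncio.Queue = asyncio.Queue(maxsize=CONCURRENCY * 4)
    w = asyncio.create_task(writer(queue, saved))
    await asyncio.gather(*(fetcher(n, queue) for n in range(CONCURRENCY)))
    await queue.put(None)  # sentinel: tell the writer to stop
    await w
    return saved

saved = asyncio.run(main())
print(len(saved))  # 8
```

The bounded queue also provides back-pressure: if the writer ever fell behind, fetchers would block on put() instead of piling up rows in memory.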
The crawler has a default delay to avoid overloading the target site. For this reason, it is pointless to use high concurrency if your delay is high: the delay, not the number of workers, becomes the throughput cap.
The overall project aims to be very compact (the "less is more" mantra).
The reindex command can be used to re-index the database.
- Better handling of search queries
This version features:
- Reindex
- robots.txt specification implementation
- flask rate limiter and minimal DDoS protection: https://flask-limiter.readthedocs.io/en/stable/
- bugfix on slow sites (>1 second response time)
- Search + index core
- docker compose
The links table is collected but not yet used in search. The idea is to use it to refine ranking with a PageRank-style score. To get an idea, try:
SELECT p.url, COUNT(*) AS out_links
FROM links l
JOIN pages p ON p.id = l.from_page_id
GROUP BY p.id
ORDER BY out_links DESC
LIMIT 20;
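As a hedged sketch of how that link graph could feed a ranking score, here is a classic power-iteration PageRank over an in-memory dict of out-links. The algorithm choice and parameter values are mine, not the project's; the real implementation would load the graph from the links table first.

```python
def pagerank(links: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Illustrative power iteration: links maps page -> list of out-link targets."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
        rank = new
    return rank

# Page 'a' links to 'b' and 'c'; both link back, so 'a' ranks highest.
scores = pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
print(max(scores, key=scores.get))  # a
```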
The ability to classify categories and tags in the full-text search could be useful for faceting and classification. "Auto discovery" of the taxonomies is a further idea.
Be happy!