Stop searching, start finding stuff
Find is a super-minimal search engine based on SQLite Full Text Search capabilities and Python. It is composed of two commands:
- A simple web crawler that uses asyncio to maximize index ingestion speed.
- A Flask app to enable end-users to find things.
- Find supports caching of web pages (a lost feature of Google) and de-duplication of pages with identical content. Back-link ranking tuning is in progress.
- Respects robots.txt
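Of the features above, robots.txt compliance is the easiest to illustrate. This is a minimal sketch using only the standard library's urllib.robotparser; the function and user-agent names are illustrative, not necessarily what Find's crawler uses internally.

```python
# Hypothetical sketch: parse a robots.txt body and check URLs against it.
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt: str, user_agent: str = "find-crawler"):
    """Return a can_fetch(url) predicate for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

# Example: a robots.txt that disallows /private for all agents.
can_fetch = make_robots_checker("User-agent: *\nDisallow: /private\n")
print(can_fetch("https://myhost.com/blog/post"))      # True
print(can_fetch("https://myhost.com/private/secret")) # False
```

In a real crawler the robots.txt body would be downloaded once per host and the predicate consulted before every fetch.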
Create a virtualenv and install the project:
python3 -m venv .venv
. .venv/bin/activate
pip install -e .

Run your first crawl:
crawl --seed https://myhost.com --same-host
Run the web interface with:
findgui
I needed a small search engine for my static web site. I asked ChatGPT 5.2 to design it, then refined the code. The initial prompt was:
Design a small python web application to implement a search engine.
The search must be performed on a SQLite database using
the SQLite Full Text Search (FTS5) extension.
Design the database model to be able to store simple html web pages.
Find is a compact, zero-conf solution for adding a search engine to a pre-existing blog site. It just works out of the box.
As a basic rule I will try to keep it below 2000 lines of code.
The project accepts pull requests: please open one with a comment describing the change, and ensure it passes the pylint checks.
SQLite has a full-text search capability called FTS5, which also offers English stemming out of the box.
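As a minimal sketch of what this looks like from Python (table and column names here are illustrative, not necessarily the ones Find uses):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The porter tokenizer gives English stemming out of the box.
con.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts5(url, title, body, tokenize='porter')"
)
con.execute(
    "INSERT INTO pages_fts VALUES (?, ?, ?)",
    ("https://myhost.com/post", "Crawling basics", "The crawler downloads pages"),
)
# Thanks to stemming, searching for 'download' matches 'downloads'.
rows = con.execute(
    "SELECT url FROM pages_fts WHERE pages_fts MATCH 'download'"
).fetchall()
print(rows)  # [('https://myhost.com/post',)]
```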
For the crawler, ChatGPT proposed async I/O (the aiohttp and aiosqlite libraries), which is a very good way to scale it: downloading web pages is a heavily I/O-bound activity and benefits from a non-blocking library.
The initial implementation had a locking problem, which we solved with a single-writer database task. SQLite is so fast that you will have a hard time tuning the writer queue: it is very difficult to saturate. To avoid data loss, I opted for a queue 4x the concurrency level.
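The single-writer pattern can be sketched as follows: fetcher tasks put rows on a queue and one dedicated task owns all database writes, so SQLite never sees concurrent writers. This is a simplified standalone sketch (the names CONCURRENCY, writer, fetcher are mine, and a list stands in for the aiosqlite inserts of the real crawler):

```python
import asyncio

CONCURRENCY = 8

async def writer(queue: asyncio.Queue, saved: list) -> None:
    """The only task allowed to 'write'; drains the queue until it sees None."""
    while (item := await queue.get()) is not None:
        saved.append(item)  # in the real crawler: an aiosqlite INSERT
        queue.task_done()

async def fetcher(n: int, queue: asyncio.Queue) -> None:
    # In the real crawler this would be an aiohttp download.
    await queue.put((f"https://myhost.com/page{n}", f"<html>{n}</html>"))

async def main() -> list:
    saved: list = []
    # The 4x rule: queue capacity is four times the concurrency level.
    queue: asyncio.Queue = asyncio.Queue(maxsize=CONCURRENCY * 4)
    w = asyncio.create_task(writer(queue, saved))
    await asyncio.gather(*(fetcher(n, queue) for n in range(CONCURRENCY)))
    await queue.put(None)  # sentinel: tell the writer to stop
    await w
    return saved

saved = asyncio.run(main())
print(len(saved))  # 8
```

The bounded queue also provides back-pressure: if the writer ever fell behind, fetchers would block on put() instead of piling up rows in memory.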
The crawler has a default delay to avoid overloading the target site. For this reason, it is pointless to use high concurrency if your delay is high: the delay, not the number of workers, becomes the throughput cap.
The overall project aims to be very compact (the "less is more" mantra).
The reindex command can be used to re-index the database.
- Better handling of search queries
This version features:
- Reindex
- robots.txt specification implementation
- flask rate limiter and minimal DDoS protection: https://flask-limiter.readthedocs.io/en/stable/
- bugfix on slow sites (>1 second response time)
- Search + index core
- docker compose
The links table is collected but not yet used in search. The idea is to use it to refine ranking with a PageRank-style score. To get an idea, try:
SELECT p.url, COUNT(*) AS out_links
FROM links l
JOIN pages p ON p.id = l.from_page_id
GROUP BY p.id
ORDER BY out_links DESC
LIMIT 20;
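As a hedged sketch of how that link graph could feed a ranking score, here is a classic power-iteration PageRank over an in-memory dict of out-links. The algorithm choice and parameter values are mine, not the project's; the real implementation would load the graph from the links table first.

```python
def pagerank(links: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Illustrative power iteration: links maps page -> list of out-link targets."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
        rank = new
    return rank

# Page 'a' links to 'b' and 'c'; both link back, so 'a' ranks highest.
scores = pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
print(max(scores, key=scores.get))  # a
```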
The ability to classify categories and tags in the full-text search could be useful for faceting and classification. "Auto discovery" of the taxonomies is a further idea.
Be happy!