
daitangio/find


So What?

Stop searching, start finding stuff


Find is a super-minimal search engine based on SQLite's Full Text Search capabilities and Python. It is composed of two commands: `crawl`, the crawler, and `findgui`, the web interface.

Features

  • Find supports caching of web pages (a lost Google feature) and de-duplication of pages with identical content. Back-link ranking tuning is in progress
  • Respects robots.txt

How to start

Create a virtualenv and install the project:

    python3 -m venv .venv
    . .venv/bin/activate
    pip install -e .

Run your first crawl:

    crawl --seed https://myhost.com --same-host

Run the web interface with:

    findgui

Why

I needed a small search engine for my static web site. I asked ChatGPT 5.2 to design it, then refined the code. The initial prompt was:

Design a small python web application to implement a search engine. 
The search must be performed on a SQLite database using 
the SQLite Full Text Search (FTS5) extension. 
Design the database model to be able to store simple html web pages.

Design principles

Find is a compact, zero-conf, tiny solution for adding a search engine to a pre-existing blog site. It just works out of the box.

As a basic rule I will try to keep it below 2000 lines of code.

The project accepts pull requests: please open one with a comment describing the change, and ensure it passes the pylint checks.

How

SQLite has a full-text search extension called FTS5, which also offers out-of-the-box stemming for English.
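A minimal sketch of FTS5 with English stemming, using Python's built-in sqlite3 module; the table and column names here are illustrative, not Find's actual schema:

```python
# Sketch: an FTS5 table with the porter stemmer, so inflected words
# ('searching') match their stem ('search'). Schema names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(url, body, tokenize='porter')"
)
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("https://example.com/post", "searching and finding indexed documents"),
)
# The porter tokenizer stems both the stored text and the query,
# so a MATCH for 'search' hits the document containing 'searching'.
rows = conn.execute(
    "SELECT url FROM pages WHERE pages MATCH 'search'"
).fetchall()
print(rows)  # [('https://example.com/post',)]
conn.close()
```

Note that FTS5 must be compiled into the SQLite build, which is the case for standard CPython distributions.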

For the crawler, ChatGPT proposed asynchronous I/O (the aiohttp and aiosqlite libraries), which is a very good way to scale it: downloading web pages is heavily I/O-bound and benefits from a non-blocking library.

The initial implementation had a locking problem: we solved it with a single-writer database task. SQLite is so fast that the writer queue is hard to saturate, so there is little tuning to do. To avoid data loss, I opted for a queue sized at 4x the concurrency level.
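The single-writer pattern can be sketched as below. This is a simplified illustration using the standard sqlite3 and asyncio modules (the real project uses aiosqlite), and all names are hypothetical:

```python
# Sketch: crawler tasks put rows on a bounded queue; exactly one task
# owns the SQLite connection, so there is no write-lock contention.
import asyncio
import sqlite3

CONCURRENCY = 4  # assumed crawler concurrency level

async def writer(queue, conn):
    # Sole owner of the connection: drains the queue and commits.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: shut down
            queue.task_done()
            break
        url, body = item
        conn.execute("INSERT INTO pages(url, body) VALUES (?, ?)", (url, body))
        conn.commit()
        queue.task_done()

async def main():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages(url TEXT, body TEXT)")
    # Queue sized 4x the concurrency level, as in the text, so a slow
    # writer applies back-pressure on fetchers instead of dropping pages.
    queue = asyncio.Queue(maxsize=4 * CONCURRENCY)
    task = asyncio.create_task(writer(queue, conn))
    for i in range(10):  # stand-in for concurrent fetcher tasks
        await queue.put((f"https://example.com/{i}", "body"))
    await queue.put(None)
    await task
    count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    conn.close()
    return count

print(asyncio.run(main()))  # 10
```

The bounded `maxsize` is what makes the design lossless: `queue.put` suspends fetchers when the writer falls behind, rather than discarding work.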

The crawler has a default delay to avoid overloading the target site. For this reason, high concurrency is pointless if your delay is high.

The overall project aims to be very compact (the "less is more" mantra).

Utility commands

reindex

The reindex command can be used to re-index the database.
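For context, FTS5 has a built-in `'rebuild'` command that repopulates an external-content index from its source table. This is a sketch of that mechanism with assumed table names (`pages`, `pages_fts`); how Find's reindex command actually works may differ:

```python
# Sketch: FTS5's special 'rebuild' command re-indexes an
# external-content table from scratch. Schema names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages(id INTEGER PRIMARY KEY, url TEXT, body TEXT)")
conn.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts5("
    "body, content='pages', content_rowid='id')"
)
conn.execute(
    "INSERT INTO pages(url, body) VALUES ('https://example.com', 'hello world')"
)
# Rows inserted directly into 'pages' are invisible to the FTS index
# until it is rebuilt.
conn.execute("INSERT INTO pages_fts(pages_fts) VALUES ('rebuild')")
hits = conn.execute(
    "SELECT rowid FROM pages_fts WHERE pages_fts MATCH 'hello'"
).fetchall()
print(hits)  # [(1,)]
conn.close()
```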

Change Log

0.0.3-TO BE

  • Better handling of search queries

0.0.2 Rate Limit

This version features rate limiting of crawler requests.

0.0.1 First implementation

  • Search + index core
  • docker compose

Next Step and Roadmap

  1. The links table is collected but not yet used in search. The idea is to use it to refine the PageRank. To get an idea, try:

    SELECT p.url, COUNT(*) AS out_links
    FROM links l JOIN pages p ON p.id = l.from_page_id
    GROUP BY p.id
    ORDER BY out_links DESC
    LIMIT 20;
  2. The ability to attach categories and tags to the full-text search could be useful for faceting and classification. "Auto-discovery" of the taxonomies is a further idea

  3. Docker compose and auto-index mode
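The roadmap query above counts *outgoing* links; for PageRank-style tuning, the inbound count (backlinks) is usually the raw signal of interest. A hedged sketch, assuming the links table also stores a `to_page_id` column (an assumption, not confirmed by the schema shown):

```python
# Sketch: rank pages by how many links point *to* them (in-links),
# the starting signal for a PageRank-style score. The to_page_id
# column and schema are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages(id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE links(from_page_id INTEGER, to_page_id INTEGER);
INSERT INTO pages(id, url) VALUES (1, 'https://example.com/a'),
                                  (2, 'https://example.com/b');
INSERT INTO links VALUES (1, 2), (1, 2), (2, 1);
""")
rows = conn.execute("""
    SELECT p.url, COUNT(*) AS in_links
    FROM links l JOIN pages p ON p.id = l.to_page_id
    GROUP BY p.id
    ORDER BY in_links DESC
    LIMIT 20
""").fetchall()
print(rows)  # [('https://example.com/b', 2), ('https://example.com/a', 1)]
conn.close()
```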

Be happy!
