A semantic search engine for machinery safety documents that combines vector similarity search with BM25 lexical search for improved retrieval accuracy.
```
semantic_search_engine/
├── api/
│   ├── __init__.py
│   ├── main.py                    # FastAPI application
│   └── qna_model.py               # Pydantic models for API
├── chunk_db/
│   ├── __init__.py
│   ├── chunk_data.py              # PDF processing and text chunking
│   ├── ingest_chunks.py           # Chunk storage and vector embedding
│   └── ingest_into_db.py          # Database ingestion pipeline
├── sourced_data/
│   └── *.pdf                      # Source PDF documents
├── hybrid_reranker/
│   ├── __init__.py
│   └── bm25_reranker.py           # BM25 + vector hybrid search
├── utils/
│   ├── __init__.py
│   ├── common_utils.py            # Commonly used utilities in this app
│   ├── download_source_data.py    # Data download script
│   ├── normalize_scores.py        # Score normalization utilities
│   └── retrieval_utils.py         # Answer formatting and citations
├── vector_db/
│   ├── __init__.py
│   └── baseline_search.py         # Vector similarity search
├── sources.json                   # Data source configuration
└── README.md
```
- Clone the repository:

```bash
git clone https://github.com/CodeStrate/semantic_search_engine.git
cd semantic_search_engine
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download the source data:

```bash
python utils/download_source_data.py
```

This downloads the PDF documents specified in sources.json into the sourced_data/ directory.
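The exact schema of sources.json isn't documented here; a minimal sketch of what an entry might look like, reusing the metadata fields that appear in the API response below (the actual format in the repository may differ):

```json
{
  "sources": [
    {
      "src_id": "src01",
      "title": "OSHA Guidelines",
      "url": "https://example.com/osha.pdf"
    }
  ]
}
```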
- Process and chunk the PDFs:

```bash
python -m chunk_db.ingest_into_db
```

This processes the PDFs, extracts text with OCR cleaning, and creates text chunks with metadata.
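The chunking follows LangChain's separator-and-overlap methodology (see Learnings below). A minimal sketch, assuming PyMuPDF for extraction and a `RecursiveCharacterTextSplitter` with illustrative sizes; the separators and chunk sizes actually tuned in chunk_db/chunk_data.py may differ:

```python
import fitz  # PyMuPDF
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_pdf(path: str, src_id: str) -> list[dict]:
    # Extract raw text page by page; OCR-based PDFs are noisy, so the
    # real pipeline also applies regex-based cleaning at this stage.
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)

    # Overlapping, separator-aware splits keep each chunk self-contained
    # enough to be retrieved on its own, since no generative model is used.
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " "],
        chunk_size=800,       # illustrative values, not the tuned ones
        chunk_overlap=120,
    )
    chunks = splitter.split_text(text)
    return [
        {"chunk_id": f"{src_id}-{i}", "src_id": src_id, "text": c}
        for i, c in enumerate(chunks)
    ]
```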
- Generate embeddings and build the vector store:

```bash
python -m chunk_db.ingest_chunks
```

This generates embeddings with all-MiniLM-L6-v2 (ChromaDB downloads the model on first use if it is not already available) and stores them in ChromaDB for vector search.
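A minimal sketch of this ingestion step; the collection name and persistence path are assumptions. Chroma's default embedding function is all-MiniLM-L6-v2, which is what triggers the download mentioned above:

```python
import chromadb

# Persistent client; with no embedding function specified, Chroma falls
# back to its all-MiniLM-L6-v2 default and fetches it on first use.
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("machinery_safety_chunks")

def ingest(chunks: list[dict]) -> None:
    collection.add(
        ids=[c["chunk_id"] for c in chunks],
        documents=[c["text"] for c in chunks],
        # Metadata carries the fields later surfaced as citations.
        metadatas=[
            {"src_id": c["src_id"], "title": c["title"], "url": c["url"]}
            for c in chunks
        ],
    )
```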
- Start the API server:

```bash
python api/main.py
```

The FastAPI server will start on http://localhost:8000.
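Since the server is started with `python api/main.py` rather than the uvicorn CLI, main.py presumably embeds the server itself. A sketch of that entrypoint, with everything beyond the port being an assumption:

```python
# api/main.py (entrypoint sketch)
import uvicorn
from fastapi import FastAPI

app = FastAPI(title="Machinery Safety Semantic Search")

if __name__ == "__main__":
    # Serves on http://localhost:8000 as documented below.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```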
POST /ask

Submit a query and get an answer with citations.

Request Body:

```json
{
  "query": "What is OSHA?",
  "k": 5,
  "mode": "baseline"
}
```

Parameters:

- query (string, required): the user's question
- k (integer, optional): number of chunks to retrieve (default: 5)
- mode (string, optional): search mode, "baseline" or "hybrid-bm25" (default: "baseline")
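These parameters correspond to the Pydantic models in api/qna_model.py. A hedged sketch of what they might look like; class and field names beyond the documented ones are assumptions:

```python
from typing import Literal
from pydantic import BaseModel

class AskRequest(BaseModel):
    query: str                                             # required user question
    k: int = 5                                             # chunks to retrieve
    mode: Literal["baseline", "hybrid-bm25"] = "baseline"  # search mode

class Citation(BaseModel):
    # One cited chunk, as returned in the "contexts" field below.
    chunk_id: str
    src_id: str
    title: str
    url: str
    score: float
```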
Response:

```json
{
  "answer": "OSHA is the Occupational Safety and Health Administration...",
  "contexts": [
    [
      {
        "chunk_id": "123",
        "src_id": "src01",
        "title": "OSHA Guidelines",
        "url": "https://example.com/osha.pdf",
        "score": 0.95
      }
    ],
    [0.95, 0.87, 0.82]
  ],
  "mode": "baseline"
}
```

Example: baseline vector search.

```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is OSHA?",
    "k": 5,
    "mode": "baseline"
  }'
```

Example: hybrid BM25 + vector search.

```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machinery safety regulations and compliance requirements",
    "k": 10,
    "mode": "hybrid-bm25"
  }'
```

Example: an off-topic query.

```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "hello how are you today?",
    "k": 5,
    "mode": "baseline"
  }'
```

Test individual components or debug snippets:
```bash
# Test baseline search
python -m vector_db.baseline_search

# Test hybrid reranking
python -m hybrid_reranker.bm25_reranker
```

Each component is independently testable and can be run as a module with Python's -m flag.
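The hybrid mode combines scores from two retrievers that live on different scales, which is why utils/normalize_scores.py exists. A minimal sketch of one common fusion scheme, min-max normalization plus a weighted sum; the weight and the exact scheme here are illustrative assumptions, not necessarily what bm25_reranker.py does:

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}

def fuse(vector_scores: dict[str, float],
         bm25_scores: dict[str, float],
         alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores, keyed by chunk_id; alpha is illustrative."""
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    fused = {
        cid: alpha * v.get(cid, 0.0) + (1 - alpha) * b.get(cid, 0.0)
        for cid in set(v) | set(b)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```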
While the pipeline seems easy (just chunk, embed, and retrieve), it isn't: a lot of time went into tuning the chunking functionality. The PDFs are OCR-based, so text extraction is never perfect, and given those limitations I had to run many test_chunking iterations just to find good chunk boundaries. That is also why I chose LangChain's chunking methodology, with separators and overlaps, to make each chunk context-aware on its own, since a generative model can't be used here. It works well, but some tricky queries can still stump the retriever, and the chunking needs more testing and time. Some harder queries didn't return an answer at all because of my abstain filter, whose threshold I had to increase.
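The abstain filter mentioned above is, in effect, a score threshold. A sketch under that assumption; the cutoff value and function name are purely illustrative:

```python
ABSTAIN_THRESHOLD = 0.35  # illustrative value; this had to be raised during testing

def answer_or_abstain(ranked: list[tuple[str, float]]) -> tuple[str, float] | None:
    # If even the best-scoring chunk falls below the cutoff, return no
    # answer rather than an irrelevant citation.
    if not ranked or ranked[0][1] < ABSTAIN_THRESHOLD:
        return None
    return ranked[0]
```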
If generation were allowed, retrieval_utils and the regex-based OCR cleaning would not be required at all. This project taught me the power of generative LLMs and how they can seemingly do anything they're told. Since this use case doesn't require a paid API, I'm confident a 1B-parameter Ollama or Hugging Face model would suffice. I also finally learned how to add citations to retrieval results through metadata (from the Chroma docs). While rank-bm25 is the most widely used BM25 implementation, I came across bm25s in a Hugging Face blog post and was fascinated by how lightweight it is. To keep the project lightweight I also chose not to use many NLP libraries, which may have affected result quality. Chroma and PyMuPDF were personal choices, given my previous experience with them. FAISS could definitely have shaved off some weight, but it would have required SentenceTransformers, which depends on torch and other heavy dependencies; it didn't seem worth it.