A powerful semantic search engine built with FastAPI and VictorDB, featuring multilingual support and vector-based document retrieval.
This semantic search system allows you to:
- Ingest documents with metadata and automatically chunk them for optimal retrieval
- Perform semantic searches using natural language queries across multiple languages
- Retrieve similar documents based on meaning rather than just keyword matching
- Manage documents with full CRUD operations via REST API
The system uses state-of-the-art sentence transformers for embedding generation and VictorDB for efficient vector storage and retrieval.
- 🔍 Semantic Search: Advanced similarity search using sentence transformers
- 🌍 Multilingual Support: Built-in support for multiple languages
- 📄 Document Management: Full CRUD operations for documents
- 🚀 High Performance: Optimized chunking and vector indexing
- 🔧 RESTful API: Complete FastAPI-based REST interface
- ⚙️ Configurable: Environment-based configuration system
- 🐳 Production Ready: Built for scalability and reliability
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   FastAPI       │    │   Sentence       │    │   VictorDB      │
│   REST API      │───▶│   Transformers   │───▶│   Vector Store  │
│                 │    │   (Embeddings)   │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Document      │    │   Text           │    │   Vector        │
│   Storage       │    │   Chunking       │    │   Indexing      │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
- Python: 3.9+
- VictorDB: Running instance with table and index services
- System Memory: Minimum 2GB RAM (4GB+ recommended)
- Storage: Depends on document volume
git clone https://github.com/emilianobilli/SemanticSearch
cd SemanticSearch
python3 -m venv env
source env/bin/activate # On macOS/Linux
# or
env\Scripts\activate # On Windows
pip install -r requirements-api.txt

You need to have VictorDB running with both table and index services. Please refer to the VictorDB documentation for installation and setup instructions.
Typical VictorDB setup:
# Start table service
victord table --socket /var/lib/victord/semantic_table/socket.unix
# Start index service
victord index --socket /var/lib/victord/semantic_index/socket.unix

Copy the example environment file and adjust the settings:
cp .env.example .env # If you have an example file
# or create .env manually

Edit .env with your configuration:
# VictorDB Configuration
SEMANTIC_VICTOR_TABLE_SOCKET=/var/lib/victord/semantic_table/socket.unix
SEMANTIC_VICTOR_INDEX_SOCKET=/var/lib/victord/semantic_index/socket.unix
# API Configuration
SEMANTIC_API_HOST=0.0.0.0
SEMANTIC_API_PORT=8000
SEMANTIC_API_RELOAD=true
SEMANTIC_API_LOG_LEVEL=info
# CORS Configuration
SEMANTIC_CORS_ORIGINS=*
# Search Configuration
SEMANTIC_MAX_SEARCH_RESULTS=50
SEMANTIC_DEFAULT_SEARCH_RESULTS=10

python api.py

The API will be available at http://localhost:8000 with interactive documentation at http://localhost:8000/docs.
GET /health

POST /document
Content-Type: application/json
{
  "title": "Document Title",
  "author": "Author Name",
  "source": "https://example.com/article",
  "published_at": "2024-01-01",
  "raw_text": "Document content here...",
  "metadata": ["tag1", "tag2"]
}

POST /documents
Content-Type: application/json
[
  {
    "title": "Document 1",
    "raw_text": "Content 1...",
    ...
  },
  {
    "title": "Document 2",
    "raw_text": "Content 2...",
    ...
  }
]

GET /search?q=your search query&top=10

GET /documents/{document_id}

DELETE /documents/{document_id}

from search import SemanticSearch
from model import Document
from victordb import VictorTableClient, VictorIndexClient, VictorSession
# Initialize connections
table = VictorTableClient()
table.connect(unix_path="/path/to/table/socket.unix")
index = VictorIndexClient()
index.connect(unix_path="/path/to/index/socket.unix")
session = VictorSession(table)
semantic_search = SemanticSearch(session, index)
# Create and ingest a document
document = Document(
    title="My Document",
    author="Author Name",
    raw_text="Document content here...",
    metadata=["tag1", "tag2"]
)
success = semantic_search.ingest(document=document)
# Search for documents
results = semantic_search.search(query="search query", top=10)
# Get specific document
doc = semantic_search.retrieve(document_id=1)
# Delete document
success = semantic_search.delete(document_id=1)

For testing and exploration, you can use the interactive search script:
python feed.py

This will start an interactive session where you can type queries and see results immediately.
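The REST endpoints listed earlier can also be called from a script. The sketch below only builds the requests, using the standard library, so it runs without a server. It assumes the API is reachable at the default `http://localhost:8000`; the helper names `search_url` and `document_request` are hypothetical and not part of this project. Note that a query containing spaces has to be URL-encoded before it is sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # default address mentioned in this README


def search_url(query: str, top: int = 10, base_url: str = BASE_URL) -> str:
    """Build a GET /search URL with the query text URL-encoded."""
    return f"{base_url}/search?{urlencode({'q': query, 'top': top})}"


def document_request(document: dict, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST /document request with a JSON body."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        f"{base_url}/document",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(search_url("your search query"))
```

To actually send a request, pass the URL or the `Request` object to `urllib.request.urlopen` while the API is running.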
All configuration is handled through environment variables. Key settings include:
| Variable | Default | Description |
|---|---|---|
| SEMANTIC_VICTOR_TABLE_SOCKET | /tmp/victor_default_table.sock | VictorDB table service socket |
| SEMANTIC_VICTOR_INDEX_SOCKET | /tmp/victor_default_index.sock | VictorDB index service socket |
| SEMANTIC_API_HOST | 0.0.0.0 | API server host |
| SEMANTIC_API_PORT | 8000 | API server port |
| SEMANTIC_MAX_SEARCH_RESULTS | 50 | Maximum search results |
| SEMANTIC_DEFAULT_SEARCH_RESULTS | 10 | Default search results |
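As an illustration of how these variables and defaults fit together, here is a minimal sketch that reads them from the environment. It is not the project's `settings.py`, whose implementation may differ. The `Settings` class, `load_settings`, and `resolve_top` are hypothetical names, and the clamping of the requested result count to the maximum is an assumption about how the two search limits interact.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    table_socket: str
    index_socket: str
    api_host: str
    api_port: int
    max_search_results: int
    default_search_results: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read SEMANTIC_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        table_socket=env.get("SEMANTIC_VICTOR_TABLE_SOCKET", "/tmp/victor_default_table.sock"),
        index_socket=env.get("SEMANTIC_VICTOR_INDEX_SOCKET", "/tmp/victor_default_index.sock"),
        api_host=env.get("SEMANTIC_API_HOST", "0.0.0.0"),
        api_port=int(env.get("SEMANTIC_API_PORT", "8000")),
        max_search_results=int(env.get("SEMANTIC_MAX_SEARCH_RESULTS", "50")),
        default_search_results=int(env.get("SEMANTIC_DEFAULT_SEARCH_RESULTS", "10")),
    )


def resolve_top(requested: Optional[int], settings: Settings) -> int:
    """Assumed behaviour: use the default when unset, never exceed the maximum."""
    if requested is None:
        return settings.default_search_results
    return max(1, min(requested, settings.max_search_results))
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.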
The system uses intelligent text chunking optimized for semantic search:
- Target Length: ~256 tokens per chunk
- Overlap: 40 tokens between chunks
- Model: Designed for paraphrase-multilingual-MiniLM-L12-v2
- Title Prepending: Automatically prepends document title to chunks for better context
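The chunking parameters above can be pictured as a sliding window. The sketch below is not the code in `embed.py`; it is a simplified stand-in that counts whitespace-separated words instead of model tokens, and the function name `chunk_text` and the title/newline layout are assumptions made for illustration.

```python
from typing import List


def chunk_text(title: str, raw_text: str, target_len: int = 256, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows and prepend the title to each one.

    Whitespace-separated words stand in for model tokens (an assumption).
    """
    if not 0 <= overlap < target_len:
        raise ValueError("overlap must be non-negative and smaller than target_len")

    tokens = raw_text.split()
    step = target_len - overlap  # each window starts `step` tokens after the previous one
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + target_len]
        chunks.append(f"{title}\n{' '.join(window)}")
        if start + target_len >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With the values listed above, consecutive chunks share 40 tokens, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is short enough.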
- Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Dimensions: 384
- Languages: Supports 50+ languages
- Normalization: L2 normalized embeddings for consistent similarity scoring
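To see why L2 normalization gives consistent similarity scores: once every vector has unit length, the dot product of two vectors equals their cosine similarity, so scores stay in the range -1 to 1 regardless of the original magnitudes. The following sketch shows this with plain Python lists; the helper names are hypothetical and the project's own normalization may be done inside the embedding library instead.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = l2_normalize([1.0, 0.0])
    b = l2_normalize([2.0, 2.0])  # same direction as [1, 1], different magnitude
    print(dot(a, b))
```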
FROM python:3.9-slim
WORKDIR /app
COPY requirements-api.txt .
RUN pip install -r requirements-api.txt
COPY . .
EXPOSE 8000
CMD ["python", "api.py"]

SEMANTIC_API_RELOAD=false
SEMANTIC_API_LOG_LEVEL=warning
SEMANTIC_CORS_ORIGINS=https://yourdomain.com,https://app.yourdomain.com

- VictorDB: Can be scaled horizontally across multiple nodes
- API: Stateless design allows for easy horizontal scaling
- Memory: Embedding model requires ~1GB RAM per instance
- Storage: Vector index size depends on document volume
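For a rough sense of how storage grows with document volume, the raw vector data can be estimated from the 384 dimensions stated above. This sketch assumes 4-byte float32 values and ignores any index structures or metadata that VictorDB adds on top, so treat the result as a lower bound rather than a measured figure.

```python
DIMENSIONS = 384       # embedding size stated in this README
BYTES_PER_VALUE = 4    # assumption: vectors stored as float32


def raw_vector_bytes(num_chunks: int,
                     dimensions: int = DIMENSIONS,
                     bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes needed for the raw embeddings alone (no index overhead)."""
    return num_chunks * dimensions * bytes_per_value


if __name__ == "__main__":
    for chunks in (10_000, 1_000_000):
        size_mib = raw_vector_bytes(chunks) / (1024 ** 2)
        print(f"{chunks} chunks: about {size_mib:.1f} MiB of raw vectors")
```

Under these assumptions each chunk costs 1,536 bytes, so the number of chunks per document (driven by chunk length and overlap) is the main factor.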
SemanticSearch/
├── api.py # FastAPI application and endpoints
├── search.py # Core semantic search logic
├── model.py # Data models (Document, DocumentChunk)
├── embed.py # Text embedding and chunking
├── settings.py # Configuration management
├── feed.py # Interactive CLI and utilities
├── requirements-api.txt # Python dependencies
├── .env # Environment configuration
└── README.md # This file
# Install test dependencies
pip install pytest pytest-asyncio
# Run tests
pytest

# Install development dependencies
pip install black isort flake8
# Format code
black .
isort .
# Check style
flake8 .

- VictorDB Connection Failed
  - Ensure VictorDB services are running
  - Check socket paths in configuration
  - Verify file permissions
- Out of Memory
  - Reduce chunk size or batch size
  - Consider using a smaller embedding model
  - Monitor memory usage during ingestion
- Slow Search Performance
  - Check VictorDB index configuration
  - Consider reducing search result limits
  - Monitor vector index size
- Import Errors
  - Ensure all dependencies are installed
  - Check Python version compatibility
  - Verify virtual environment activation
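For the connection checks in particular, a short script can narrow down whether a socket path is missing, points at something that is not a socket, or is blocked by permissions. This is a generic sketch using only the standard library on a Unix-like system; `check_unix_socket` is a hypothetical helper, and it does not talk to VictorDB itself, so a result of "ok" only means the path looks usable.

```python
import os
import stat


def check_unix_socket(path: str) -> str:
    """Classify a socket path as 'missing', 'not a socket', 'no permission', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "not a socket"
    if not os.access(path, os.R_OK | os.W_OK):
        return "no permission"
    return "ok"


if __name__ == "__main__":
    for variable in ("SEMANTIC_VICTOR_TABLE_SOCKET", "SEMANTIC_VICTOR_INDEX_SOCKET"):
        socket_path = os.environ.get(variable)
        if socket_path is None:
            print(f"{variable}: not set")
        else:
            print(f"{variable}: {check_unix_socket(socket_path)}")
```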
The application logs important events. To see detailed logs:
SEMANTIC_API_LOG_LEVEL=debug python api.py

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
For issues and questions:
- Create an issue in the repository
- Check the troubleshooting section
- Review VictorDB documentation for database-related issues
Built with ❤️ using FastAPI, VictorDB, and Sentence Transformers.