A semantic image search engine powered by OpenAI's CLIP (Contrastive Language-Image Pre-training) model and ChromaDB vector database. Search through your image collection using natural language queries.
This system enables intelligent image retrieval by understanding the semantic meaning of text queries. Unlike traditional keyword-based search, it comprehends concepts and finds visually relevant images even when exact keywords don't match.
Key Features:
- Semantic search using natural language descriptions
- Fast vector similarity search with ChromaDB
- Multiple interfaces: CLI, REST API, and Web UI
- Persistent vector database for instant queries
- Support for multiple CLIP model variants
- Batch processing for efficient encoding
- Image Encoding: Images are encoded into 512-dimensional vectors using CLIP's vision encoder
- Text Encoding: Search queries are encoded into the same vector space using CLIP's text encoder
- Similarity Search: ChromaDB performs fast nearest-neighbor search to find matching images
- Results: Returns images ranked by semantic similarity to the query (a code sketch of this pipeline follows the diagram below)
┌─────────────┐
│ Images │
└──────┬──────┘
│
▼
┌─────────────────┐
│ CLIP Encoder │ (Vision Model)
└────────┬────────┘
│
▼
┌─────────────┐
│ ChromaDB │ (Vector Storage)
└─────┬───────┘
│
▼
┌──────────────────┐
│ Text Query │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ CLIP Encoder │ (Text Model)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Similarity Search│
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Results │
└──────────────────┘
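The flow above can be sketched in a few lines of Python. This is not the project's own code (image_retrieval.py implements the actual engine); the image path and query string are placeholders, while the model name, database path, and collection name mirror the defaults in config.py.

import clip
import torch
import chromadb
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)          # 512-dimensional embeddings
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("image_embeddings", metadata={"hnsw:space": "cosine"})

# 1) Encode an image and store it (the filename doubles as the record id; path is a placeholder)
image = preprocess(Image.open("dataset/example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_emb = model.encode_image(image)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)            # normalize for cosine similarity
collection.add(ids=["example.jpg"], embeddings=img_emb.cpu().tolist())

# 2) Encode a text query into the same space and search
tokens = clip.tokenize(["a dog playing in the park"]).to(device)
with torch.no_grad():
    txt_emb = model.encode_text(tokens)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
hits = collection.query(query_embeddings=txt_emb.cpu().tolist(), n_results=5)
print(hits["ids"][0], hits["distances"][0])                       # ranked ids and cosine distances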
- Python 3.8 or higher
- CUDA-compatible GPU (optional, for faster processing)
- Clone the repository:
git clone https://github.com/hz01/Text-to-Image-Retrieval.git
cd Text-to-Image-Retrieval
- Install dependencies:
pip install -r requirements.txt
- Create the dataset folder:
mkdir dataset
- Add your images to the dataset folder (supports: .jpg, .jpeg, .png, .bmp, .webp)
Before searching, you must encode your images and store them in the vector database:
python embed_images.py
This process:
- Loads all images from the dataset folder
- Encodes them using the CLIP model
- Stores embeddings in ChromaDB at ./chroma_db
- Only needs to be run once (or when adding new images)
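For reference, a batched version of the encoding step could look roughly like the sketch below. It is not a copy of embed_images.py; the folder, model, batch size, and collection settings mirror the defaults in config.py.

import os
import clip
import torch
import chromadb
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection(
    "image_embeddings", metadata={"hnsw:space": "cosine"})

files = [f for f in os.listdir("dataset")
         if f.lower().endswith((".jpg", ".jpeg", ".png", ".bmp", ".webp"))]

BATCH_SIZE = 32                                                    # reduce if you run out of memory
for start in range(0, len(files), BATCH_SIZE):
    batch = files[start:start + BATCH_SIZE]
    images = torch.stack([preprocess(Image.open(os.path.join("dataset", f)).convert("RGB"))
                          for f in batch]).to(device)
    with torch.no_grad():
        emb = model.encode_image(images)
    emb = emb / emb.norm(dim=-1, keepdim=True)                     # cosine-normalized
    collection.add(ids=batch, embeddings=emb.cpu().tolist())       # filenames double as ids
print(f"Embedded {len(files)} images")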
You have three options for searching:
python main.py
Interactive terminal interface with visualization:
- Enter natural language queries
- View results in matplotlib window
- Results saved to the results folder
- Start the API server:
python api.py
- Open index.html in your browser
Features:
- Beautiful, modern UI
- Real-time search
- Adjustable result count
- Image preview modal
- Similarity scores
Start the server:
python api.py
API Endpoints:
Search Images
POST http://localhost:8000/search
Content-Type: application/json
{
"query": "a dog playing in the park",
"top_k": 5
}
Health Check
GET http://localhost:8000/
Response format:
{
"query": "a dog playing in the park",
"results": [
{
"image": "data:image/jpeg;base64,...",
"score": 0.85
}
],
"total_results": 5
}
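For programmatic access, the search endpoint can also be called from Python. Below is a minimal sketch using the requests library, assuming the server from api.py is running locally on port 8000 and matching the request/response shapes shown above.

import requests

# Assumes the Flask server from api.py is already running on localhost:8000
resp = requests.post(
    "http://localhost:8000/search",
    json={"query": "a dog playing in the park", "top_k": 5},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
for i, hit in enumerate(data["results"], start=1):
    print(f"{i}. score={hit['score']:.2f}")    # hit["image"] holds a base64 data URI

Edit config.py to customize settings: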
# Paths
DATASET_FOLDER = "dataset" # Your images location
CHROMA_DB_PATH = "./chroma_db" # Vector database path
RESULTS_FOLDER = "results" # Search results output
# Model Configuration
MODEL_NAME = "ViT-B/32" # CLIP model variant
# Options:
# - "ViT-L/14" → Best quality, slower
# - "ViT-B/32" → Balanced (default)
# - "ViT-B/16" → Good quality
# - "RN50" → Fastest
# Processing
BATCH_SIZE = 32 # Reduce if out of memory
TOP_K = 5 # Default results count
# Database
COLLECTION_NAME = "image_embeddings"
WIPE_DB_ON_EMBED = True         # Clear DB before embedding
Text-to-Image-Retrieval/
├── image_retrieval.py # Core retrieval engine class
├── config.py # Configuration settings
├── embed_images.py # Image embedding script
├── main.py # CLI interface
├── api.py # Flask REST API server
├── index.html # Web UI
├── requirements.txt # Python dependencies
├── dataset/ # Your images (create this)
├── chroma_db/ # Vector database (auto-created)
└── results/ # Search results (auto-created)
CLIP (Contrastive Language-Image Pre-training) is a neural network trained on 400 million image-text pairs. It learns to map images and text into a shared embedding space where semantically similar concepts have similar vector representations.
Key Capabilities:
- Zero-shot learning: understands concepts without explicit training
- Multimodal: processes both images and text
- Semantic understanding: matches concepts, not just keywords
- Transfer learning: generalizes across domains
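To make the shared embedding space concrete, the sketch below scores one image against a few free-form descriptions using the openai/CLIP package; the image path and prompts are placeholders, not part of this project.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dataset/example.jpg")).unsqueeze(0).to(device)   # placeholder path
prompts = ["a dog playing in the park", "a city skyline at night", "a plate of food"]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, tokens)         # scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for prompt, p in zip(prompts, probs.tolist()):
    print(f"{p:.2f}  {prompt}")                        # highest score = best matching description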
ChromaDB is an open-source vector database optimized for:
- Fast similarity search using HNSW (Hierarchical Navigable Small World) algorithm
- Persistent storage on disk
- Cosine similarity for normalized embeddings
- Efficient batch operations
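In code, those properties correspond to a handful of ChromaDB calls. A minimal sketch follows; the path and collection name match config.py, and the embedding values are stand-ins for real CLIP vectors.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")             # persisted on disk
collection = client.get_or_create_collection(
    name="image_embeddings",
    metadata={"hnsw:space": "cosine"},                              # HNSW index with cosine distance
)

# Batch insert: ids, embeddings (and optional metadatas) are parallel lists
collection.add(
    ids=["img_001.jpg", "img_002.jpg"],
    embeddings=[[0.1] * 512, [0.2] * 512],                          # stand-ins for CLIP vectors
    metadatas=[{"path": "dataset/img_001.jpg"}, {"path": "dataset/img_002.jpg"}],
)

# Nearest-neighbor query; distances are cosine distances (lower = more similar)
result = collection.query(query_embeddings=[[0.1] * 512], n_results=2)
print(result["ids"][0], result["distances"][0])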
Search fails because no embeddings exist yet
Solution: Run python embed_images.py first to create the database
Out-of-memory errors during embedding
Solution: Reduce BATCH_SIZE in config.py (try 16 or 8)
Poor or irrelevant search results
Solutions:
- Try a different CLIP model (ViT-L/14 for better quality)
- Ensure images are relevant to your queries
- Use more descriptive queries
API server fails to start
Solution: Check if port 8000 is available or change the port in api.py
Core libraries:
- torch - PyTorch deep learning framework
- clip - OpenAI's CLIP model
- chromadb - Vector database
- flask - REST API server
- Pillow - Image processing
- matplotlib - Visualization
See requirements.txt for complete list.
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses OpenAI's CLIP model, which is also under the MIT License. See the CLIP repository for more information.
- OpenAI for the CLIP model
- ChromaDB team for the vector database
- PyTorch community
Built with CLIP + ChromaDB for intelligent image retrieval