diff --git a/README.md b/README.md index 8ff2c40..f77769e 100644 --- a/README.md +++ b/README.md @@ -9,19 +9,19 @@ CrossVector provides a consistent, high-level API across multiple vector databases (AstraDB, ChromaDB, Milvus, PgVector) and embedding providers (OpenAI, Gemini), allowing you to switch between backends without rewriting your application code. -## ๐ŸŽฏ Recommended Backends +## Recommended Backends Based on our comprehensive benchmarking, we recommend: ### **For Production:** -- **๐Ÿฅ‡ ChromaDB Cloud** - Best for cloud deployments +- **ChromaDB Cloud** - Best for cloud deployments - Hosted solution with excellent performance - Easy setup and management - Built-in scaling and backups - Good for: SaaS applications, MVPs, rapid prototyping -- **๐Ÿฅˆ PgVector** - Best for self-hosted/on-premise +- **PgVector** - Best for self-hosted/on-premise - Excellent performance (6-10 docs/sec bulk insert) - Very fast metadata queries (<1ms) - PostgreSQL reliability and ecosystem @@ -38,14 +38,21 @@ See our [benchmarking guide](docs/benchmarking.md) for detailed performance comp | Backend | Embedding | Model | Dim | Upsert | Search (avg) | Update (avg) | Delete (batch) | Status | |---------|-----------|-------|-----|--------|--------------|--------------|----------------|--------| -| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms | โœ… | -| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s | โœ… | -| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s | โœ… | -| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s | โœ… | -| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms | โœ… | -| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s | โœ… | -| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s | โœ… | -| chroma 
| gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s | โœ… | +| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms | Yes | +| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s | Yes | +| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s | Yes | +| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s | Yes | +| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms | Yes | +| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s | Yes | +| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s | Yes | +| chroma | gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s | Yes | + +> **Important Benchmark Notes:** +> +> - **PgVector**: Benchmarks run against a **local PostgreSQL instance**, providing optimal latency. For fair comparison with cloud backends, ensure PgVector is deployed in the **same region and network environment**. +> - **Cloud Backends** (AstraDB, Milvus, ChromaDB): Results are affected by **network latency** and **regional proximity**. Cloud-hosted PgVector will have different performance characteristics depending on region, network conditions, and infrastructure proximity. +> - **Recommendations**: When comparing results, ensure all backends are deployed in the **same region** and **similar network conditions** for objective evaluation. +> - For production deployments, conduct benchmarks in your **actual production environment** with real network conditions. Full results: [`benchmark.md`](benchmark.md). @@ -53,42 +60,42 @@ Full results: [`benchmark.md`](benchmark.md). 
## Features -### 🔌 Pluggable Architecture +### Pluggable Architecture - **4 Vector Databases**: AstraDB, ChromaDB, Milvus, PgVector - **2 Embedding Providers**: OpenAI, Gemini - Switch backends without code changes - Lazy initialization pattern for optimal resource usage -### 🎯 Unified API +### Unified API - Consistent interface across all adapters - Django-style `get`, `get_or_create`, `update_or_create` semantics - Flexible document input formats: `str`, `dict`, or `VectorDocument` - Standardized error handling with contextual exceptions -### 🔍 Advanced Querying +### Advanced Querying - **Query DSL**: Type-safe filter composition with `Q` objects - **Universal operators**: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin` - **Nested metadata**: Dot-notation paths for hierarchical data - **Metadata-only search**: Query without vector similarity (where supported) -### 🚀 Performance Optimized +### Performance Optimized - Automatic batch embedding generation - Bulk operations: `bulk_create`, `bulk_update`, `upsert` - Configurable batch sizes and conflict resolution - Lazy client initialization for faster startup -### 🛡️ Type-Safe & Validated +### Type-Safe & Validated - Full Pydantic v2 validation - Structured exceptions with detailed context - Centralized logging with configurable levels - Explicit configuration validation with helpful error messages -### ⚙️ Flexible Configuration +### Flexible Configuration - Environment variable support via `.env` - Multiple primary key strategies: UUID, hash-based, int64, custom @@ -142,7 +149,7 @@ pip install crossvector[astradb,all-embeddings] ## Quick Start -> 💡 **Recommended**: Use `GeminiEmbeddingAdapter` for most use cases - free tier, faster search (1.5x), smaller vectors (768 vs 1536 dims). See [benchmarks](benchmark.md) for details. +> **Recommended**: Use `GeminiEmbeddingAdapter` for most use cases - free tier, faster search (1.5x). See [benchmarks](benchmark.md) for details. 
### Basic Usage @@ -153,7 +160,7 @@ from crossvector.dbs.pgvector import PgVectorAdapter # Initialize engine with Gemini (recommended: free tier, fast performance) engine = VectorEngine( - embedding=GeminiEmbeddingAdapter(), # Free tier, 1536-dim vectors + embedding=GeminiEmbeddingAdapter(), # Free tier, 1536-dim vectors (default) db=PgVectorAdapter(), collection_name="my_documents", store_text=True @@ -476,11 +483,11 @@ Different backends have varying feature support: | Feature | AstraDB | ChromaDB | Milvus | PgVector | |---------|---------|----------|--------|----------| -| Vector Search | ✅ | ✅ | ✅ | ✅ | -| Metadata-Only Search | ✅ | ✅ | ✅ | ✅ | -| Nested Metadata | ✅ | ✅ | ✅ | ✅ | -| Numeric Comparisons | ✅ | ✅ | ✅ | ✅ | -| Text Storage | ✅ | ✅ | ✅ | ✅ | +| Vector Search | Yes | Yes | Yes | Yes | +| Metadata-Only Search | Yes | Yes | Yes | Yes | +| Nested Metadata | Yes | Yes | Yes | Yes | +| Numeric Comparisons | Yes | Yes | Yes | Yes | +| Text Storage | Yes | Yes | Yes | Yes | *ChromaDB supports nested metadata via dot-notation when metadata is flattened. @@ -562,7 +569,7 @@ engine = VectorEngine(embedding=embedding, db=db) ## Embedding Providers -> 💡 **Recommended**: Start with **Gemini** for free tier and faster performance. See [benchmark comparison](benchmark.md). +> **Recommended**: Start with **Gemini** for free tier and faster performance. See [benchmark comparison](benchmark.md). 
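Both providers sit behind the same small adapter surface: `get_embeddings` plus a declared dimension, which is why they can be swapped without touching application code. As an illustration only — `FakeEmbeddingAdapter` below is a hypothetical stand-in, not a real crossvector class — the contract looks roughly like:

```python
from typing import List


class FakeEmbeddingAdapter:
    """Illustrative stand-in for an embedding provider (not part of crossvector)."""

    def __init__(self, model_name: str = "fake-model", dim: int = 768):
        self.model_name = model_name
        self.dim = dim

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Deterministic placeholder vectors; a real adapter would call the provider API.
        return [[float(len(text) % 7)] * self.dim for text in texts]


adapter = FakeEmbeddingAdapter(dim=768)
vectors = adapter.get_embeddings(["hello", "vector search"])
assert len(vectors) == 2 and all(len(v) == 768 for v in vectors)
```

Any object exposing this shape (a constructor that fixes `dim`, and `get_embeddings` returning one fixed-length vector per input) can be handed to the engine in place of the built-in adapters.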
### Gemini (Recommended) @@ -577,10 +584,10 @@ embedding = GeminiEmbeddingAdapter(model_name="models/text-embedding-004", dim=7 ``` **Why Choose Gemini:** -- ✅ **Free tier**: 1,500 requests/min (vs OpenAI paid only) -- ✅ **Faster search**: 234ms avg (1.5x faster than OpenAI) -- ✅ **Efficient**: 768 dims = 50% less storage than OpenAI -- ✅ **Quality**: Comparable accuracy to OpenAI +- **Free tier**: 1,500 requests/min (vs OpenAI paid only) +- **Faster search**: 234ms avg (1.5x faster than OpenAI) +- **Flexible dims**: 768, 1536, or 3072 with gemini-embedding-001 +- **Quality**: Comparable accuracy to OpenAI **Configuration:** ```bash GEMINI_API_KEY=AI... # Get free key at https://makersuite.google.com/app/apikey ``` **Supported Models:** -- `gemini-embedding-001` (768 dims, **recommended**) -- `models/text-embedding-004` (768 dims) +- `gemini-embedding-001` (1536 dims default, supports 768/1536/3072, **recommended**) +- `text-embedding-005` (768 dims, English and code) +- `text-multilingual-embedding-002` (768 dims, multilingual) +- `text-embedding-004` (768 dims, legacy English) ### OpenAI (Alternative) @@ -604,9 +613,9 @@ embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-3-large") ``` **When to Use OpenAI:** -- ✅ Need 1536 or 3072 dimensions -- ✅ Already have OpenAI API budget -- ✅ Prefer OpenAI ecosystem integration +- Need 1536 or 3072 dimensions +- Already have OpenAI API budget +- Prefer OpenAI ecosystem integration **Configuration:** ```bash OPENAI_API_KEY=sk-... 
# Paid API key from https://platform.openai.com - `text-embedding-3-large` (3072 dims) - `text-embedding-ada-002` (1536 dims, legacy) -- `gemini-embedding-001` (1536 dims, default) -- `text-embedding-005` (768 dims) -- `text-embedding-004` (768 dims, legacy) - --- ## Error Handling @@ -809,14 +814,14 @@ The benchmark tool measures performance across 7 key operations: ```markdown | Backend | Embedding | Model | Dim | Bulk Create | Search (avg) | Update (avg) | Delete (batch) | Status | |---------|-----------|-------|-----|-------------|--------------|--------------|----------------|--------| -| pgvector | openai | text-embedding-3-small | 1536 | 2.68s | 515.47ms | 6.48ms | 1.76ms | โœ… | -| astradb | openai | text-embedding-3-small | 1536 | 32.56s | 1.09s | 875.63ms | 1.44s | โœ… | -| milvus | openai | text-embedding-3-small | 1536 | 21.24s | 1.04s | 551.36ms | 180.25ms | โœ… | -| chroma | openai | text-embedding-3-small | 1536 | 36.08s | 900.75ms | 2.51s | 521.35ms | โœ… | -| pgvector | gemini | models/gemini-embedding-001 | 1536 | 31.50s | 65.29ms | 6.14ms | 1.78ms | โœ… | -| astradb | gemini | models/gemini-embedding-001 | 1536 | 1m 2.65s | 882.48ms | 818.93ms | 1.44s | โœ… | -| milvus | gemini | models/gemini-embedding-001 | 1536 | 50.26s | 835.50ms | 572.62ms | 224.16ms | โœ… | -| chroma | gemini | models/gemini-embedding-001 | 1536 | 1m 3.39s | 628.08ms | 3.16s | 394.21ms | โœ… | +| pgvector | openai | text-embedding-3-small | 1536 | 2.68s | 515.47ms | 6.48ms | 1.76ms | Yes | +| astradb | openai | text-embedding-3-small | 1536 | 32.56s | 1.09s | 875.63ms | 1.44s | Yes | +| milvus | openai | text-embedding-3-small | 1536 | 21.24s | 1.04s | 551.36ms | 180.25ms | Yes | +| chroma | openai | text-embedding-3-small | 1536 | 36.08s | 900.75ms | 2.51s | 521.35ms | Yes | +| pgvector | gemini | models/gemini-embedding-001 | 1536 | 31.50s | 65.29ms | 6.14ms | 1.78ms | Yes | +| astradb | gemini | models/gemini-embedding-001 | 1536 | 1m 2.65s | 882.48ms | 818.93ms | 1.44s 
| Yes | +| milvus | gemini | models/gemini-embedding-001 | 1536 | 50.26s | 835.50ms | 572.62ms | 224.16ms | Yes | +| chroma | gemini | models/gemini-embedding-001 | 1536 | 1m 3.39s | 628.08ms | 3.16s | 394.21ms | Yes | ``` ### Requirements @@ -859,7 +864,7 @@ Results are saved to `benchmark.md` (or custom path) with: **Example output:** ``` -📄 Markdown report saved to: benchmark.md +Markdown report saved to: benchmark.md ``` See [benchmarking documentation](docs/benchmarking.md) for more details. @@ -1070,4 +1075,4 @@ See [CHANGELOG.md](CHANGELOG.md) for version history and migration guides. --- -**Made with ❤️ by the [Two Farm](https://www.linkedin.com/in/thetwofarm/)** +**Made with ❤️ by [The Two Farm](https://www.linkedin.com/in/thetwofarm/)** diff --git a/docs/adapters/databases.md b/docs/adapters/databases.md index 092940f..62f590e 100644 --- a/docs/adapters/databases.md +++ b/docs/adapters/databases.md @@ -6,12 +6,14 @@ Backend-specific features, capabilities, and configuration for vector databases. 
CrossVector supports 4 vector database backends: -| Backend | Nested Metadata | Metadata-Only Search | Requires Vector | License | -|---------|----------------|----------------------|-----------------|---------| -| **AstraDB** | ✅ Full | ✅ Yes | ❌ No | Proprietary | -| **ChromaDB** | ❌ Flattened | ✅ Yes | ❌ No | Apache 2.0 | -| **Milvus** | ✅ Full | ❌ No | ✅ Yes | Apache 2.0 | -| **PgVector** | ✅ Full | ✅ Yes | ❌ No | PostgreSQL | +| Backend | Nested Metadata | Metadata-Only Search | License | Recommended For | +|---------|----------------|----------------------|---------|-----------------| +| **AstraDB** | Yes | Yes | Proprietary | Cloud-hosted, serverless, auto-scaling | +| **ChromaDB** | Via Dot Notation | Yes | Apache 2.0 | Prototyping, simple deployments, cloud/local | +| **Milvus** | Yes | Yes | Apache 2.0 | Large-scale, distributed, high-performance | +| **PgVector** | Full JSONB | Yes | PostgreSQL | Existing PostgreSQL infrastructure, ACID | + +*Note: Milvus supports metadata-only search via its `query()` method, but providing a vector is recommended for optimal performance. --- @@ -21,11 +23,11 @@ DataStax Astra DB - Serverless Cassandra with vector search. 
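As described in the Schema section below, CrossVector accepts the primary key under `pk`, `id`, or `_id` (in that priority order) and stores it as `_id` internally. A rough sketch of that extraction step — `extract_pk` is a hypothetical helper for illustration, not the library's actual function:

```python
from typing import Any, Dict, Optional, Tuple

PK_FIELDS = ("pk", "id", "_id")  # checked in priority order


def extract_pk(doc: Dict[str, Any]) -> Tuple[Optional[str], Dict[str, Any]]:
    """Pop the primary key from an input dict; remaining fields become metadata."""
    rest = {k: v for k, v in doc.items() if k not in PK_FIELDS}
    for field in PK_FIELDS:
        if field in doc:
            return str(doc[field]), rest
    return None, rest  # no key supplied: caller auto-generates one


pk, fields = extract_pk({"id": "doc-123", "text": "Document content", "category": "tech"})
assert pk == "doc-123"
assert fields == {"text": "Document content", "category": "tech"}
```

The exact tie-breaking when several key fields are present is an assumption here; the point is only that one canonical id is chosen and the rest of the dict becomes metadata.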
### Features -- ✅ **Full nested metadata** - Complete JSON document support -- ✅ **Metadata-only search** - Filter without vector similarity -- ✅ **Universal operators** - All 8 operators supported -- ✅ **Scalable** - Serverless auto-scaling -- ✅ **Managed** - Fully hosted service +- **Full nested metadata** - Complete JSON document support +- **Metadata-only search** - Filter without vector similarity +- **Universal operators** - All 10 operators supported +- **Scalable** - Serverless auto-scaling +- **Managed** - Fully hosted service ### Installation @@ -57,37 +59,70 @@ ### Schema -AstraDB uses special field names: - -- `_id` - Document primary key -- `$vector` - Embedding vector -- All other fields are metadata +AstraDB accepts flexible primary key field names: -**Example document:** +```python +# The first three forms are equivalent - use your preferred convention -```json -{ - "_id": "doc-123", - "$vector": [0.1, 0.2, ...], - "text": "Document content", - "category": "tech", - "author": { - "name": "John", - "role": "admin" - } -} -``` + +# Form 1: pk (recommended - cleaner) +doc = engine.create({ + "pk": "doc-123", + "text": "Document content", + "category": "tech", + "author": {"name": "John", "role": "admin"} +}) + +# Form 2: id (common alternative) +doc = engine.create({ + "id": "doc-123", + "text": "Document content", + "category": "tech" +}) + +# Form 3: _id (legacy AstraDB style) +doc = engine.create({ + "_id": "doc-123", + "text": "Document content", + "category": "tech" +}) + +# Form 4: Dynamic (auto-generated if not provided) +doc = engine.create({ + "text": "Document content", + "category": "tech" + # id is auto-generated based on PRIMARY_KEY_MODE setting +}) +``` + +**Behind the scenes:** +- CrossVector extracts `pk`, `id`, or `_id` from input (in priority order) +- All are stored as `_id` in AstraDB (internal requirement) +- Retrieved documents have `id` field for consistency +- Other fields become metadata ### Nested 
Metadata -Full JSON document support: +Full JSON document support with dynamic and nested queries: ```python from crossvector.querydsl.q import Q -# Deep nesting +# Create with nested metadata (using pk field) +doc = engine.create({ + "pk": "article-1", + "text": "Deep learning guide", + "author": { + "name": "Alice", + "profile": {"verified": True, "tier": "premium"} + }, + "post": { + "stats": {"views": 5000, "likes": 200} + } +}) + +# Query deep nesting with double underscore notation results = engine.search( - "query", + "machine learning", where=Q(author__profile__verified=True) & Q(post__stats__views__gte=1000) ) ``` @@ -146,15 +181,15 @@ Open-source embedding database with Python-first API. ### Features -- ⚠️ **Flattened metadata** - No nested object support (auto-flattened) -- ✅ **Metadata-only search** - Filter without vector similarity -- ✅ **Multiple deployment modes** - Cloud, HTTP, or local persistence -- ✅ **Strict config validation** - Prevents conflicting settings -- ✅ **Explicit imports** - Clear dependency management -- ✅ **Lazy initialization** - Optimal resource usage -- ✅ **Common operators** - All 8 operators supported -- ✅ **In-memory/persistent** - Multiple storage backends -- ✅ **Open source** - Apache 2.0 license +- **Nested metadata via dot notation** - Access nested fields using dot syntax (e.g., `user.role`) +- **Metadata-only search** - Filter without vector similarity +- **Multiple deployment modes** - Cloud, HTTP, or local persistence +- **Strict config validation** - Prevents conflicting settings +- **Explicit imports** - Clear dependency management +- **Lazy initialization** - Optimal resource usage +- **All 10 operators** - eq, ne, gt, gte, lt, lte, in, nin, and, or supported +- **In-memory/persistent** - Multiple storage backends +- **Open source** - Apache 2.0 license ### Installation @@ -209,16 +244,16 @@ db = ChromaAdapter() # Uses CHROMA_PERSIST_DIR from env CrossVector enforces strict configuration 
validation: ```python -# ✅ Valid: Cloud only +# Valid: Cloud only CHROMA_API_KEY="..." -# ✅ Valid: HTTP only +# Valid: HTTP only CHROMA_HOST="localhost" -# ✅ Valid: Local only +# Valid: Local only CHROMA_PERSIST_DIR="./data" -# ❌ Invalid: Conflicting settings +# Invalid: Conflicting settings CHROMA_HOST="localhost" CHROMA_PERSIST_DIR="./data" # Raises: MissingConfigError with helpful message @@ -226,45 +261,66 @@ ### Schema -ChromaDB automatically flattens nested metadata: +ChromaDB automatically flattens nested metadata using dot notation: -**Input:** +**Input (nested structure):** ```python metadata = { "user": { "name": "John", - "role": "admin" + "role": "admin", + "profile": { + "verified": True + } } } ``` -**Stored as:** +**Stored as (flattened with dots):** ```python { "user.name": "John", - "user.role": "admin" + "user.role": "admin", + "user.profile.verified": True } ``` -### Nested Metadata +**Access via dot notation:** + +```python +from crossvector.querydsl.q import Q + +# Query nested fields using double underscore (converts to dot notation) +results = engine.search( + "query", + where=Q(user__role="admin") & Q(user__profile__verified=True) +) + +# Internally compiled to: {"user.role": {"$eq": "admin"}, "user.profile.verified": {"$eq": True}} +``` + +### Nested Metadata Support -Nested queries work via automatic flattening: +ChromaDB supports nested metadata through automatic dot notation flattening: ```python from crossvector.querydsl.q import Q -# This works (auto-flattened) +# Nested queries work via dot notation results = engine.search( "query", - where=Q(user__role="admin") + where=Q(user__role="admin") & Q(user__profile__verified=True) ) -# Compiled to: {"user.role": {"$eq": "admin"}} +# Compiled to: {"user.role": {"$eq": "admin"}, "user.profile.verified": {"$eq": True}} ``` -**Limitation:** Cannot query nested structures as objects. 
+**How it works:** +- Double underscore `__` in Q objects maps to dot notation `.` in storage +- Arbitrarily deep nesting is supported +- Queries are automatically flattened to match storage format ### Capabilities @@ -319,7 +375,7 @@ CHROMA_HOST="localhost" CHROMA_PERSIST_DIR="./chroma_data" # Don't mix deployment modes - causes MissingConfigError -# ❌ Don't do: CHROMA_HOST + CHROMA_PERSIST_DIR +# Don't do: CHROMA_HOST + CHROMA_PERSIST_DIR # Batch operations for efficiency engine.bulk_create(docs, batch_size=100) @@ -336,12 +392,12 @@ High-performance distributed vector database. ### Features -- ✅ **Full nested metadata** - JSON field support (via dynamic fields) -- ✅ **Metadata-only search** - Query without vector via `query()` method -- ✅ **Common operators** - All 8 operators supported -- ✅ **High performance** - Distributed architecture -- ✅ **Scalable** - Horizontal scaling -- ✅ **Lazy initialization** - Optimal resource usage +- **Full nested metadata** - JSON field support (via dynamic fields) +- **Metadata-only search** - Query without vector via `query()` method (with `supports_metadata_only=True`) +- **All 10 operators** - eq, ne, gt, gte, lt, lte, in, nin, and, or supported +- **High performance** - Distributed architecture +- **Scalable** - Horizontal scaling +- **Lazy initialization** - Optimal resource usage ### Installation @@ -388,24 +444,27 @@ Q(status__in=["active", "pending"]) # => 'status in ["active", "pending"]' ``` -### Vector Requirement +### Metadata-Only Search Support -**All queries require vector input:** +Milvus supports metadata-only search (no vector required): ```python -# ✅ Correct -results = engine.search("query text", where=Q(category="tech")) +# Correct - Metadata-only query +results = engine.search(query=None, where=Q(category="tech"), limit=10) -# ❌ Error: Milvus requires vector -results = engine.search(query=None, where=Q(category="tech")) +# Also valid - Vector + filter +results = engine.search("query 
text", where=Q(category="tech")) ``` -**Workaround for metadata-only:** +Check support: ```python -if not engine.supports_metadata_only: - # Use empty string to generate minimal vector - results = engine.search("", where=Q(status="active")) +if engine.supports_metadata_only: + # Can search without vector + results = engine.search(query=None, where=filters) +else: + # Need to provide vector + results = engine.search(vector, where=filters) ``` ### Nested Metadata @@ -429,12 +488,18 @@ results = engine.search( ```python engine = VectorEngine(db=MilvusAdapter(), embedding=...) -# Vector required +# Metadata-only search results = engine.search( - "query text", # Must provide query + query=None, where=Q(category="tech") & Q(score__gte=0.8) ) +# Vector + filter +results = engine.search( + "query text", + where=Q(status="published") & Q(priority__in=[1, 2, 3]) +) + # All operators results = engine.search( "query", @@ -452,13 +517,17 @@ results = engine.search( - **Collection limits:** Billions of vectors - **Throughput:** Very high (distributed) - **Latency:** <10ms (optimized indexes) -- **Cost:** Free (self-hosted) +- **Cost:** Free (self-hosted), pay-as-you-go (Zilliz Cloud) ### Best Practices ```python -# Always provide query vector -results = engine.search("query", where=filters) +# Use metadata-only for fast filtering +if engine.supports_metadata_only: + results = engine.search(query=None, where=filters, limit=100) + +# Combine vector and metadata +results = engine.search("query", where=Q(status="active")) # Use nested metadata metadata = { @@ -481,11 +550,11 @@ PostgreSQL extension for vector similarity search. 
### Features -- ✅ **Full nested metadata** - JSONB support with `#>>` operator -- ✅ **Metadata-only search** - Filter without vector similarity -- ✅ **Common operators** - All 8 operators with numeric casting -- ✅ **ACID transactions** - Full PostgreSQL guarantees -- ✅ **Mature ecosystem** - PostgreSQL tooling +- **Full nested metadata** - JSONB support with `#>>` operator +- **Metadata-only search** - Filter without vector similarity +- **All 10 operators** - Supported with numeric casting +- **ACID transactions** - Full PostgreSQL guarantees +- **Mature ecosystem** - PostgreSQL tooling ### Installation @@ -666,58 +735,60 @@ results = engine.search(query=None, where={"status": {"$eq": "active"}}) | Feature | AstraDB | ChromaDB | Milvus | PgVector | |---------|---------|----------|---------|----------| -| **Nested Metadata** | ✅ Full | ❌ Flattened | ✅ Full | ✅ Full (JSONB) | -| **Metadata-Only Search** | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | -| **Numeric Casting** | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Auto | -| **Transaction Support** | ❌ No | ❌ No | ❌ No | ✅ ACID | -| **Horizontal Scaling** | ✅ Auto | ❌ No | ✅ Yes | ⚠️ Read replicas | -| **Managed Service** | ✅ Yes | ✅ Cloud | ⚠️ Zilliz | ❌ Self-host | -| **Open Source** | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | +| **Nested Metadata** | Full JSON | Via Dot Notation | Full JSON | Full JSONB | +| **Metadata-Only Search** | Yes | Yes | Yes | Yes | +| **Numeric Casting** | Yes | Limited | Yes | Auto | +| **Transaction Support** | No | No | No | ACID | +| **Horizontal Scaling** | Auto | No | Yes | Read replicas | +| **Managed Service** | Yes | Cloud | Zilliz Cloud | Self-host | +| **Open Source** | No | Yes | Yes | Yes | ### Operator Support -All backends support the same 8 operators: +All backends support the same 10 operators: | Operator | AstraDB | ChromaDB | Milvus | PgVector | |----------|---------|----------|---------|----------| -| `$eq` | ✅ | ✅ | ✅ | ✅ 
| -| `$ne` | ✅ | ✅ | ✅ | ✅ | -| `$gt` | ✅ | ✅ | ✅ | ✅ | -| `$gte` | ✅ | ✅ | ✅ | ✅ | -| `$lt` | ✅ | ✅ | ✅ | ✅ | -| `$lte` | ✅ | ✅ | ✅ | ✅ | -| `$in` | ✅ | ✅ | ✅ | ✅ | -| `$nin` | ✅ | ✅ | ✅ | ✅ | +| `$eq` | Yes | Yes | Yes | Yes | +| `$ne` | Yes | Yes | Yes | Yes | +| `$gt` | Yes | Yes | Yes | Yes | +| `$gte` | Yes | Yes | Yes | Yes | +| `$lt` | Yes | Yes | Yes | Yes | +| `$lte` | Yes | Yes | Yes | Yes | +| `$in` | Yes | Yes | Yes | Yes | +| `$nin` | Yes | Yes | Yes | Yes | +| `and` (&) | Yes | Yes | Yes | Yes | +| `or` (\|) | Yes | Yes | Yes | Yes | ### Use Case Recommendations #### Choose AstraDB if -- ✅ Need managed serverless solution -- ✅ Want full nested metadata support -- ✅ Require high scalability -- ✅ Prefer pay-as-you-go pricing +- Need managed serverless solution +- Want full nested metadata support +- Require high scalability +- Prefer pay-as-you-go pricing #### Choose ChromaDB if -- ✅ Want simple setup (in-memory) -- ✅ Building prototype/MVP -- ✅ Don't need nested metadata -- ✅ Prefer open source +- Want simple setup (in-memory) +- Building prototype/MVP +- Prefer open source +- Need multiple deployment options #### Choose Milvus if -- ✅ Need maximum performance -- ✅ Have large-scale deployment (billions of vectors) -- ✅ Want distributed architecture -- ✅ All queries include vector search +- Need maximum performance +- Have large-scale deployment (billions of vectors) +- Want distributed architecture +- Need full JSON nested metadata #### Choose PgVector if -- ✅ Already using PostgreSQL -- ✅ Need ACID transactions -- ✅ Want full SQL capabilities -- ✅ Prefer mature, stable ecosystem +- Already using PostgreSQL +- Need ACID transactions +- Want full SQL capabilities +- Prefer mature, stable ecosystem --- @@ -755,7 +826,7 @@ doc = engine.create("Document text", category="tech") results = engine.search("query", where=Q(category="tech"), limit=10) ``` -**Only 
consideration:** Check `engine.supports_metadata_only` for Milvus. +**Only consideration:** Milvus now supports metadata-only search, but check `engine.supports_metadata_only` to confirm it in your deployment. --- diff --git a/docs/adapters/embeddings.md b/docs/adapters/embeddings.md index 2628bee..079f859 100644 --- a/docs/adapters/embeddings.md +++ b/docs/adapters/embeddings.md @@ -8,13 +8,14 @@ CrossVector supports multiple embedding providers, with **Google Gemini** recomm ### Comparison Matrix -| Feature | 🥇 Google Gemini | 🥈 OpenAI | +| Feature | Google Gemini | OpenAI | |---------|-----------------|-----------| -| **Best For** | **Free tier & Speed** | Quality & Ecosystem | -| **Free Tier** | ✅ **1,500 RPM (Generous)** | ❌ No (Paid only) | -| **Search Speed** | ⚡ **Fast (~200ms)** | ⚡ Fast (~400ms) | -| **Storage** | 📉 **Small (768 dims)** | 📈 Large (1536 dims) | -| **Models** | `text-embedding-004` | `text-embedding-3-small` | +| **Best For** | Free tier & Speed | Quality & Ecosystem | +| **Free Tier** | 1,500 RPM (Generous) | No (Paid only) | +| **Default Model** | `models/text-embedding-004` (768 dims) | `text-embedding-3-small` (1536 dims) | +| **Custom Dimensions** | Yes (models/gemini-embedding-001: 768, 1536, 3072) | Yes (text-embedding-3 models, via `dimensions` parameter) | +| **Search Speed** | Fast (~200ms) | Fast (~400ms) | +| **Storage** | Small (768 dims default) | Large (1536+ dims) | | **Max Tokens** | 2,048 | 8,191 | | **Cost** | Free / Low | $0.02 / 1M tokens | @@ -26,10 +27,11 @@ Google's state-of-the-art embedding models via Gemini API. ### Why Gemini? -- ✅ **Free Tier**: Up to 1,500 requests/minute for free. -- ✅ **Faster**: 1.5x faster search latency than OpenAI. +- **Free Tier**: Up to 1,500 requests/minute for free. +- **Faster**: 1.5x faster search latency than OpenAI. 
+- **Storage Efficient**: Default 768 dimensions (50% smaller than OpenAI's 1536). +- **Flexible**: Optional 1536 or 3072 dimensions if needed (models/gemini-embedding-001). +- **Quality**: Excellent performance for search and retrieval tasks. ### Installation @@ -61,46 +63,47 @@ embedding = GeminiEmbeddingAdapter(model_name="models/embedding-001") ### Available Models -#### models/text-embedding-004 (Default) +#### models/text-embedding-004 (Default) -Latest generation model, balanced for performance and quality. +Default Gemini embedding model, balanced speed and quality. -- **Dimensions:** 768 +- **Dimensions:** 768 (default) - **Max tokens:** 2,048 -- **Best for:** Most search and RAG applications. +- **Best for:** Most search, RAG, and semantic matching applications +- **Performance:** Fast, balanced quality ```python embedding = GeminiEmbeddingAdapter(model_name="models/text-embedding-004") ``` -#### models/embedding-001 +#### models/gemini-embedding-001 -Previous generation, widely supported. +State-of-the-art with flexible dimensions. -- **Dimensions:** 768 +- **Dimensions:** 1536 (default), supports 768 or 3072 - **Max tokens:** 2,048 -- **Task Types:** Supports specific task optimization. 
+- **Best for:** Applications needing custom embedding dimensions +- **Quality:** Excellent across multilingual and code tasks ```python +# Default 1536 dimensions +embedding = GeminiEmbeddingAdapter(model_name="models/gemini-embedding-001") + +# Custom dimensions embedding = GeminiEmbeddingAdapter( - model_name="models/embedding-001", - task_type="retrieval_document" + model_name="models/gemini-embedding-001", + dim=768 # 768, 1536, or 3072 ) ``` -### Task Types +#### Other Models -Optimize embeddings for specific use cases (supported by `embedding-001`): +- `models/text-embedding-005`: English and code (768 dims) +- `models/text-multilingual-embedding-002`: Multilingual (768 dims) +- `models/text-embedding-004`: Default model (768 dims, see above) ```python -# For storing documents -embedding = GeminiEmbeddingAdapter(task_type="RETRIEVAL_DOCUMENT") - -# For search queries -embedding = GeminiEmbeddingAdapter(task_type="RETRIEVAL_QUERY") - -# For semantic similarity -embedding = GeminiEmbeddingAdapter(task_type="SEMANTIC_SIMILARITY") +embedding = GeminiEmbeddingAdapter(model_name="models/text-embedding-005") ``` ### Usage Examples @@ -112,16 +115,26 @@ from crossvector import VectorEngine from crossvector.embeddings.gemini import GeminiEmbeddingAdapter from crossvector.dbs.pgvector import PgVectorAdapter -# Initialize with Gemini +# Initialize with Gemini (default: models/text-embedding-004, 768 dims) engine = VectorEngine( db=PgVectorAdapter(), embedding=GeminiEmbeddingAdapter(), collection_name="documents" ) -# Embeddings generated automatically +# Embeddings generated automatically (768 dims) doc = engine.create("Gemini embeddings are fast!") print(len(doc.vector)) # 768 + +# Or use gemini-embedding-001 with custom dimensions +engine = VectorEngine( + db=PgVectorAdapter(), + embedding=GeminiEmbeddingAdapter( + model_name="models/gemini-embedding-001", + dim=1536 # 768, 1536, or 3072 + ), + collection_name="documents" +) ``` #### Batch Processing @@ -143,9 +156,9 @@ 
OpenAI's industry-standard embedding models. ### When to use OpenAI? -- ✅ **Long Documents**: Supports up to 8,191 tokens per text. -- ✅ **High Dimensions**: Need 1536 or 3072 dimensions. -- ✅ **Ecosystem**: Already using OpenAI for LLMs. +- **Long Documents**: Supports up to 8,191 tokens per text. +- **High Dimensions**: Need 1536 or 3072 dimensions. +- **Ecosystem**: Already using OpenAI for LLMs. ### Installation @@ -222,18 +235,29 @@ You can implement your own adapter for any provider (HuggingFace, Cohere, etc.): ```python from crossvector.abc import EmbeddingAdapter -from typing import List +from typing import List class CustomEmbeddingAdapter(EmbeddingAdapter): + def __init__(self, model_name: str = "custom-model", dim: int = 384): + super().__init__(model_name=model_name, dim=dim) + # Your initialization logic + def get_embeddings(self, texts: List[str]) -> List[List[float]]: - # Your custom logic here - return [[0.1, 0.2] for _ in texts] + """Generate embeddings for texts.""" + # Your custom embedding logic here + # Must return one vector of length self.dim per input text + return [[0.0] * self.dim for _ in texts] # placeholder zero vectors
 - @property - def dimensions(self) -> int: - return 2 # Return actual dimensions +# Use your custom adapter +embedding = CustomEmbeddingAdapter(dim=384) +engine = VectorEngine(db=..., embedding=embedding) ``` +**Important:** Your adapter must: +- Inherit from `EmbeddingAdapter` +- Implement `get_embeddings(texts: List[str]) -> List[List[float]]` +- Set `dim` to match the vector dimension your model produces + --- ## Error Handling @@ -241,13 +265,17 @@ class CustomEmbeddingAdapter(EmbeddingAdapter): Handle embedding errors gracefully: ```python -from crossvector.exceptions import EmbeddingError +from crossvector.exceptions import SearchError, MissingConfigError try: - embedding.get_embeddings(["text"]) -except EmbeddingError as e: + embedding = GeminiEmbeddingAdapter() + vectors = embedding.get_embeddings(["text"]) +except MissingConfigError as e: + print(f"Missing configuration: {e.details['config_key']}") + print(f"Hint: {e.details['hint']}") +except SearchError as e: print(f"Embedding failed: {e.message}") - if "quota" in str(e).lower(): + if "rate" in str(e).lower(): print("Rate limit exceeded!") ``` diff --git a/docs/api.md b/docs/api.md index 7b0f875..84df070 100644 --- a/docs/api.md +++ b/docs/api.md @@ -72,6 +72,24 @@ if engine.supports_metadata_only: results = engine.search(query=None, where={"status": {"$eq": "active"}}) ``` +#### `engine.supports_vector_search` + +Check if the backend supports vector similarity search. + +```python +if engine.supports_vector_search: + results = engine.search("query text") +``` + +#### `engine.collection_name` + +Get the active collection name. + +```python +name = engine.collection_name +print(f"Using collection: {name}") +``` + --- ## Document Operations @@ -139,20 +157,18 @@ Create multiple documents in batch. 
```python bulk_create( docs: List[str | Dict | VectorDocument], - batch_size: int = None, + batch_size: int = 100, ignore_conflicts: bool = False, - update_conflicts: bool = False, - update_fields: List[str] = None + update_conflicts: bool = False ) -> List[VectorDocument] ``` **Parameters:** - `docs`: List of documents to create -- `batch_size`: Number of documents per batch (backend-specific default) +- `batch_size`: Number of documents per batch (default: 100) - `ignore_conflicts`: Skip documents with conflicting IDs - `update_conflicts`: Update existing documents on ID conflict -- `update_fields`: Fields to update on conflict (None = all fields) **Returns:** List of created `VectorDocument` instances @@ -231,16 +247,16 @@ Update multiple documents in batch. ```python bulk_update( docs: List[Dict | VectorDocument], - batch_size: int = None, - update_fields: List[str] = None + batch_size: int = 100, + ignore_conflicts: bool = False ) -> List[VectorDocument] ``` **Parameters:** - `docs`: List of documents to update (must include ID) -- `batch_size`: Number of documents per batch -- `update_fields`: Specific fields to update (None = all) +- `batch_size`: Number of documents per batch (default: 100) +- `ignore_conflicts`: Skip documents that don't exist instead of raising error **Returns:** List of updated `VectorDocument` instances @@ -421,12 +437,12 @@ doc, created = engine.update_or_create( Delete documents by ID. ```python -delete(ids: str | List[str]) -> int +delete(*ids) -> int ``` **Parameters:** -- `ids`: Single ID or list of IDs to delete +- `*ids`: One or more document IDs to delete **Returns:** Number of documents deleted @@ -534,49 +550,98 @@ print(f"Total documents: {total}") Delete the entire collection. 
```python -drop_collection(collection_name: str) -> bool +drop_collection(collection_name: str = None) -> bool ``` **Parameters:** -- `collection_name`: Name of collection to drop +- `collection_name`: Name of collection to drop (defaults to active collection) **Returns:** True if successful **Warning:** This permanently deletes all documents in the collection. -**Example:** +**Examples:** ```python +# Drop specific collection engine.drop_collection("old_collection") + +# Drop active collection +engine.drop_collection() ``` --- -### clear_collection() +### add_collection() -Delete all documents from the collection (keep collection structure). +Create a new collection. ```python -clear_collection() -> int +add_collection(collection_name: str, dimension: int, metric: str = "cosine") -> None ``` -**Returns:** Number of documents deleted +**Parameters:** -**Warning:** This permanently deletes all documents. +- `collection_name`: Name for the new collection +- `dimension`: Vector dimension +- `metric`: Distance metric ("cosine", "euclidean", "dot_product") **Example:** ```python -deleted = engine.clear_collection() -print(f"Deleted {deleted} documents") +engine.add_collection("new_collection", dimension=1536, metric="cosine") ``` --- -## Query DSL +### get_collection() + +Get an existing collection. + +```python +get_collection(collection_name: str = None) -> Any +``` + +**Parameters:** + +- `collection_name`: Name of collection (defaults to active collection) + +**Returns:** Collection object/handle + +**Example:** + +```python +collection = engine.get_collection("my_collection") +``` + +--- + +### get_or_create_collection() + +Get existing collection or create if it doesn't exist. 
+ +```python +get_or_create_collection(collection_name: str, dimension: int, metric: str = "cosine") -> Any +``` + +**Parameters:** -### Q Objects +- `collection_name`: Name of collection +- `dimension`: Vector dimension +- `metric`: Distance metric + +**Returns:** Collection object/handle + +**Example:** + +```python +collection = engine.get_or_create_collection("docs", dimension=1536) +``` + +--- + +## Query DSL Composable query filters. @@ -782,18 +847,14 @@ VectorDocument.from_any(input_data) # Auto-detect format ## Type Definitions ```python -from crossvector.types import Doc, DocId, DocIds +from crossvector.types import Doc -# Doc: Flexible document input +# Doc: Flexible document input for create/update operations Doc = Union[str, Dict[str, Any], VectorDocument] - -# DocId: Single document ID -DocId = Union[str, int] - -# DocIds: Single or multiple document IDs -DocIds = Union[DocId, List[DocId]] ``` +Common type for document operations. The engine automatically handles conversion between string, dict, and VectorDocument formats. + --- ## Next Steps diff --git a/docs/architecture.md b/docs/architecture.md index 3c631ff..15c8234 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -65,12 +65,15 @@ The main interface for all vector operations. 
class VectorEngine: def create(self, doc, **kwargs) -> VectorDocument def bulk_create(self, docs, **kwargs) -> List[VectorDocument] + def bulk_update(self, docs, **kwargs) -> List[VectorDocument] + def upsert(self, docs, **kwargs) -> List[VectorDocument] def update(self, doc, **kwargs) -> VectorDocument - def delete(self, ids) -> int + def delete(self, *ids) -> int def get(self, *args, **kwargs) -> VectorDocument - def search(self, query, where, limit) -> List[VectorDocument] + def search(self, query, where=None, limit=None) -> List[VectorDocument] def get_or_create(self, doc, **kwargs) -> Tuple[VectorDocument, bool] def update_or_create(self, lookup, **kwargs) -> Tuple[VectorDocument, bool] + def count(self) -> int ``` **Input Normalization:** @@ -135,19 +138,32 @@ class VectorDBAdapter(ABC): @abstractmethod def get_or_create_collection(self, collection_name, dim, metric) -> Any + @abstractmethod + def drop_collection(self, collection_name) -> bool + + @abstractmethod + def clear_collection(self) -> int + @abstractmethod def create(self, doc: VectorDocument) -> VectorDocument @abstractmethod def bulk_create(self, docs: List[VectorDocument], **kwargs) -> List[VectorDocument] + @abstractmethod + def bulk_update(self, docs: List[VectorDocument], **kwargs) -> List[VectorDocument] + + @abstractmethod + def upsert(self, docs: List[VectorDocument], **kwargs) -> List[VectorDocument] + @abstractmethod def search( self, vector: List[float] | None, - where: Dict[str, Any] | None, limit: int, - offset: int + offset: int, + where: Dict[str, Any] | None, + fields: Set[str] | None ) -> List[VectorDocument] @abstractmethod @@ -157,7 +173,7 @@ class VectorDBAdapter(ABC): def update(self, doc: VectorDocument, **kwargs) -> VectorDocument @abstractmethod - def delete(self, ids: DocIds) -> int + def delete(self, *ids) -> int @abstractmethod def count(self) -> int @@ -203,8 +219,8 @@ class EmbeddingAdapter(ABC): def get_embeddings(self, texts: List[str]) -> List[List[float]] @property -
@abstractmethod - def dimensions(self) -> int + def dim(self) -> int + """The dimension of embeddings generated by the model.""" ``` **Implementation Example:** @@ -214,15 +230,15 @@ class GeminiEmbeddingAdapter(EmbeddingAdapter): def __init__(self, api_key, model_name="models/text-embedding-004"): self.api_key = api_key self.model_name = model_name - self._dimensions = 768 + self._dim = 768 def get_embeddings(self, texts: List[str]) -> List[List[float]]: # Implementation detail... return vectors @property - def dimensions(self) -> int: - return self._dimensions + def dim(self) -> int: + return self._dim ``` --- @@ -537,17 +553,22 @@ Implement `VectorDBAdapter`: ```python class CustomDBAdapter(VectorDBAdapter): - SUPPORTS_METADATA_ONLY = True + supports_metadata_only = True + where_compiler = CustomWhereCompiler() - def add_collection(self, collection_name, dimension): - # Implementation + def initialize(self, collection_name, dim, metric, **kwargs): + # Initialize client and collection pass - def insert(self, collection_name, documents): - # Implementation + def create(self, doc: VectorDocument) -> VectorDocument: + # Insert document + pass + + def search(self, vector, limit, offset, where, fields): + # Perform search pass - # ... other methods + # ... 
implement all other abstract methods ``` ### Custom Embedding Adapter @@ -556,13 +577,16 @@ Implement `EmbeddingAdapter`: ```python class CustomEmbeddingAdapter(EmbeddingAdapter): - def get_embeddings(self, texts): + def __init__(self, model_name, dim=None): + super().__init__(model_name, dim) + + def get_embeddings(self, texts: List[str]) -> List[List[float]]: # Implementation return vectors @property - def dimensions(self): - return 768 + def dim(self) -> int: + return self._dim ``` ### Custom WhereCompiler @@ -581,18 +605,29 @@ class CustomWhereCompiler(WhereCompiler): ### Custom PK Factory -Provide callable to generate IDs: +Provide dotted path to callable for ID generation: ```python -def custom_pk_factory() -> str: - return f"doc-{int(time.time())}-{random.randint(1000, 9999)}" +# In mymodule.py +def custom_id_generator(text: str | None, vector: List[float] | None, metadata: Dict) -> str: + """Generate custom ID from text, vector, or metadata.""" + if text: + return f"doc-{text[:20]}-{uuid.uuid4()}" + return f"doc-{uuid.uuid4()}" -settings = CrossVectorSettings( - PK_STRATEGY="custom", - PK_FACTORY=custom_pk_factory -) +# In .env or settings +PRIMARY_KEY_MODE=custom +PRIMARY_KEY_FACTORY=mymodule.custom_id_generator +``` + +**Or via code:** + +```python +from crossvector.settings import settings -engine = VectorEngine(db=..., embedding=..., settings=settings) +# Set custom factory +settings.PRIMARY_KEY_MODE = "custom" +settings.PRIMARY_KEY_FACTORY = "mymodule.custom_id_generator" ``` --- @@ -604,11 +639,11 @@ engine = VectorEngine(db=..., embedding=..., settings=settings) Use bulk operations for efficiency: ```python -# โœ… Good: Batch insert +# Good: Batch insert docs = [{"text": f"Doc {i}"} for i in range(1000)] engine.bulk_create(docs, batch_size=100) -# โŒ Bad: Individual inserts +# Bad: Individual inserts for doc in docs: engine.create(doc) ``` @@ -752,11 +787,11 @@ def test_documents(): ### API Key Management ```python -# โœ… Good: Environment variables 
+# Good: Environment variables import os api_key = os.getenv("GEMINI_API_KEY") -# โŒ Bad: Hard-coded +# Bad: Hard-coded api_key = "sk-..." ``` diff --git a/docs/benchmarking.md b/docs/benchmarking.md index ef68304..5822717 100644 --- a/docs/benchmarking.md +++ b/docs/benchmarking.md @@ -30,16 +30,22 @@ python scripts/benchmark.py --backends pgvector milvus --embedding-providers ope python scripts/benchmark.py [OPTIONS] Options: - --num-docs INT Number of documents to test (default: 1000) - --backends NAME [NAME ...] Specific backends: pgvector, astradb, milvus, chroma - --embedding-providers NAME Embedding providers: openai, gemini - --output PATH Output file path (default: benchmark.md) + --num-docs INT Number of documents to test (default: 1000) + --backends NAME [NAME ...] Specific backends: pgvector, astradb, milvus, chroma + --embedding-providers NAME Embedding providers: openai, gemini + --skip-slow Skip slow cloud backends (astradb, milvus) for faster testing + --search-limit INT Number of results to return in search operations (default: 100) + --collection-name STR Custom collection name (default: auto-generate with UUID8) + --timeout INT Timeout per backend test in seconds (default: 60) + --output PATH Output file path (default: benchmark.md) + --use-fixtures PATH Path to pre-generated fixtures JSON file + --add-vectors Generate and add vectors to fixture documents ``` ## What Gets Measured -### 1. Bulk Create Performance -Measures throughput for batch document insertion with automatic embedding generation. +### 1. Upsert Performance +Measures throughput for batch document upsert with automatic embedding generation. 
**Metrics:** - Duration (seconds) @@ -203,14 +209,14 @@ Results are saved as a markdown file (default: `benchmark.md`) with: | Backend | Embedding | Model | Dim | Upsert | Search (avg) | Update (avg) | Delete (batch) | Status | |---------|-----------|-------|-----|--------|--------------|--------------|----------------|--------| -| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms | โœ… | -| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s | โœ… | -| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s | โœ… | -| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s | โœ… | -| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms | โœ… | -| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s | โœ… | -| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s | โœ… | -| chroma | gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s | โœ… | +| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms | OK | +| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s | OK | +| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s | OK | +| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s | OK | +| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms | OK | +| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s | OK | +| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s | OK | +| chroma | gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s | OK | ``` ### Interpreting Metrics @@ -281,7 +287,7 @@ Or use a markdown diff tool for better 
visualization. If you see: ``` -โš ๏ธ AstraDB not available: Missing ASTRADB_API_ENDPOINT +AstraDB not available: Missing ASTRADB_API_ENDPOINT ``` Solution: Set the required environment variables or the backend will be skipped. @@ -290,7 +296,7 @@ Solution: Set the required environment variables or the backend will be skipped. If you see rate limit errors: ``` -โŒ bulk_create failed: Rate limit exceeded +bulk_create failed: Rate limit exceeded ``` Solutions: @@ -330,7 +336,7 @@ Extend `benchmark_backend()` method to add custom metrics: # After existing benchmarks, add: # Custom metric -print(f"\n7๏ธโƒฃ Custom Metric...") +print("\nCustom Metric...") duration, result = benchmark_operation("custom", lambda: engine.custom_operation()) results["custom_metric"] = {"duration": duration} ``` @@ -392,4 +398,4 @@ python scripts/benchmark.py --backends milvus --num-docs 5000 ## Contributing -Found a performance issue or want to add a new benchmark metric? See [Contributing Guide](contributing.md#performance-testing). +Found a performance issue or want to add a new benchmark metric? See [Contributing Guide](contributing.md#benchmarking). diff --git a/docs/configuration.md b/docs/configuration.md index a8adfe4..903121d 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -46,7 +46,7 @@ PGVECTOR_PASSWORD=postgres # Vector Settings VECTOR_METRIC=cosine -VECTOR_STORE_TEXT=true +VECTOR_STORE_TEXT=false VECTOR_DIM=1536 VECTOR_SEARCH_LIMIT=10 @@ -82,8 +82,9 @@ GEMINI_API_KEY=AI... 
# Required: Your Gemini API key Supported models (defaults to `gemini-embedding-001`): -- `gemini-embedding-001` (768-3072 dims, recommended) -- `text-embedding-005` (768 dims) +- `gemini-embedding-001` (768-3072 dims, unified state-of-the-art model, recommended) +- `text-embedding-005` (768 dims, English/code specialist, legacy) +- `text-multilingual-embedding-002` (768 dims, multilingual specialist, legacy) - `text-embedding-004` (768 dims, legacy) #### Shared Embedding Model (Optional) @@ -145,7 +146,7 @@ ChromaDB adapter uses strict configuration validation with this priority: **Important:** Cannot set both `CHROMA_HOST` and `CHROMA_PERSIST_DIR` simultaneously. This will raise `MissingConfigError` with a helpful message explaining the conflict. ```python -# โœ… Valid configurations: +# Valid configurations: # Cloud only CHROMA_API_KEY="..." @@ -155,7 +156,7 @@ CHROMA_HOST="localhost" # Local only CHROMA_PERSIST_DIR="./data" -# โŒ Invalid - raises MissingConfigError: +# Invalid - raises MissingConfigError: CHROMA_HOST="localhost" CHROMA_PERSIST_DIR="./data" # Conflict! 
``` @@ -192,8 +193,8 @@ PGVECTOR_PASSWORD=postgres # Optional: Default postgres VECTOR_METRIC=cosine # Options: cosine, euclidean, dot_product -# Whether to store original text with vectors -VECTOR_STORE_TEXT=true +# Whether to store original text with vectors (default: false) +VECTOR_STORE_TEXT=false # Options: true, false # Default embedding dimension (informational) diff --git a/docs/contributing.md b/docs/contributing.md index 931ca98..4f050e4 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -282,14 +282,16 @@ class NewDBAdapter(VectorDBAdapter): ```python # src/crossvector/querydsl/compilers/newdb.py -from crossvector.querydsl.compilers.base import WhereCompiler +from crossvector.querydsl.compilers.base import BaseWhere from typing import Dict, Any -class NewDBWhereCompiler(WhereCompiler): +class NewDBWhereCompiler(BaseWhere): """Compile filters for NewDB.""" - SUPPORTS_NESTED = True - REQUIRES_VECTOR = False + # Capability flags + SUPPORTS_NESTED = True # Supports nested fields + REQUIRES_VECTOR = False # Can search metadata-only + REQUIRES_AND_WRAPPER = False # Multiple fields use implicit AND _OP_MAP = { "$eq": "==", @@ -302,9 +304,13 @@ class NewDBWhereCompiler(WhereCompiler): "$nin": "not in", } - def compile(self, where: Dict[str, Any]) -> str: + def to_where(self, where: Dict[str, Any]) -> str: """Compile to NewDB filter format.""" pass + + def to_expr(self, where: Dict[str, Any]) -> str: + """Convert to expression string.""" + pass ``` 1. **Add tests:** @@ -357,18 +363,12 @@ class NewProviderEmbeddingAdapter(EmbeddingAdapter): model_name: str = "default-model" ): self.api_key = api_key - self.model_name = model_name - self._dimensions = 768 + super().__init__(model_name=model_name, dim=768) def get_embeddings(self, texts: List[str]) -> List[List[float]]: """Generate embeddings for texts.""" # Implementation pass - - @property - def dimensions(self) -> int: - """Return embedding dimensions.""" - return self._dimensions ``` 1. 
**Add tests:** @@ -604,8 +604,8 @@ twine upload dist/* ### Getting Help - Check existing [documentation](https://thewebscraping.github.io/crossvector/) -- Search [issues](https://github.com/yourusername/crossvector/issues) -- Ask in [discussions](https://github.com/yourusername/crossvector/discussions) +- Search [issues](https://github.com/thewebscraping/crossvector/issues) +- Ask in [discussions](https://github.com/thewebscraping/crossvector/discussions) ### Reporting Bugs @@ -668,4 +668,4 @@ Feel free to ask questions in: - GitHub Discussions (for general questions) - Pull Request comments (for specific code questions) -Thank you for contributing to CrossVector! ๐ŸŽ‰ +Thank you for contributing to CrossVector! diff --git a/docs/index.md b/docs/index.md index af78fd8..2b367c4 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,12 +8,12 @@ CrossVector provides a consistent, high-level API across multiple vector databas ## Key Features -- **๐Ÿ”Œ Pluggable Architecture**: 4 vector databases, 2 embedding providers, lazy initialization -- **๐ŸŽฏ Unified API**: Consistent interface across all adapters with standardized error handling -- **๐Ÿ” Advanced Querying**: Type-safe Query DSL with Q objects -- **๐Ÿš€ Performance**: Automatic batch embedding, bulk operations, lazy client initialization -- **๐Ÿ›ก๏ธ Type-Safe**: Full Pydantic v2 validation and structured exceptions -- **โš™๏ธ Flexible Configuration**: Environment variables, explicit config validation, multiple PK strategies +- **Pluggable Architecture**: 4 vector databases, 2 embedding providers, lazy initialization +- **Unified API**: Consistent interface across all adapters with standardized error handling +- **Advanced Querying**: Type-safe Query DSL with Q objects +- **Performance**: Automatic batch embedding, bulk operations, lazy client initialization +- **Type-Safe**: Full Pydantic v2 validation and structured exceptions +- **Flexible Configuration**: Environment variables, explicit config validation, 
multiple PK strategies ## Quick Navigation @@ -41,7 +41,7 @@ CrossVector provides a consistent, high-level API across multiple vector databas ## Quick Example -> ๐Ÿ’ก **Recommended**: Use Gemini for free tier and faster performance. [See why โ†’](quickstart.md) +> **Recommended**: Use Gemini for free tier and faster performance. [See why โ†’](quickstart.md) ```python from crossvector import VectorEngine @@ -61,9 +61,9 @@ results = engine.search("vector database library", limit=5) ``` **Why Gemini?** -- โœ… Free API tier (1,500 RPM) -- โœ… 1.5x faster search than OpenAI -- โœ… 50% smaller vectors (768 vs 1536 dims) +- Free API tier (1,500 RPM) +- 1.5x faster search than OpenAI +- 50% smaller vectors (768 vs 1536 dims) **With OpenAI?** [See alternative setup โ†’](quickstart.md#using-openai-instead) @@ -79,20 +79,20 @@ results = engine.search( | Feature | AstraDB | ChromaDB | Milvus | PgVector | |---------|---------|----------|--------|----------| -| Vector Search | โœ… | โœ… | โœ… | โœ… | -| Metadata-Only Search | โœ… | โœ… | โœ… | โœ… | -| Nested Metadata | โœ… | โœ…* | โŒ | โœ… | -| Numeric Comparisons | โœ… | โœ… | โœ… | โœ… | -| Lazy Initialization | โœ… | โœ… | โœ… | โœ… | -| Config Validation | โœ… | โœ… | โœ… | โœ… | +| Vector Search | Yes | Yes | Yes | Yes | +| Metadata-Only Search | Yes | Yes | Yes | Yes | +| Nested Metadata | Yes | Yes* | No | Yes | +| Numeric Comparisons | Yes | Yes | Yes | Yes | +| Lazy Initialization | Yes | Yes | Yes | Yes | +| Config Validation | Yes | Yes | Yes | Yes | -*ChromaDB supports nested metadata via dot-notation when flattened. +ChromaDB supports nested metadata via dot-notation when flattened. ## Status **Current Version**: 0.1.0 (Beta) -โš ๏ธ **Beta Status**: CrossVector is currently in beta. Do not use in production until version 1.0. +**Beta Status**: CrossVector is currently in beta. Do not use in production until version 1.0. 
- API may change without notice - Database schemas may evolve @@ -100,9 +100,9 @@ results = engine.search( **Recommended for:** -- โœ… Prototyping and development -- โœ… Learning vector databases -- โŒ Production applications +- Prototyping and development +- Learning vector databases +- Production applications ## Support @@ -112,4 +112,4 @@ results = engine.search( ## License -CrossVector is released under the MIT License. See [LICENSE](../LICENSE) for details. +CrossVector is released under the MIT License. See [LICENSE](https://github.com/thewebscraping/crossvector/blob/main/LICENSE) for details. diff --git a/docs/installation.md b/docs/installation.md index 4d71f26..5ed5236 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -78,6 +78,30 @@ pip install crossvector[all] This includes all backends and all embedding providers. +### 6. From Git Repository + +Install directly from the GitHub repository: + +```bash +# Latest main branch +pip install git+https://github.com/thewebscraping/crossvector.git + +# With specific extras +pip install git+https://github.com/thewebscraping/crossvector.git#egg=crossvector[astradb,openai] + +# Specific branch +pip install git+https://github.com/thewebscraping/crossvector.git@main#egg=crossvector[all] + +# Specific tag/version +pip install git+https://github.com/thewebscraping/crossvector.git@v1.0.0#egg=crossvector +``` + +This is useful for: + +- Testing development versions before release +- Contributing to the project +- Using features from a specific branch + ## Optional Dependencies Reference ### Database Adapters @@ -86,7 +110,7 @@ This includes all backends and all embedding providers. 
|-------|----------|----------| | `astradb` | `astrapy>=2.1.0` | AstraDB serverless | | `chromadb` | `chromadb>=1.3.4` | ChromaDB cloud/local | -| `milvus` | `pymilvus>=2.6.3` | Milvus/Zilliz cloud | +| `milvus` | `pymilvus>=2.6.4` | Milvus/Zilliz cloud | | `pgvector` | `pgvector>=0.4.1`, `psycopg2-binary>=2.9.11` | PostgreSQL with pgvector extension | | `all-dbs` | All of the above | All backends | @@ -122,7 +146,7 @@ print(crossvector.__version__) from crossvector import VectorEngine, VectorDocument from crossvector.querydsl.q import Q -print("โœ… CrossVector installed successfully!") +print("CrossVector installed successfully!") ``` ## Upgrading @@ -133,10 +157,10 @@ To upgrade to the latest version: pip install --upgrade crossvector[your-extras] ``` -**Important**: During beta, pin to specific versions to avoid breaking changes: +**Important**: Pin to specific versions for reproducible environments: ```bash -pip install crossvector[astradb,openai]==0.1.0 +pip install crossvector[astradb,openai]==1.0.0 ``` ## Troubleshooting @@ -175,7 +199,7 @@ pip install crossvector[your-extras] For reproducible environments, use a requirements.txt: ```txt -crossvector[astradb,openai]==0.1.0 +crossvector[astradb,openai]==1.0.0 # Or with specific dependencies astrapy==2.1.0 openai==2.6.1 diff --git a/docs/querydsl.md b/docs/querydsl.md index 14b1494..884013d 100644 --- a/docs/querydsl.md +++ b/docs/querydsl.md @@ -68,10 +68,10 @@ Q(config__settings__enabled=True) | Backend | Nested Metadata | |---------|-----------------| -| AstraDB | โœ… Full support | -| PgVector | โœ… Full support | -| Milvus | โœ… Full support | -| ChromaDB | โŒ Flattened (auto-converted) | +| AstraDB | Full support | +| PgVector | Full support | +| Milvus | Full support | +| ChromaDB | Via dot notation | --- @@ -79,25 +79,30 @@ Q(config__settings__enabled=True) CrossVector supports 8 universal operators that work across all backends: -### Equality Operators +### Field Operators -#### `eq` - Equal +These 8 
operators work on field values and are compiled to backend-specific syntax: -```python -# Explicit -Q(status__eq="active") - -# Implicit (default) -Q(status="active") -``` +| Operator | Usage | Example | +|----------|-------|---------| +| `eq` | Equal | `Q(status="active")` or `Q(status__eq="active")` | +| `ne` | Not equal | `Q(status__ne="deleted")` | +| `gt` | Greater than | `Q(score__gt=0.8)` | +| `gte` | Greater than or equal | `Q(score__gte=0.8)` | +| `lt` | Less than | `Q(price__lt=100)` | +| `lte` | Less than or equal | `Q(stock__lte=10)` | +| `in` | In array | `Q(status__in=["active", "pending"])` | +| `nin` | Not in array | `Q(status__nin=["deleted", "banned"])` | -**Filter format:** `{"status": {"$eq": "active"}}` +### Boolean Operators -#### `ne` - Not Equal +These are used to combine Q objects (not field operators): -```python -Q(status__ne="deleted") -``` +| Operator | Symbol | Example | +|----------|--------|---------| +| AND | `&` | `Q(category="tech") & Q(level="beginner")` | +| OR | `\|` | `Q(featured=True) \| Q(score__gte=0.9)` | +| NOT | `~` | `~Q(archived=True)` | **Filter format:** `{"status": {"$ne": "deleted"}}` @@ -143,10 +148,6 @@ Q(temperature__lte=30) **Filter format:** `{"stock": {"$lte": 10}}` ---- - -### Membership Operators - #### `in` - In Array ```python @@ -251,19 +252,16 @@ results = engine.search( Some backends support filtering without vector search: ```python -# AstraDB, PgVector, ChromaDB support +# AstraDB, PgVector, ChromaDB, Milvus support metadata-only search results = engine.search( query=None, # No vector search where=Q(status="published") & Q(category="tech"), limit=50 ) -# Milvus requires vector +# Always check backend support if engine.supports_metadata_only: results = engine.search(query=None, where={"status": {"$eq": "active"}}) -else: - # Provide dummy query for Milvus - results = engine.search("", where={"status": {"$eq": "active"}}) ``` ### Get Document with Filters @@ -505,10 +503,10 @@ except InvalidFieldError 
as e: ```python # Correct: numeric comparison with number -Q(score__gt=0.8) # โœ… +Q(score__gt=0.8) # Incorrect: numeric comparison with string (backend-dependent) -Q(score__gt="0.8") # โš ๏ธ May fail on some backends +Q(score__gt="0.8") # May fail on some backends ``` **Best Practice:** Use correct types for comparisons: @@ -531,13 +529,13 @@ Q(featured=True, archived=False) ### Index-Friendly Queries ```python -# โœ… Good: Simple equality on indexed field +# Good: Simple equality on indexed field Q(category="tech") -# โœ… Good: Range on indexed numeric field +# Good: Range on indexed numeric field Q(created_at__gte=timestamp) -# โš ๏ธ Slower: Complex nested queries +# Slower: Complex nested queries Q(user__profile__settings__theme="dark") ``` diff --git a/docs/quickstart.md b/docs/quickstart.md index 41ea6d5..4ce9fe8 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -2,7 +2,7 @@ Get started with CrossVector in 5 minutes. -> ๐Ÿ’ก **Recommended**: This guide uses **Gemini** (free tier, faster). For OpenAI, see [alternative setup](#using-openai-instead). +> **Recommended**: This guide uses **Gemini** (free tier, faster). For OpenAI, see [alternative setup](#using-openai-instead). 
## Prerequisites @@ -428,8 +428,8 @@ engine = VectorEngine( ```python from crossvector.exceptions import ( MissingConfigError, - DocumentNotFoundError, - CollectionNotFoundError + DoesNotExist, + MultipleObjectsReturned ) try: @@ -442,12 +442,17 @@ try: ) except MissingConfigError as e: print(f"Configuration error: {e}") - print(f"Hint: {e.hint}") # Helpful resolution guidance + print(f"Hint: {e.details.get('hint', 'N/A')}") try: doc = engine.get(id="nonexistent-id") -except DocumentNotFoundError as e: - print(f"Document not found: {e.document_id}") +except DoesNotExist as e: + print(f"Document not found: {e.message}") + +try: + doc = engine.get(status="active") # Multiple matches +except MultipleObjectsReturned as e: + print(f"Multiple documents found: {e.message}") ``` ## Using OpenAI Instead @@ -482,15 +487,19 @@ engine = VectorEngine( # Rest of the code is identical ``` -**OpenAI Models:** -- `text-embedding-3-small` (1536 dims, default) -- `text-embedding-3-large` (3072 dims) +**Embedding Models Comparison:** + +| Model | Provider | Dimensions | Cost | Speed | +|-------|----------|-----------|------|-------| +| `gemini-embedding-001` | Gemini | 768-3072 (configurable) | Free | Fast | +| `text-embedding-3-small` | OpenAI | 1536 | Paid | Slightly slower | +| `text-embedding-3-large` | OpenAI | 3072 | Paid | Slower | + +**Recommendation:** -**Trade-offs:** -- โœ… Slightly higher accuracy for some use cases -- โœ… Larger vectors (1536 vs 768 dims) -- โŒ Requires paid API key -- โŒ 1.5x slower search than Gemini +- **Start with Gemini** (free tier, fast, good quality) +- **Use OpenAI** if you need higher accuracy for specific use cases +- **Configure dimensions** with `gemini-embedding-001`: 768, 1536, or 3072 dims --- diff --git a/docs/schema.md b/docs/schema.md index 3b4e175..826fec4 100644 --- a/docs/schema.md +++ b/docs/schema.md @@ -209,25 +209,30 @@ doc = VectorDocument.from_any(existing_doc) #### `to_vector()` -Extract vector as list or numpy array. 
+Extract vector in various formats. ```python to_vector( - require: bool = True, - output_format: str = "list" -) -> List[float] | np.ndarray | None + require: bool = False, + output_format: Literal["dict", "json", "str", "list"] = "list" +) -> Any ``` **Parameters:** - `require`: Raise error if vector missing -- `output_format`: `"list"` or `"numpy"` +- `output_format`: Desired format: + - `"list"` (default): Python list of floats + - `"dict"`: `{"vector": [...]}` wrapper + - `"json"`: JSON string representation + - `"str"`: String representation **Examples:** ```python -vector = doc.to_vector() # List[float] -vector = doc.to_vector(output_format="numpy") # np.ndarray +vector = doc.to_vector() # [0.1, 0.2, ...] +vector = doc.to_vector(output_format="dict") # {"vector": [0.1, 0.2, ...]} +vector = doc.to_vector(output_format="json") # '[0.1, 0.2, ...]' vector = doc.to_vector(require=False) # None if missing ``` @@ -307,11 +312,11 @@ metadata = { ### Nested Metadata (Backend Support) | Backend | Nested Support | Query Format | -|---------|----------------|--------------| -| AstraDB | โœ… Full | `{"user.role": {"$eq": "admin"}}` | -| PgVector | โœ… Full | `{"user.role": {"$eq": "admin"}}` | -| ChromaDB | โŒ Flattened | `{"user.role": {"$eq": "admin"}}` (auto-flattened) | -| Milvus | โœ… Full | `{"user.role": {"$eq": "admin"}}` | +|---------|----------------|---| +| AstraDB | Full | `{"user.role": {"$eq": "admin"}}` | +| PgVector | Full | `{"user.role": {"$eq": "admin"}}` | +| ChromaDB | Via dot notation | `{"user.role": {"$eq": "admin"}}` (auto-flattened) | +| Milvus | Full | `{"user.role": {"$eq": "admin"}}` | **Example with nested metadata:** @@ -400,7 +405,7 @@ Generate UUID v4 strings. 
```python from crossvector.settings import CrossVectorSettings -settings = CrossVectorSettings(PK_STRATEGY="uuid") +settings = CrossVectorSettings(PRIMARY_KEY_MODE="uuid") # Generated IDs: "a1b2c3d4-e5f6-7890-abcd-ef1234567890" ``` @@ -409,7 +414,7 @@ settings = CrossVectorSettings(PK_STRATEGY="uuid") Hash document text using SHA256. ```python -settings = CrossVectorSettings(PK_STRATEGY="hash_text") +settings = CrossVectorSettings(PRIMARY_KEY_MODE="hash_text") # Generated IDs: "5f4dcc3b5aa765d61d8327deb882cf99" doc = engine.create("Hello world") @@ -423,7 +428,7 @@ doc = engine.create("Hello world") Hash embedding vector using SHA256. ```python -settings = CrossVectorSettings(PK_STRATEGY="hash_vector") +settings = CrossVectorSettings(PRIMARY_KEY_MODE="hash_vector") # Generated IDs: "7b8e4d2a9c1f3e5d6a0b4c8e2f7d9a1b" doc = engine.create(vector=[0.1, 0.2, ...]) @@ -435,7 +440,7 @@ doc = engine.create(vector=[0.1, 0.2, ...]) Generate random 64-bit integers. ```python -settings = CrossVectorSettings(PK_STRATEGY="int64") +settings = CrossVectorSettings(PRIMARY_KEY_MODE="int64") # Generated IDs: 7234567890123456789 ``` @@ -444,7 +449,7 @@ settings = CrossVectorSettings(PK_STRATEGY="int64") Use backend's native auto-generation (if supported). 
```python -settings = CrossVectorSettings(PK_STRATEGY="auto") +settings = CrossVectorSettings(PRIMARY_KEY_MODE="auto") # Backend-specific ID generation ``` @@ -460,7 +465,7 @@ def my_id_factory() -> str: return f"doc-{int(time.time())}" settings = CrossVectorSettings( - PK_STRATEGY="custom", + PRIMARY_KEY_MODE="custom", PK_FACTORY=my_id_factory ) diff --git a/mkdocs.yml b/mkdocs.yml index d182263..ab9afaa 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -34,6 +34,11 @@ nav: - Database Adapters: adapters/databases.md - Embedding Adapters: adapters/embeddings.md - API Reference: api.md + - Advanced: + - Architecture: architecture.md + - Query DSL: querydsl.md + - Schema: schema.md + - Benchmarking: benchmarking.md - Contributing: contributing.md plugins: diff --git a/scripts/benchmark/README.md b/scripts/benchmark/README.md index 3a427a3..9f30eda 100644 --- a/scripts/benchmark/README.md +++ b/scripts/benchmark/README.md @@ -132,10 +132,10 @@ python scripts/benchmark/run.py --use-fixtures scripts/benchmark/fixtures.json - ### Benefits -โœ… **Cost-effective**: No OpenAI/Gemini API calls for embedding -โœ… **Fast testing**: Static vectors generated instantly -โœ… **Reproducible**: Fixed seed ensures consistent vectors -โœ… **Scalable**: Test with 1000+ documents for pennies +- **Cost-effective**: No OpenAI/Gemini API calls for embedding +- **Fast testing**: Static vectors generated instantly +- **Reproducible**: Fixed seed ensures consistent vectors +- **Scalable**: Test with 1000+ documents for pennies ### Example Workflow diff --git a/scripts/benchmark/generate_fixtures.py b/scripts/benchmark/generate_fixtures.py index 06a68e8..8eeacae 100644 --- a/scripts/benchmark/generate_fixtures.py +++ b/scripts/benchmark/generate_fixtures.py @@ -307,27 +307,27 @@ def main(): # Generate fixtures print("\n" + "=" * 70) - print("๐Ÿš€ Generating Benchmark Fixtures") + print("Generating Benchmark Fixtures") print("=" * 70) - print(f"๐Ÿ“Š Documents: {args.docs:,}") - print(f"๐Ÿ” Queries: 
{args.queries:,}") - print(f"๐Ÿ’พ Output: {args.output}") - print(f"๐ŸŒฑ Seed: {args.seed}") + print(f"Documents: {args.docs:,}") + print(f"Queries: {args.queries:,}") + print(f"Output: {args.output}") + print(f"Seed: {args.seed}") if args.add_vectors: - print(f"๐Ÿค– Embedding: {args.embedding_provider.upper()}") + print(f"Embedding: {args.embedding_provider.upper()}") print("=" * 70 + "\n") - print(f"๐Ÿ“ Generating {args.docs:,} documents with nested metadata...") + print(f"Generating {args.docs:,} documents with nested metadata...") docs = generate_benchmark_docs(args.docs, seed=args.seed) - print(f"โœ… Generated {len(docs):,} documents\n") + print(f"Generated {len(docs):,} documents\n") - print(f"๐Ÿ” Generating {args.queries:,} diverse search queries...") + print(f"Generating {args.queries:,} diverse search queries...") queries = generate_search_queries(args.queries, seed=args.seed) - print(f"โœ… Generated {len(queries):,} queries\n") + print(f"Generated {len(queries):,} queries\n") # Generate vectors if requested if args.add_vectors: - print(f"๐Ÿค– Generating vectors using {args.embedding_provider.upper()} embedding...") + print(f"Generating vectors using {args.embedding_provider.upper()} embedding...") try: if args.embedding_provider == "openai": @@ -359,17 +359,17 @@ def main(): for doc, vector in zip(batch, vectors): doc["vector"] = vector - print(f"โœ… Generated {total_docs:,} vectors using {args.embedding_provider.upper()}\n") + print(f"Generated {total_docs:,} vectors using {args.embedding_provider.upper()}\n") except Exception as e: - print(f"โš ๏ธ Failed to generate vectors: {e}") + print(f"Failed to generate vectors: {e}") print(" Fixtures will be saved without vectors\n") # Calculate statistics total_text_length = sum(len(doc["text"]) for doc in docs) avg_text_length = total_text_length / len(docs) - print("๐Ÿ“Š Fixture Statistics:") + print("Fixture Statistics:") print(f" โ€ข Total documents: {len(docs):,}") print(f" โ€ข Total queries: 
{len(queries):,}") print(f" โ€ข Avg text length: {avg_text_length:.0f} chars") @@ -400,12 +400,12 @@ def main(): file_size_mb = output_path.stat().st_size / 1024 / 1024 print("=" * 70) - print("โœ… Fixtures Generated Successfully!") + print("Fixtures Generated Successfully!") print("=" * 70) - print(f"๐Ÿ“ File: {output_path.absolute()}") - print(f"๐Ÿ’พ Size: {file_size_mb:.1f} MB") - print(f"๐Ÿ“ Documents: {len(docs):,}") - print(f"๐Ÿ” Queries: {len(queries):,}") + print(f"File: {output_path.absolute()}") + print(f"Size: {file_size_mb:.1f} MB") + print(f"Documents: {len(docs):,}") + print(f"Queries: {len(queries):,}") print("=" * 70 + "\n") print("๐Ÿ’ก Usage in benchmark:") diff --git a/scripts/benchmark/run.py b/scripts/benchmark/run.py index 00f7461..81a69c3 100644 --- a/scripts/benchmark/run.py +++ b/scripts/benchmark/run.py @@ -24,6 +24,36 @@ # Custom output file python scripts/benchmark.py --output results/my_benchmark.md + +IMPORTANT NOTES ON BENCHMARK RESULTS: +===================================== + +Results vary significantly based on deployment environment: + +1. **PgVector**: Benchmarks are run against LOCAL PostgreSQL instance + - Provides optimal latency and consistent performance + - Results NOT comparable with cloud-hosted PgVector + - For fair comparison: deploy PgVector in same region/network as cloud backends + +2. **Cloud Backends** (AstraDB, Milvus, ChromaDB): Results affected by: + - Network latency and geographic region + - Regional proximity between client and server + - Network conditions and bandwidth availability + - Server load and resource allocation + +3. **For Fair Comparison**: + - Run benchmarks in your actual production environment + - Ensure all backends deployed in same region + - Use consistent network conditions across all backends + - Account for network latency when interpreting results + +4. 
**Embedding Providers**: API-based providers (OpenAI, Gemini) + - API call latency included in embedding generation time + - Batch sizes and rate limits affect overall performance + - Static vectors used during benchmark (skip embedding API calls for DB isolation) + +RECOMMENDATION: Conduct benchmarks in YOUR production environment with real network +conditions to get accurate, meaningful results for your specific use case. """ import argparse @@ -170,18 +200,18 @@ def load_fixtures_from_file( # Generate and add vectors if requested and documents lack them if add_vectors and not has_vectors: - print(f"๐Ÿ“Š Generating vectors for {len(documents)} documents...") + print(f"Generating vectors for {len(documents)} documents...") generated_vectors = generate_fixture_vectors(len(documents)) for i, doc in enumerate(documents): doc["vector"] = generated_vectors[i] has_vectors = True - print("โœ… Added vectors to all documents") - print(f"โœ… Loaded {len(documents)} documents from {fixtures_path}") + print("Added vectors to all documents") + print(f"Loaded {len(documents)} documents from {fixtures_path}") if has_vectors: - print(" โœ“ Documents include pre-computed vectors") + print("Documents include pre-computed vectors") else: print(" โ„น๏ธ Documents will need vectors computed during benchmark") - print(f"โœ… Loaded {len(queries)} search queries from {fixtures_path}") + print(f"Loaded {len(queries)} search queries from {fixtures_path}") return documents, queries @@ -207,7 +237,7 @@ def benchmark_operation(name: str, operation: callable) -> Tuple[float, Any]: return duration, result except Exception as e: duration = time.time() - start - print(f" โŒ {name} failed: {e}") + print(f"{name} failed: {e}") return duration, None @@ -334,7 +364,7 @@ def _init_openai_embedding(self) -> Optional[Any]: return OpenAIEmbeddingAdapter(model_name="text-embedding-3-small") except Exception as e: - print(f" โš ๏ธ OpenAI embedding not available: {e}") + print(f"OpenAI embedding not 
available: {e}") return None def _init_gemini_embedding(self) -> Optional[Any]: @@ -345,7 +375,7 @@ def _init_gemini_embedding(self) -> Optional[Any]: # Use 1536 dimensions to match OpenAI for fair comparison return GeminiEmbeddingAdapter(model_name="gemini-embedding-001", dim=1536) except Exception as e: - print(f" โš ๏ธ Gemini embedding not available: {e}") + print(f"Gemini embedding not available: {e}") return None def _init_pgvector(self, embedding: Any, collection_name: str = None) -> Optional[VectorEngine]: @@ -360,7 +390,7 @@ def _init_pgvector(self, embedding: Any, collection_name: str = None) -> Optiona store_text=True, ) except (ImportError, MissingConfigError) as e: - print(f" โš ๏ธ PgVector not available: {e}") + print(f"PgVector not available: {e}") return None def _init_astradb(self, embedding: Any, collection_name: str = None) -> Optional[VectorEngine]: @@ -375,7 +405,7 @@ def _init_astradb(self, embedding: Any, collection_name: str = None) -> Optional store_text=True, ) except (ImportError, MissingConfigError) as e: - print(f" โš ๏ธ AstraDB not available: {e}") + print(f"AstraDB not available: {e}") return None def _init_milvus(self, embedding: Any, collection_name: str = None) -> Optional[VectorEngine]: @@ -390,7 +420,7 @@ def _init_milvus(self, embedding: Any, collection_name: str = None) -> Optional[ store_text=True, ) except (ImportError, MissingConfigError) as e: - print(f" โš ๏ธ Milvus not available: {e}") + print(f"Milvus not available: {e}") return None def _init_chroma(self, embedding: Any, collection_name: str = None) -> Optional[VectorEngine]: @@ -405,7 +435,7 @@ def _init_chroma(self, embedding: Any, collection_name: str = None) -> Optional[ store_text=True, ) except (ImportError, MissingConfigError) as e: - print(f" โš ๏ธ ChromaDB not available: {e}") + print(f"ChromaDB not available: {e}") return None def cleanup_collection(self, engine: VectorEngine, backend_name: str, collection_name: str = None) -> None: @@ -413,9 +443,9 @@ def 
cleanup_collection(self, engine: VectorEngine, backend_name: str, collection try: engine.drop_collection(collection_name or "benchmark_test") time.sleep(0.1) - print(f" ๐Ÿงน Cleaned up {backend_name} collection") + print(f"Cleaned up {backend_name} collection") except Exception as e: - print(f" โš ๏ธ Cleanup warning for {backend_name}: {e}") + print(f"Cleanup warning for {backend_name}: {e}") def benchmark_backend( self, @@ -440,7 +470,7 @@ def benchmark_backend( Dictionary with benchmark results """ print(f"\n{'=' * 60}") - print(f"๐Ÿ”ฅ Benchmarking: {backend_name.upper()} + {embedding_name.upper()}") + print(f"Benchmarking: {backend_name.upper()} + {embedding_name.upper()}") print(f"{'=' * 60}") # Initialize engine @@ -479,15 +509,15 @@ def benchmark_backend( # Use pre-generated documents with vectors (computed once globally) # Skip duplicate generation if pre_docs provided if not pre_docs: - print(f"\n๐Ÿ“ Generating {self.num_docs} test documents...") + print(f"\nGenerating {self.num_docs} test documents...") test_docs = generate_documents(self.num_docs) self._precompute_doc_embeddings(test_docs, embedding) else: test_docs = pre_docs.copy() - print(f"\nโœ… Using pre-generated {self.num_docs} documents (static vectors already attached)") + print(f"Using pre-generated {self.num_docs} documents (static vectors already attached)") # 1. 
Bulk Create Performance - print(f"\n1๏ธโƒฃ Upsert ({self.num_docs} docs)...") + print(f"\nUpsert ({self.num_docs} docs)...") # Use conservative batch_size to satisfy provider limits (e.g., Chroma max batch 1000) duration, upserted_docs = benchmark_operation("upsert", lambda: engine.upsert(test_docs, batch_size=100)) results["upsert"] = { @@ -495,12 +525,12 @@ def benchmark_backend( "docs_per_sec": self.num_docs / duration if duration > 0 else 0, "success": upserted_docs is not None, } - print(f" โœ… Duration: {format_duration(duration)}") - print(f" ๐Ÿ“Š {results['upsert']['docs_per_sec']:.2f} docs/sec") + print(f"Duration: {format_duration(duration)}") + print(f"{results['upsert']['docs_per_sec']:.2f} docs/sec") # 2. Individual Create Performance (small sample) sample_size = min(10, self.num_docs) - print(f"\n2๏ธโƒฃ Individual Create ({sample_size} docs)...") + print(f"\nIndividual Create ({sample_size} docs)...") individual_times = [] # Generate additional vectors for individual creates (from pre-computed static vectors) dim = getattr(embedding, "dim", 1536) @@ -520,10 +550,10 @@ def benchmark_backend( "avg_duration": avg_create, "sample_size": sample_size, } - print(f" โœ… Avg Duration: {format_duration(avg_create)}") + print(f"Avg Duration: {format_duration(avg_create)}") # 3. Vector Search Performance - print("\n3๏ธโƒฃ Vector Search (10 queries with pre-computed vectors)...") + print("\nVector Search (10 queries with pre-computed vectors)...") search_queries = [ "programming languages", "machine learning", @@ -557,12 +587,12 @@ def benchmark_backend( "avg_duration": avg_search, "queries": len(search_queries), } - print(f" โœ… Avg Duration: {format_duration(avg_search)}") - print(f" ๐Ÿ“Š {len(search_queries) / sum(search_times) if sum(search_times) > 0 else 0:.2f} queries/sec") + print(f"Avg Duration: {format_duration(avg_search)}") + print(f"{len(search_queries) / sum(search_times) if sum(search_times) > 0 else 0:.2f} queries/sec") # 4. 
Metadata-Only Search (if supported) if engine.supports_metadata_only: - print("\n4๏ธโƒฃ Metadata Search (10 queries)...") + print("\nMetadata Search (10 queries)...") metadata_times = [] for i in range(10): duration, _ = benchmark_operation( @@ -579,13 +609,13 @@ def benchmark_backend( "queries": len(metadata_times), "supported": True, } - print(f" โœ… Avg Duration: {format_duration(avg_metadata)}") + print(f" Avg Duration: {format_duration(avg_metadata)}") else: results["metadata_search"] = {"supported": False} - print("\n4๏ธโƒฃ Metadata Search: Not supported") + print("\nMetadata Search: Not supported") # 4.5. Query DSL Operators Test (using Q objects) - print("\n4๏ธโƒฃ.5 Query DSL Operators (Q objects)...") + print("\nQuery DSL Operators (Q objects)...") from crossvector.querydsl import Q # For slow backends (astradb, milvus), test fewer operators @@ -610,7 +640,7 @@ def benchmark_backend( ), ), ] - print(" โ„น๏ธ Testing 4 key operators (slow backend optimization)") + print("Testing 4 key operators (slow backend optimization)") else: # Test all operators for fast backends operator_tests = [ @@ -654,7 +684,7 @@ def benchmark_backend( operator_times.append(duration) successful_operators += 1 except Exception as e: - print(f" โš ๏ธ Operator {op_name} skipped: {e}") + print(f" Operator {op_name} skipped: {e}") if operator_times: avg_operator = sum(operator_times) / len(operator_times) @@ -664,13 +694,13 @@ def benchmark_backend( "total_operators": len(operator_tests), } print( - f" โœ… Avg Duration: {format_duration(avg_operator)} ({successful_operators}/{len(operator_tests)} operators)" + f"Avg Duration: {format_duration(avg_operator)} ({successful_operators}/{len(operator_tests)} operators)" ) else: results["query_dsl_operators"] = {"supported": False} # 5. 
Update Performance (use all docs) - print(f"\n5๏ธโƒฃ Update Operations ({self.num_docs} updates)...") + print(f"\nUpdate Operations ({self.num_docs} updates)...") update_sample = min(self.num_docs, len(upserted_docs) if upserted_docs else 0) if upserted_docs and update_sample > 0: update_times = [] @@ -686,12 +716,12 @@ def benchmark_backend( "avg_duration": avg_update, "sample_size": update_sample, } - print(f" โœ… Avg Duration: {format_duration(avg_update)}") + print(f" Avg Duration: {format_duration(avg_update)}") else: results["update"] = {"error": "No documents to update"} # 6. Delete Performance (all docs, batched) - print(f"\n6๏ธโƒฃ Delete Operations ({self.num_docs} deletes)...") + print(f"\nDelete Operations ({self.num_docs} deletes)...") delete_sample = min(self.num_docs, len(upserted_docs) if upserted_docs else 0) if upserted_docs and delete_sample > 0: batch_size = 100 @@ -707,18 +737,18 @@ def benchmark_backend( "sample_size": delete_sample, "docs_per_sec": delete_sample / total_duration if total_duration > 0 else 0, } - print(f" โœ… Duration: {format_duration(total_duration)}") - print(f" ๐Ÿ“Š {results['delete']['docs_per_sec']:.2f} docs/sec") + print(f"Duration: {format_duration(total_duration)}") + print(f"{results['delete']['docs_per_sec']:.2f} docs/sec") else: results["delete"] = {"error": "No documents to delete"} # 7. 
Count operation remaining_count = engine.count() results["final_count"] = remaining_count - print(f"\n๐Ÿ“Š Final document count: {remaining_count}") + print(f"\nFinal document count: {remaining_count}") except Exception as e: - print(f"\nโŒ Benchmark failed: {e}") + print(f"\nBenchmark failed: {e}") results["error"] = str(e) finally: # Cleanup - try to drop collection (gracefully ignore if fails) @@ -752,7 +782,7 @@ def batch_search( results = engine.search(query=vector, limit=search_limit) all_results.extend(results) except Exception as e: - print(f" โš ๏ธ Query failed: {e}") + print(f" Query failed: {e}") elapsed = time.time() - start_time return elapsed, all_results @@ -767,12 +797,12 @@ def run_all( query_vectors: Pre-computed search query vectors keyed by embedding provider """ print(f"\n{'=' * 60}") - print("๐Ÿš€ CrossVector Benchmark Suite") + print("CrossVector Benchmark Suite") print(f"{'=' * 60}") - print(f"๐Ÿ“Š Documents per test: {self.num_docs}") - print(f"๐ŸŽฏ Backends: {', '.join(self.backends.keys())}") - print(f"๐Ÿค– Embeddings: {', '.join(self.embedding_providers.keys())}") - print(f"โฐ Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") + print(f"Documents per test: {self.num_docs}") + print(f"Backends: {', '.join(self.backends.keys())}") + print(f"Embeddings: {', '.join(self.embedding_providers.keys())}") + print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") for emb_name, emb_init_func in self.embedding_providers.items(): embedding = emb_init_func() @@ -803,7 +833,7 @@ def run_all( except Exception as e: # Skip failed backends gracefully instead of crashing error_msg = str(e)[:100] - print(f"\nโš ๏ธ Skipping {backend_name}_{emb_name}: {error_msg}...") + print(f"\nSkipping {backend_name}_{emb_name}: {error_msg}...") self.results[result_key] = {"error": error_msg} return self.results @@ -839,7 +869,7 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: f.write(f"**Test Results:** 
{success_tests}/{total_tests} passed") if error_tests > 0: - f.write(f", {error_tests} โŒ failed\n\n") + f.write(f", {error_tests} failed\n\n") else: f.write("\n\n") @@ -855,7 +885,7 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: backend = result.get("backend", result_key.split("_")[0]) embedding = result.get("embedding", result_key.split("_")[1] if "_" in result_key else "unknown") error_msg = result["error"][:50] + "..." if len(result["error"]) > 50 else result["error"] - f.write(f"| {backend} | {embedding} | - | - | - | - | - | - | โŒ {error_msg} |\n") + f.write(f"| {backend} | {embedding} | - | - | - | - | - | - | ERROR: {error_msg} |\n") continue backend = result.get("backend", "unknown") @@ -879,11 +909,11 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: else format_duration(delete_entry.get("duration", 0)) ) - status_icon = "โœ…" + status_icon = "OK" if (isinstance(update_entry, dict) and "error" in update_entry) or ( isinstance(delete_entry, dict) and "error" in delete_entry ): - status_icon = "โš ๏ธ" + status_icon = "WARNING" f.write( f"| {backend} | {embedding} | {model} | {dim} | {bulk_create} | {search} | {update} | {delete} | {status_icon} |\n" @@ -898,7 +928,7 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: f.write(f"## {backend.upper()} + {embedding.upper()} Details\n\n") if "error" in result: - f.write(f"โŒ **Error:** {result['error']}\n\n") + f.write(f"**Error:** {result['error']}\n\n") continue # Embedding info @@ -982,7 +1012,7 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: # Error Summary Section error_results = {k: v for k, v in self.results.items() if "error" in v} if error_results: - f.write("## Failed Tests โŒ\n\n") + f.write("## Failed Tests\n\n") for result_key, result in error_results.items(): backend = result.get("backend", result_key.split("_")[0]) embedding = result.get("embedding", 
result_key.split("_")[1] if "_" in result_key else "unknown") @@ -997,7 +1027,27 @@ def generate_markdown_report(self, output_file: str = "benchmark.md") -> None: f.write("- Times are averaged over multiple runs for stability\n") f.write("- Different embedding providers may have different dimensions and performance characteristics\n") - print(f"\n๐Ÿ“„ Markdown report saved to: {output_path}") + f.write("\n## Important: Benchmark Results Interpretation\n\n") + f.write("**PgVector Local vs Cloud Backends:**\n") + f.write("- **PgVector results**: Benchmarked against LOCAL PostgreSQL instance\n") + f.write(" - Provides optimal latency with minimal network overhead\n") + f.write(" - Results are NOT directly comparable with cloud-hosted PgVector\n") + f.write("- **Cloud Backends** (AstraDB, Milvus, ChromaDB): Performance affected by:\n") + f.write(" - Network latency and geographic region\n") + f.write(" - Regional proximity between client and server\n") + f.write(" - Network conditions and bandwidth availability\n") + f.write(" - Server load and resource allocation\n\n") + f.write("**For Fair Comparison:**\n") + f.write("- Deploy all backends in the SAME REGION and NETWORK ENVIRONMENT\n") + f.write("- Conduct benchmarks in YOUR PRODUCTION ENVIRONMENT with real network conditions\n") + f.write("- Account for network latency when interpreting and comparing results\n") + f.write("- PgVector: Consider cloud-hosted options for fair comparison (e.g., AWS RDS, Azure Database)\n\n") + f.write("**Recommendation:**\n") + f.write("These results are specific to this test environment. 
For your use case, run benchmarks\n") + f.write("in your actual production deployment with backends in the same region and network\n") + f.write("conditions to get accurate, meaningful performance metrics.\n") + + print(f"\nMarkdown report saved to: {output_path}") def main(): @@ -1065,24 +1115,24 @@ def main(): if not fixture_path.exists(): print(f"\n{'=' * 60}") - print(f"๐Ÿ“ Fixture not found: {fixture_path}") - print(f"๐Ÿš€ Auto-generating fixtures with {embedding_provider.upper()} embeddings...") + print(f"Fixture not found: {fixture_path}") + print(f"Auto-generating fixtures with {embedding_provider.upper()} embeddings...") print(f"{'=' * 60}\n") # Generate fixtures directly using imported functions try: # Generate documents and queries (same for all providers) num_queries = min(args.num_docs // 10, 100) - print(f"๐Ÿ“ Generating {args.num_docs:,} documents with nested metadata...") + print(f"Generating {args.num_docs:,} documents with nested metadata...") docs = generate_benchmark_docs(args.num_docs, seed=42) - print(f"โœ… Generated {len(docs):,} documents\n") + print(f"Generated {len(docs):,} documents\n") - print(f"๐Ÿ” Generating {num_queries:,} diverse search queries...") + print(f"Generating {num_queries:,} diverse search queries...") queries = generate_search_queries(num_queries, seed=42) - print(f"โœ… Generated {len(queries):,} queries\n") + print(f"Generated {len(queries):,} queries\n") # Generate vectors using embedding provider - print(f"๐Ÿค– Generating vectors using {embedding_provider.upper()} embedding...") + print(f"Generating vectors using {embedding_provider.upper()} embedding...") if embedding_provider == "openai": from crossvector.embeddings.openai import OpenAIEmbeddingAdapter @@ -1092,8 +1142,8 @@ def main(): embedding_adapter = GeminiEmbeddingAdapter(model_name="gemini-embedding-001", dim=1536) - print(f" Model: {embedding_adapter.model_name}") - print(f" Dimension: {embedding_adapter.dim}") + print(f"Model: 
{embedding_adapter.model_name}") + print(f"Dimension: {embedding_adapter.dim}") # Generate vectors in batches batch_size = 500 @@ -1102,13 +1152,13 @@ def main(): batch = docs[i : i + batch_size] batch_texts = [doc["text"] for doc in batch] print( - f" Processing batch {i // batch_size + 1}/{(total_docs + batch_size - 1) // batch_size} ({len(batch)} docs)..." + f"Processing batch {i // batch_size + 1}/{(total_docs + batch_size - 1) // batch_size} ({len(batch)} docs)..." ) vectors = embedding_adapter.get_embeddings(batch_texts) for doc, vector in zip(batch, vectors): doc["vector"] = vector - print(f"โœ… Generated {total_docs:,} vectors using {embedding_provider.upper()}\n") + print(f"Generated {total_docs:,} vectors using {embedding_provider.upper()}\n") # Save to file fixture_path.parent.mkdir(parents=True, exist_ok=True) @@ -1132,11 +1182,11 @@ def main(): json.dump(fixtures, f, indent=2) file_size_mb = fixture_path.stat().st_size / 1024 / 1024 - print(f"โœ… Fixtures saved: {fixture_path} ({file_size_mb:.1f} MB)\n") + print(f"Fixtures saved: {fixture_path} ({file_size_mb:.1f} MB)\n") except Exception as e: - print(f"โš ๏ธ Failed to generate fixtures for {embedding_provider}: {e}") - print(" Will try to continue with other providers...\n") + print(f"Failed to generate fixtures for {embedding_provider}: {e}") + print("Will try to continue with other providers...\n") # Load test documents from fixtures (if using --use-fixtures) # Otherwise, fixtures were already generated above for each provider @@ -1147,24 +1197,24 @@ def main(): if args.use_fixtures: fixture_path = Path(args.use_fixtures) if fixture_path.exists(): - print(f"๐Ÿ“‚ Loading fixtures from {fixture_path}...") + print(f"Loading fixtures from {fixture_path}...") try: test_docs, search_queries = load_fixtures_from_file( str(fixture_path), args.num_docs, add_vectors=args.add_vectors ) print(f"{'=' * 60}") except (FileNotFoundError, json.JSONDecodeError) as e: - print(f"โŒ Failed to load fixtures: {e}") + 
print(f"Failed to load fixtures: {e}") print(f"{'=' * 60}") return else: - print(f"โŒ Fixtures file not found: {fixture_path}") + print(f"Fixtures file not found: {fixture_path}") print(f"{'=' * 60}") return else: # Fixtures already generated per provider above # We'll load them per-provider in the benchmark loop - print("๐Ÿ“ Fixtures generated for each embedding provider") + print("Fixtures generated for each embedding provider") print(f"{'=' * 60}") # Create BenchmarkRunner instance @@ -1188,7 +1238,7 @@ def main(): docs, queries = load_fixtures_from_file(str(fixture_path), args.num_docs, add_vectors=False) fixtures_by_provider[emb_name] = {"docs": docs, "queries": queries} except Exception as e: - print(f"โš ๏ธ Failed to load fixtures for {emb_name}: {e}") + print(f"Failed to load fixtures for {emb_name}: {e}") else: # Use the same fixtures for all providers (from --use-fixtures) for emb_name in runner.embedding_providers.keys(): @@ -1203,17 +1253,17 @@ def main(): dim = getattr(embedding, "dim", 1536) queries_list = fixtures_by_provider[emb_name]["queries"] query_vectors_by_embedding[emb_name] = runner._precompute_query_vectors(queries_list, dim) - print(f"โœ… Pre-computed query vectors for {emb_name} (dim={dim})") + print(f"Pre-computed query vectors for {emb_name} (dim={dim})") # Run benchmarks with fixtures per provider # Modified run_all to handle per-provider fixtures print(f"\n{'=' * 60}") - print("๐Ÿš€ CrossVector Benchmark Suite") + print("CrossVector Benchmark Suite") print(f"{'=' * 60}") - print(f"๐Ÿ“Š Documents per test: {args.num_docs}") - print(f"๐ŸŽฏ Backends: {', '.join(runner.backends.keys())}") - print(f"๐Ÿค– Embeddings: {', '.join(runner.embedding_providers.keys())}") - print(f"โฐ Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") + print(f"Documents per test: {args.num_docs}") + print(f"Backends: {', '.join(runner.backends.keys())}") + print(f"Embeddings: {', '.join(runner.embedding_providers.keys())}") + print(f"Started: 
{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") for emb_name, emb_init_func in runner.embedding_providers.items(): embedding = emb_init_func() @@ -1246,14 +1296,14 @@ def main(): except Exception as e: # Skip failed backends gracefully instead of crashing error_msg = str(e)[:100] - print(f"\nโš ๏ธ Skipping {backend_name}_{emb_name}: {error_msg}...") + print(f"\nSkipping {backend_name}_{emb_name}: {error_msg}...") runner.results[result_key] = {"error": error_msg} # Generate report runner.generate_markdown_report(output_file=args.output) print(f"\n{'=' * 60}") - print("โœ… Benchmark completed!") + print("Benchmark completed!") print(f"{'=' * 60}\n")
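---

For context on the error-handling changes above: the `run.py` hunks route every benchmark step through a small timing helper, `benchmark_operation`, whose `except` branch now prints the plain `"{name} failed: {e}"` message. The pattern boils down to the following self-contained sketch (the `square`/`boom` operations are illustrative stand-ins, not part of the benchmark suite):

```python
import time
from typing import Any, Callable, Tuple


def benchmark_operation(name: str, operation: Callable[[], Any]) -> Tuple[float, Any]:
    """Time a single operation; on failure, report it and return (duration, None)."""
    start = time.time()
    try:
        result = operation()
        return time.time() - start, result
    except Exception as e:
        # Failures are reported but never raised, so one bad backend
        # call cannot abort the rest of the benchmark run.
        print(f"{name} failed: {e}")
        return time.time() - start, None


# Successful operation: returns the elapsed time and the result
duration, result = benchmark_operation("square", lambda: 21 * 2)
print(result)  # 42

# Failing operation: still returns a duration, with None as the result
duration, result = benchmark_operation("boom", lambda: 1 / 0)
print(result)  # None
```

Because failures surface as `(duration, None)` rather than exceptions, each step in `benchmark_backend` can record a duration and move on, which is what lets the report mark individual operations as failed instead of losing the whole backend's results.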