diff --git a/huggingface/gsi/frontmatter.md b/huggingface/gsi/frontmatter.md deleted file mode 100644 index 62d17118..00000000 --- a/huggingface/gsi/frontmatter.md +++ /dev/null @@ -1,21 +0,0 @@ ---- -# frontmatter -path: "/tutorial-huggingface-couchbase-vector-search-with-global-secondary-index" -title: Using Hugging Face Embeddings with Couchbase Vector Search with GSI -short_title: Hugging Face with Couchbase Vector Search with GSI -description: - - Learn how to generate embeddings using Hugging Face and store them in Couchbase. - - This tutorial demonstrates how to use Couchbase's vector search capabilities with Hugging Face embeddings. - - You'll understand how to perform vector search to find relevant documents based on similarity with GSI. -content_type: tutorial -filter: sdk -technology: - - vector search -tags: - - GSI - - Artificial Intelligence - - Hugging Face -sdk_language: - - python -length: 30 Mins ---- diff --git a/huggingface/fts/.env.sample b/huggingface/query_based/.env.sample similarity index 100% rename from huggingface/fts/.env.sample rename to huggingface/query_based/.env.sample diff --git a/huggingface/query_based/frontmatter.md b/huggingface/query_based/frontmatter.md new file mode 100644 index 00000000..569533d6 --- /dev/null +++ b/huggingface/query_based/frontmatter.md @@ -0,0 +1,23 @@ +--- +# frontmatter +path: "/tutorial-huggingface-couchbase-vector-search-with-hyperscale-or-composite-vector-index" +alt_paths: ["/tutorial-huggingface-couchbase-vector-search-with-hyperscale-vector-index", "/tutorial-huggingface-couchbase-vector-search-with-composite-vector-index"] +title: Using Hugging Face Embeddings with Couchbase Hyperscale and Composite Vector Index +short_title: Hugging Face with Couchbase Hyperscale & Composite Index +description: + - Learn how to generate embeddings using Hugging Face and store them in Couchbase. + - This tutorial demonstrates how to use Couchbase's vector search capabilities with Hugging Face embeddings using Hyperscale and Composite Vector Indexes. + - You'll understand how to perform high-performance vector search to find relevant documents based on similarity. +content_type: tutorial +filter: sdk +technology: + - vector search +tags: + - Hyperscale Vector Index + - Composite Vector Index + - Artificial Intelligence + - Hugging Face +sdk_language: + - python +length: 30 Mins +--- diff --git a/huggingface/gsi/hugging_face.ipynb b/huggingface/query_based/hugging_face.ipynb similarity index 79% rename from huggingface/gsi/hugging_face.ipynb rename to huggingface/query_based/hugging_face.ipynb index 79d501e2..4d0783ac 100644 --- a/huggingface/gsi/hugging_face.ipynb +++ b/huggingface/query_based/hugging_face.ipynb @@ -5,29 +5,23 @@ "id": "251b8fa3", "metadata": {}, "source": [ - "# Semantic Search with Couchbase GSI Vector Search and Hugging Face" + "## Introduction" ] }, { "cell_type": "markdown", "id": "48c7d51d", "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "id": "26c2adca", - "metadata": {}, "source": [ "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Hugging Face](https://huggingface.co/) as the AI-powered embedding model provider. 
Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.\n", "\n", - "This tutorial demonstrates how to leverage Couchbase's **Global Secondary Index (GSI) vector search capabilities** with Hugging Face embeddings to create a high-performance semantic search system. GSI vector search in Couchbase offers significant advantages over traditional FTS (Full-Text Search) approaches, particularly for vector-first workloads and scenarios requiring complex filtering with high query-per-second (QPS) performance.\n", + "This tutorial demonstrates how to leverage Couchbase's **Hyperscale and Composite Vector Indexes** with Hugging Face embeddings to create a high-performance semantic search system. These vector indexes offer significant advantages over Search Vector Index approaches, particularly for vector-first workloads and scenarios requiring complex filtering with high query-per-second (QPS) performance.\n", + "\n", + "For more information on Hyperscale and Composite Vector Indexes, see the [Couchbase Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", "\n", "This guide is designed to be comprehensive yet accessible, with clear step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system. Whether you're building a recommendation engine, content discovery platform, or any application requiring intelligent document retrieval, this tutorial provides the foundation you need.\n", "\n", - "**Note**: If you want to perform semantic search using the FTS (Full-Text Search) index instead, please take a look at [this alternative approach](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-fts)." + "**Note**: If you want to perform semantic search using the Search Vector Index instead, please take a look at [this alternative tutorial](https://developer.couchbase.com/tutorial-huggingface-couchbase-vector-search-with-search-vector-index).\n" ] }, { @@ -43,7 +37,7 @@ "id": "ac55ceff", "metadata": {}, "source": [ - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/gsi/hugging_face.ipynb).\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/query_based/hugging_face.ipynb).\n", "\n", "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." 
] @@ -71,7 +65,7 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install --quiet langchain-couchbase==0.5.0 transformers==4.56.1 sentence_transformers==5.1.0 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets" + "!pip install --quiet langchain-couchbase==1.0.1 transformers==4.56.1 sentence_transformers==5.1.0 langchain_huggingface python-dotenv==1.1.1 ipywidgets" ] }, { @@ -84,7 +78,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "b06f138a", "metadata": {}, "outputs": [], @@ -137,7 +131,7 @@ "source": [ "**Version Requirements:**\n", "- **Couchbase Server 8.0+** or **Couchbase Capella** with Query Service enabled\n", - "- Note: GSI vector search is a newer feature that requires Couchbase Server 8.0 or above, unlike FTS-based vector search which works with 7.6+\n", + "- Note: Hyperscale and Composite Vector Indexes require Couchbase Server 8.0 or above, unlike Search Vector Index which works with 7.6+\n", "\n", "**Access Requirements:**\n", "- A configured Bucket, Scope, and Collection\n", @@ -197,15 +191,14 @@ "source": [ "- **Python 3.8+** \n", "- Required Python packages (installed via pip in the next section):\n", - " - `langchain-couchbase==0.5.0rc1`\n", + " - `langchain-couchbase==1.0.1`\n", " - `transformers==4.56.1` \n", - " - `sentence_transformers==5.1.0`\n", - " - `langchain_huggingface==0.3.1`" + " - `sentence_transformers==5.1.0`\n" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "464ff771", "metadata": {}, "outputs": [], @@ -248,7 +241,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "e0b131ed", "metadata": {}, "outputs": [], @@ -277,7 +270,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "id": "61bd6302", "metadata": {}, "outputs": [ @@ -306,7 +299,7 @@ "id": "8c546353", "metadata": {}, "source": [ - "## Understanding GSI Vector Search" + "## Understanding Hyperscale and Composite Vector Indexes" ] }, { @@ -314,7 +307,7 @@ "id": "154912ee", "metadata": {}, "source": [ - "### Optimizing Vector Search with Global Secondary Index (GSI)" + "### Optimizing Vector Search with Hyperscale and Composite Indexes" ] }, { @@ -322,7 +315,7 @@ "id": "a1de6df3", "metadata": {}, "source": [ - "With Couchbase 8.0+, you can leverage the power of GSI-based vector search, which offers significant performance improvements over traditional Full-Text Search (FTS) approaches for vector-first workloads. GSI vector search provides high-performance vector similarity search with advanced filtering capabilities and is designed to scale to billions of vectors." + "With Couchbase 8.0+, you can leverage the power of Hyperscale and Composite Vector Indexes, which offer significant performance improvements over Search Vector Index approaches for vector-first workloads. These indexes provide high-performance vector similarity search with advanced filtering capabilities and are designed to scale to billions of vectors." 
] }, { @@ -330,7 +323,7 @@ "id": "81e22ebc", "metadata": {}, "source": [ - "#### GSI vs FTS: Choosing the Right Approach" + "#### Hyperscale/Composite vs Search Vector Index: Choosing the Right Approach" ] }, { @@ -338,12 +331,12 @@ "id": "e7259e62", "metadata": {}, "source": [ - "| Feature | GSI Vector Search | FTS Vector Search |\n", + "| Feature | Hyperscale & Composite Vector Index | Search Vector Index |\n", "| --------------------- | --------------------------------------------------------------- | ----------------------------------------- |\n", "| **Best For** | Vector-first workloads, complex filtering, high QPS performance| Hybrid search and high recall rates |\n", "| **Couchbase Version** | 8.0.0+ | 7.6+ |\n", - "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (BHIVE) | Pre-filtering with flexible ordering |\n", - "| **Scalability** | Up to billions of vectors (BHIVE) | Up to 10 million vectors |\n", + "| **Filtering** | Pre-filtering with `WHERE` clauses (Composite) or post-filtering (Hyperscale) | Pre-filtering with flexible ordering |\n", + "| **Scalability** | Up to billions of vectors (Hyperscale) | Up to 10 million vectors |\n", "| **Performance** | Optimized for concurrent operations with low memory footprint | Good for mixed text and vector queries |" ] }, @@ -352,7 +345,7 @@ "id": "3b5ed398", "metadata": {}, "source": [ - "#### GSI Vector Index Types" + "#### Vector Index Types" ] }, { @@ -360,7 +353,7 @@ "id": "855bf6c8", "metadata": {}, "source": [ - "Couchbase offers two distinct GSI vector index types, each optimized for different use cases:" + "Couchbase offers two distinct query-based vector index types, each optimized for different use cases:" ] }, { @@ -368,7 +361,7 @@ "id": "5c6789c3", "metadata": {}, "source": [ - "##### Hyperscale Vector Indexes (BHIVE)" + "##### Hyperscale Vector Indexes" ] }, { @@ -402,7 +395,7 @@ "- **Use when**: Your queries combine vector similarity with scalar filters that eliminate large portions of data\n", "- **Features**: \n", " - Efficient pre-filtering where scalar attributes reduce the vector comparison scope\n", - " - Best for well-defined workloads requiring complex filtering using GSI features\n", + " - Best for well-defined workloads requiring complex filtering\n", " - Supports range lookups combined with vector search" ] }, @@ -419,14 +412,14 @@ "id": "4ac316b5", "metadata": {}, "source": [ - "In this tutorial, we'll demonstrate creating a **BHIVE index** and running vector similarity queries using GSI. BHIVE is ideal for semantic search scenarios where you want:\n", + "In this tutorial, we'll demonstrate creating a **Hyperscale Vector Index** and running vector similarity queries. Hyperscale is ideal for semantic search scenarios where you want:\n", "\n", "1. **High-performance vector search** across large datasets\n", "2. **Low latency** for real-time applications\n", "3. **Scalability** to handle growing vector collections\n", "4. **Concurrent operations** for multi-user environments\n", "\n", - "The BHIVE index will provide optimal performance for our Hugging Face embedding-based semantic search implementation." + "The Hyperscale Vector Index will provide optimal performance for our Hugging Face embedding-based semantic search implementation." 
] }, { @@ -467,7 +460,7 @@ "id": "4b9588f3", "metadata": {}, "source": [ - "#### Understanding GSI Index Configuration (Couchbase 8.0 Feature)" + "#### Understanding Index Configuration (Couchbase 8.0 Feature)" ] }, { @@ -475,7 +468,7 @@ "id": "528616ee", "metadata": {}, "source": [ - "Before creating our BHIVE index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization." + "Before creating our Hyperscale index, it's important to understand the configuration parameters that optimize vector storage and search performance. The `index_description` parameter controls how Couchbase optimizes vector storage through centroids and quantization." ] }, { @@ -547,7 +540,7 @@ "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "For more information on GSI vector indexes, see [Couchbase GSI Vector Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html)." + "For more information on Hyperscale and Composite Vector Indexes, see [Couchbase Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html)." ] }, { @@ -572,12 +565,12 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "31e83fa4", "metadata": {}, "outputs": [], "source": [ - "# Create a BHIVE GSI vector index (good default: IVF,SQ8)\n", + "# Create a Hyperscale Vector Index store (good default: IVF,SQ8)\n", "vector_store = CouchbaseQueryVectorStore(\n", " cluster=cluster,\n", " bucket_name=couchbase_bucket,\n", @@ -630,7 +623,7 @@ "1. **Text Preprocessing**: The input text is preprocessed and tokenized according to the Hugging Face model's requirements\n", "2. **Vector Generation**: Each document is converted into a high-dimensional vector (embedding) that captures its semantic meaning\n", "3. **Storage**: The embeddings are stored in Couchbase along with the original text and any metadata\n", - "4. **Indexing**: The vectors are indexed using our BHIVE GSI index for efficient similarity search" + "4. **Indexing**: The vectors are indexed using our Hyperscale Vector Index for efficient similarity search" ] }, { @@ -649,7 +642,7 @@ "In this example, we're adding sample documents that demonstrate Couchbase's capabilities. The system will:\n", "- Generate embeddings for each text document using the Hugging Face model\n", "- Store them in our Couchbase collection\n", - "- Make them immediately available for semantic search once the GSI index is ready\n", + "- Make them immediately available for semantic search once the Hyperscale Vector Index is ready\n", "\n", "**Note**: The `batch_size` parameter controls how many documents are processed together, which can help optimize performance for large document sets." ] @@ -697,11 +690,11 @@ "source": [ "Now let's demonstrate the performance benefits of different optimization approaches available in Couchbase. We'll compare three optimization levels to show how each contributes to building a production-ready semantic search system:\n", "\n", - "1. **Baseline (Raw Search)**: Basic vector similarity search without GSI optimization\n", - "2. **GSI-Optimized Search**: High-performance search using BHIVE GSI index\n", + "1. 
**Baseline (Raw Search)**: Basic vector similarity search without Hyperscale optimization\n", + "2. **Optimized Search**: High-performance search using Hyperscale Vector Index\n", "3. **Cache Benefits**: Show how caching can be applied on top of any search approach\n", "\n", - "**Important**: Caching is orthogonal to index types - you can apply caching benefits to both raw searches and GSI-optimized searches to improve repeated query performance." + "**Important**: Caching is orthogonal to index types - you can apply caching benefits to both raw searches and optimized searches to improve repeated query performance." ] }, { @@ -748,7 +741,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "1fb710ff", "metadata": { "lines_to_next_cell": 1 @@ -794,12 +787,12 @@ "id": "07579883", "metadata": {}, "source": [ - "First, let's establish baseline performance with raw vector search - no GSI optimization yet:" + "First, let's establish baseline performance with raw vector search - no Hyperscale optimization yet:" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "id": "fd30f9c9", "metadata": {}, "outputs": [ @@ -807,7 +800,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Testing baseline performance without GSI optimization...\n", + "Testing baseline performance without Hyperscale optimization...\n", "\n", "=== PHASE 1: BASELINE VECTOR SEARCH ===\n", "Query: \"What are the key features of a scalable NoSQL database?\"\n", @@ -816,11 +809,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.586197 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.645435 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.976888 (lower = more similar)\n", @@ -830,7 +823,7 @@ ], "source": [ "test_query = \"What are the key features of a scalable NoSQL database?\"\n", - "print(\"Testing baseline performance without GSI optimization...\")\n", + "print(\"Testing baseline performance without Hyperscale optimization...\")\n", "baseline_time, baseline_results = search_with_performance_metrics(\n", " test_query, \"Phase 1: Baseline Vector Search\"\n", ")" @@ -841,7 +834,7 @@ "id": "0486ab77", "metadata": {}, "source": [ - "### Phase 2: Create BHIVE GSI Index and Test Performance" + "### Phase 2: Create Hyperscale Vector Index and Test Performance" ] }, { @@ -849,12 +842,12 @@ "id": "5758586f", "metadata": {}, "source": [ - "Now let's create the BHIVE GSI index and measure the performance improvement:" + "Now let's create the Hyperscale Vector Index and measure the performance improvement:" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 9, "id": "9e2fa28e", "metadata": {}, 
"outputs": [ @@ -862,24 +855,24 @@ "name": "stdout", "output_type": "stream", "text": [ - "Creating BHIVE GSI vector index...\n", - "✓ BHIVE GSI vector index created successfully!\n", + "Creating Hyperscale Vector Index...\n", + "✓ Hyperscale Vector Index created successfully!\n", "Waiting for index to become available...\n", "\n", - "Testing performance with BHIVE GSI optimization...\n", + "Testing performance with Hyperscale optimization...\n", "\n", - "=== PHASE 2: GSI-OPTIMIZED SEARCH ===\n", + "=== PHASE 2: OPTIMIZED SEARCH ===\n", "Query: \"What are the key features of a scalable NoSQL database?\"\n", "Search Time: 0.0848 seconds\n", "Results Found: 3 documents\n", "\n", "[Result 1]\n", "Vector Distance: 0.586197 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.645435 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.976888 (lower = more similar)\n", @@ -888,16 +881,16 @@ } ], "source": [ - "# Create BHIVE index for optimized vector search\n", - "print(\"Creating BHIVE GSI vector index...\")\n", + "# Create Hyperscale index for optimized vector search\n", + "print(\"Creating Hyperscale Vector Index...\")\n", "try:\n", " vector_store.create_index(\n", - " index_type=IndexType.BHIVE,\n", + " index_type=IndexType.HYPERSCALE,\n", " index_description=\"IVF,SQ8\",\n", " distance_metric=DistanceStrategy.COSINE,\n", - " index_name=\"huggingface_bhive_index\",\n", + " index_name=\"huggingface_hyperscale_index\",\n", " )\n", - " print(\"✓ BHIVE GSI vector index created successfully!\")\n", + " print(\"✓ Hyperscale Vector Index created successfully!\")\n", " \n", " # Wait for index to become available\n", " print(\"Waiting for index to become available...\")\n", @@ -905,14 +898,14 @@ " \n", "except Exception as e:\n", " if \"already exists\" in str(e).lower():\n", - " print(\"✓ BHIVE GSI vector index already exists, proceeding...\")\n", + " print(\"✓ Hyperscale Vector Index already exists, proceeding...\")\n", " else:\n", - " print(f\"Error creating GSI index: {str(e)}\")\n", + " print(f\"Error creating Hyperscale index: {str(e)}\")\n", "\n", - "# Test the same query with GSI optimization\n", - "print(\"\\nTesting performance with BHIVE GSI optimization...\")\n", - "gsi_time, gsi_results = search_with_performance_metrics(\n", - " test_query, \"Phase 2: GSI-Optimized Search\"\n", + "# Test the same query with Hyperscale optimization\n", + "print(\"\\nTesting performance with Hyperscale optimization...\")\n", + "optimized_time, optimized_results = search_with_performance_metrics(\n", + " test_query, \"Phase 2: Optimized Search\"\n", ")" ] }, @@ -929,12 +922,12 @@ "id": "f6d312ab", "metadata": {}, "source": [ - "Now let's show how caching 
can improve performance for repeated queries. **Note**: Caching benefits apply to both raw searches and GSI-optimized searches." + "Now let's show how caching can improve performance for repeated queries. **Note**: Caching benefits apply to both raw searches and optimized searches." ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 10, "id": "d52edb51", "metadata": {}, "outputs": [ @@ -955,11 +948,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.632770 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.677951 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "Second execution (cache hit):\n", "\n", @@ -970,11 +963,11 @@ "\n", "[Result 1]\n", "Vector Distance: 0.632770 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n", + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n", "\n", "[Result 2]\n", "Vector Distance: 0.677951 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n" + "Document Content: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n" ] } ], @@ -1023,7 +1016,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 11, "id": "640a1bbe", "metadata": {}, "outputs": [ @@ -1036,7 +1029,7 @@ "VECTOR SEARCH PERFORMANCE OPTIMIZATION SUMMARY\n", "================================================================================\n", "Phase 1 - Baseline (Raw Search): 0.1484 seconds\n", - "Phase 2 - GSI-Optimized Search: 0.0848 seconds\n", + "Phase 2 - Optimized Search: 0.0848 seconds\n", "Phase 3 - Cache Benefits:\n", " First execution (cache miss): 0.1024 seconds\n", " Second execution (cache hit): 0.0289 seconds\n", @@ -1044,14 +1037,14 @@ "--------------------------------------------------------------------------------\n", "OPTIMIZATION IMPACT ANALYSIS:\n", "--------------------------------------------------------------------------------\n", - "GSI Index Benefit: 1.75x faster (42.8% improvement)\n", + "Vector Index Benefit: 1.75x faster (42.8% improvement)\n", "Cache Benefit: 3.55x faster (71.8% improvement)\n", "\n", "Key Insights:\n", - "• GSI optimization provides consistent 
performance benefits, especially with larger datasets\n", - "• Caching benefits apply to both raw and GSI-optimized searches\n", - "• Combined GSI + Cache provides the best performance for production applications\n", - "• BHIVE indexes scale to billions of vectors with optimized concurrent operations\n" + "• Hyperscale optimization provides consistent performance benefits, especially with larger datasets\n", + "• Caching benefits apply to both raw and optimized searches\n", + "• Combined Hyperscale + Cache provides the best performance for production applications\n", + "• Hyperscale indexes scale to billions of vectors with optimized concurrent operations\n" ] } ], @@ -1061,7 +1054,7 @@ "print(\"=\"*80)\n", "\n", "print(f\"Phase 1 - Baseline (Raw Search): {baseline_time:.4f} seconds\")\n", - "print(f\"Phase 2 - GSI-Optimized Search: {gsi_time:.4f} seconds\")\n", + "print(f\"Phase 2 - Optimized Search: {optimized_time:.4f} seconds\")\n", "print(f\"Phase 3 - Cache Benefits:\")\n", "print(f\" First execution (cache miss): {cache_time_1:.4f} seconds\")\n", "print(f\" Second execution (cache hit): {cache_time_2:.4f} seconds\")\n", @@ -1070,13 +1063,13 @@ "print(\"OPTIMIZATION IMPACT ANALYSIS:\")\n", "print(\"-\"*80)\n", "\n", - "# GSI improvement analysis\n", - "if gsi_time and baseline_time and gsi_time < baseline_time:\n", - " gsi_speedup = baseline_time / gsi_time\n", - " gsi_improvement = ((baseline_time - gsi_time) / baseline_time) * 100\n", - " print(f\"GSI Index Benefit: {gsi_speedup:.2f}x faster ({gsi_improvement:.1f}% improvement)\")\n", + "# Vector Index improvement analysis\n", + "if optimized_time and baseline_time and optimized_time < baseline_time:\n", + " index_speedup = baseline_time / optimized_time\n", + " index_improvement = ((baseline_time - optimized_time) / baseline_time) * 100\n", + " print(f\"Vector Index Benefit: {index_speedup:.2f}x faster ({index_improvement:.1f}% improvement)\")\n", "else:\n", - " print(f\"GSI Index Benefit: Performance similar to baseline (may vary with dataset size)\")\n", + " print(f\"Vector Index Benefit: Performance similar to baseline (may vary with dataset size)\")\n", "\n", "# Cache improvement analysis\n", "if cache_time_2 and cache_time_1 and cache_time_2 < cache_time_1:\n", @@ -1087,10 +1080,10 @@ " print(f\"Cache Benefit: No significant improvement (results may be cached already)\")\n", "\n", "print(f\"\\nKey Insights:\")\n", - "print(f\"• GSI optimization provides consistent performance benefits, especially with larger datasets\")\n", - "print(f\"• Caching benefits apply to both raw and GSI-optimized searches\")\n", - "print(f\"• Combined GSI + Cache provides the best performance for production applications\")\n", - "print(f\"• BHIVE indexes scale to billions of vectors with optimized concurrent operations\")" + "print(f\"• Hyperscale optimization provides consistent performance benefits, especially with larger datasets\")\n", + "print(f\"• Caching benefits apply to both raw and optimized searches\")\n", + "print(f\"• Combined Hyperscale + Cache provides the best performance for production applications\")\n", + "print(f\"• Hyperscale indexes scale to billions of vectors with optimized concurrent operations\")" ] }, { @@ -1106,12 +1099,12 @@ "id": "a390746a", "metadata": {}, "source": [ - "Try your own queries with the optimized search system:" + "Try your own queries with the optimized Hyperscale search system:" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 12, "id": "26b9d9f4", "metadata": {}, "outputs": [ @@ 
-1120,7 +1113,7 @@ "output_type": "stream", "text": [ "\n", - "=== INTERACTIVE GSI-OPTIMIZED SEARCH ===\n", + "=== INTERACTIVE OPTIMIZED SEARCH ===\n", "Query: \"What is the sample data?\"\n", "Search Time: 0.0812 seconds\n", "Results Found: 3 documents\n", @@ -1131,11 +1124,11 @@ "\n", "[Result 2]\n", "Vector Distance: 0.860599 (lower = more similar)\n", - "Document Content: It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", + "Document Content: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n", "\n", "[Result 3]\n", "Vector Distance: 0.909207 (lower = more similar)\n", - "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.\n" + "Document Content: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n" ] }, { @@ -1144,20 +1137,20 @@ "(0.08118820190429688,\n", " [(Document(id='e20a8dcd8b464e8e819b87c9a0ff05c3', metadata={}, page_content='this is a sample text with the data \"hello\"'),\n", " 0.6236441411684932),\n", - " (Document(id='0442f351aec2415481138315d492ee80', metadata={}, page_content='It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.'),\n", + " (Document(id='0442f351aec2415481138315d492ee80', metadata={}, page_content='It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.'),\n", " 0.8605992009935179),\n", - " (Document(id='7c601881e4bf4c53b5b4c2a25628d904', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.'),\n", + " (Document(id='7c601881e4bf4c53b5b4c2a25628d904', metadata={}, page_content='Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.'),\n", " 0.9092065785676496)])" ] }, - "execution_count": 14, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "custom_query = input(\"Enter your search query: \")\n", - "search_with_performance_metrics(custom_query, \"Interactive GSI-Optimized Search\")\n" + "search_with_performance_metrics(custom_query, \"Interactive Optimized Search\")\n" ] }, { @@ -1175,7 +1168,7 @@ "lines_to_next_cell": 3 }, "source": [ - "You have successfully built a powerful semantic search engine using Couchbase's GSI vector search capabilities and Hugging Face embeddings. This guide has walked you through the complete process of creating a high-performance vector search system that can scale to handle billions of documents." + "You have successfully built a powerful semantic search engine using Couchbase's Hyperscale and Composite Vector Indexes with Hugging Face embeddings. 
This guide has walked you through the complete process of creating a high-performance vector search system that can scale to handle billions of documents." ] } ], diff --git a/huggingface/gsi/.env.sample b/huggingface/search_based/.env.sample similarity index 100% rename from huggingface/gsi/.env.sample rename to huggingface/search_based/.env.sample diff --git a/huggingface/fts/frontmatter.md b/huggingface/search_based/frontmatter.md similarity index 60% rename from huggingface/fts/frontmatter.md rename to huggingface/search_based/frontmatter.md index 4367aef0..06769005 100644 --- a/huggingface/fts/frontmatter.md +++ b/huggingface/search_based/frontmatter.md @@ -1,18 +1,18 @@ --- # frontmatter -path: "/tutorial-huggingface-couchbase-vector-search-with-fts" -title: Using Hugging Face Embeddings with Couchbase Vector Search using FTS Service -short_title: Hugging Face with Couchbase Vector Search using FTS Service +path: "/tutorial-huggingface-couchbase-vector-search-with-search-vector-index" +title: Using Hugging Face Embeddings with Couchbase Search Vector Index +short_title: Hugging Face with Couchbase Search Vector Index description: - Learn how to generate embeddings using Hugging Face and store them in Couchbase. - This tutorial demonstrates how to use Couchbase's vector search capabilities with Hugging Face embeddings. - - You'll understand how to perform vector search to find relevant documents based on similarity using FTS Service. + - You'll understand how to perform vector search to find relevant documents based on similarity using Search Vector Index. content_type: tutorial filter: sdk technology: - vector search tags: - - FTS + - Search Vector Index - Artificial Intelligence - Hugging Face sdk_language: diff --git a/huggingface/fts/hugging_face.ipynb b/huggingface/search_based/hugging_face.ipynb similarity index 77% rename from huggingface/fts/hugging_face.ipynb rename to huggingface/search_based/hugging_face.ipynb index 31b436a8..efb371aa 100644 --- a/huggingface/fts/hugging_face.ipynb +++ b/huggingface/search_based/hugging_face.ipynb @@ -5,9 +5,13 @@ "id": "4c60986a", "metadata": {}, "source": [ - "# Introduction\n", + "## Introduction\n", "\n", - "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Hugging Face](https://huggingface.co/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at [this.](https://developer.couchbase.com//tutorial-huggingface-couchbase-vector-search-with-global-secondary-index)" + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Hugging Face](https://huggingface.co/) as the AI-powered embedding model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.\n", + "\n", + "This tutorial uses Couchbase's **Search Vector Index** for vector similarity search. 
For more information on vector indexes, see the [Couchbase Vector Index Documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).\n", + "\n", + "This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using Hyperscale or Composite Vector Indexes, please take a look at [this tutorial](https://developer.couchbase.com/tutorial-huggingface-couchbase-vector-search-with-hyperscale-or-composite-vector-index)." ] }, { @@ -15,9 +19,9 @@ "id": "6178e6b3", "metadata": {}, "source": [ - "# How to run this tutorial\n", + "## How to Run This Tutorial\n", "\n", - "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/fts/hugging_face.ipynb).\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/huggingface/search_based/hugging_face.ipynb).\n", "\n", "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment." ] @@ -27,9 +31,9 @@ "id": "ef73d80c", "metadata": {}, "source": [ - "# Before you start\n", + "## Before You Start\n", "\n", - "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "### Create and Deploy Your Free Tier Operational Cluster on Capella\n", "\n", "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
This account provides you with an environment where you can explore and learn about Capella with no time constraint.\n", "\n", @@ -48,12 +52,12 @@ "id": "77308721", "metadata": {}, "source": [ - "# Install necessary libraries" + "## Install Necessary Libraries" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "208a54a1", "metadata": {}, "outputs": [], @@ -66,7 +70,7 @@ "id": "9470f9e3-311b-45c8-81c3-baa5fe0995d2", "metadata": {}, "source": [ - "# Imports" + "## Imports" ] }, { @@ -98,8 +102,9 @@ "id": "041a3edf-f5f7-43e1-99b9-b775e94fbfe6", "metadata": {}, "source": [ - "# Prerequisites\n", - "In order to run this tutorial, you will need access to a Couchbase Cluster with Full Text Search service either through Couchbase Capella or by running it locally and have credentials to acces a collection on that cluster:" + "## Prerequisites\n", + "\n", + "In order to run this tutorial, you will need access to a Couchbase Cluster with Search Service enabled either through Couchbase Capella or by running it locally, and have credentials to access a collection on that cluster:" ] }, { @@ -126,7 +131,8 @@ "id": "15edfec2-64bd-4ba1-b072-4fadacddb01a", "metadata": {}, "source": [ - "# Couchbase Connection\n", + "## Couchbase Connection\n", + "\n", "In this section, we first need to create a `PasswordAuthenticator` object that would hold our Couchbase credentials:" ] }, @@ -182,8 +188,13 @@ "id": "625881d5-39e2-44ed-bbca-0db67e98f765", "metadata": {}, "source": [ - "# Creating Couchbase Vector Search Index\n", - "In order to store generated with Hugging Face embeddings onto a Couchbase Cluster, a vector search index needs to be created first. We included a sample index definition that will work with this tutorial in a file named `huggingface_index.json` located in the folder with this tutorial. The definition can be used to create a vector index using Couchbase server web console, on more information on vector indexes, please read [Create a Vector Search Index with the Server Web Console](https://docs.couchbase.com/server/current/vector-search/create-vector-search-index-ui.html). Please note that the index is configured for documents from bucket `hugginface`, scope `_default` and collection `huggingface` and you will have to edit `source` and document type name in the index definition file if your collection, scope or bucket names are different.\n", + "## Creating Couchbase Search Vector Index\n", + "\n", + "In order to store Hugging Face-generated embeddings onto a Couchbase Cluster, a Search Vector Index needs to be created first. We included a sample index definition that will work with this tutorial in a file named `huggingface_index.json` located in the folder with this tutorial.\n", + "\n", + "The definition can be used to create a Search Vector Index using the Couchbase Server web console. For more information on vector indexes, please read [Create a Vector Search Index with the Server Web Console](https://docs.couchbase.com/server/current/vector-search/create-vector-search-index-ui.html).\n", + "\n", + "Please note that the index is configured for documents from bucket `huggingface`, scope `_default` and collection `huggingface`. 
You will need to edit the `source` and document type name in the index definition file if your collection, scope, or bucket names are different.\n", "\n", "Here, our code verifies the existence of the index and will throw an exception if the index has not been found:" ] @@ -213,7 +224,7 @@ "id": "d71a7207-54d1-44fd-aa9d-d361b42d2c96", "metadata": {}, "source": [ - "# Hugging Face Initialization" + "## Hugging Face Initialization" ] }, { @@ -240,8 +251,9 @@ "id": "c0d8e261-d670-4c40-8037-3d4e3084c360", "metadata": {}, "source": [ - "# Embedding Documents\n", - "After initializing Hugging Face transformers library, it can be used to generate vector embeddings for user input or predefined set of phrases. Here, we're generating 2 embeddings for contained in the array strings:" + "## Embedding Documents\n", + "\n", + "After initializing the Hugging Face transformers library, it can be used to generate vector embeddings for user input or a predefined set of phrases. Here, we're generating embeddings for the strings contained in the array:" ] }, { @@ -266,8 +278,9 @@ "id": "80814e90-699f-4201-8cd3-7ef8adab9966", "metadata": {}, "source": [ - "# Storing Embeddings in Couchbase\n", - "Generated embeddings are then stored as vector fields inside documents that can contain additional information about the vector, including the original text. The documents are then upserted onto the couchbase cluster:" + "## Storing Embeddings in Couchbase\n", + "\n", + "Generated embeddings are then stored as vector fields inside documents that can contain additional information about the vector, including the original text. The documents are then upserted onto the Couchbase cluster:" ] }, { @@ -291,8 +304,9 @@ "id": "f11a0d98-bcf5-4fe4-b602-6e8a23edf95e", "metadata": {}, "source": [ - "# Searching For Embeddings\n", - "After the documents are upserted onto the cluster, their vector fields will be added into previously imported vector index. Later, new embeddings can be added or used to perform a similarity search on the previously added documents:" + "## Searching For Embeddings\n", + "\n", + "After the documents are upserted onto the cluster, their vector fields will be added to the previously imported Search Vector Index. Later, new embeddings can be added or used to perform a similarity search on the previously added documents:" ] }, { diff --git a/huggingface/fts/huggingface_index.json b/huggingface/search_based/huggingface_index.json similarity index 100% rename from huggingface/fts/huggingface_index.json rename to huggingface/search_based/huggingface_index.json
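
For readers skimming this rename-heavy diff, a minimal end-to-end sketch of the query-based (Hyperscale) flow the updated notebook describes might look like the following. It is only a sketch: the import paths, the embedding model name, the placeholder endpoint and credentials, and the `scope_name`/`collection_name`/`embedding` keyword arguments are assumptions not shown in the hunks above, while `CouchbaseQueryVectorStore`, `IndexType.HYPERSCALE`, `DistanceStrategy.COSINE`, and the `create_index(...)` parameters are taken from the diff itself.

```python
# Minimal sketch of the Hyperscale Vector Index flow described in the renamed notebook.
# Import paths, the embedding model name, and the scope/collection/embedding kwargs are
# assumptions for illustration; check the notebook and package docs for exact names.
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_couchbase.vectorstores import (  # assumed module path
    CouchbaseQueryVectorStore,
    DistanceStrategy,
    IndexType,
)

# Connect to Couchbase (Capella or local) with placeholder credentials/endpoint
auth = PasswordAuthenticator("username", "password")
cluster = Cluster("couchbases://your-capella-endpoint", ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=10))

# Hugging Face embedding model (example model; the tutorial may use a different one)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Vector store backed by the Query service; scope/collection/embedding kwargs are assumed
vector_store = CouchbaseQueryVectorStore(
    cluster=cluster,
    bucket_name="huggingface",
    scope_name="_default",          # assumed kwarg
    collection_name="huggingface",  # assumed kwarg
    embedding=embeddings,           # assumed kwarg
)

# Store a document, then build the Hyperscale index (parameters mirror the diff above)
vector_store.add_texts(["Couchbase Server is a distributed NoSQL database."])
vector_store.create_index(
    index_type=IndexType.HYPERSCALE,
    index_description="IVF,SQ8",
    distance_metric=DistanceStrategy.COSINE,
    index_name="huggingface_hyperscale_index",
)

# Vector similarity search (lower distance = more similar)
for doc, score in vector_store.similarity_search_with_score(
    "What are the key features of a scalable NoSQL database?", k=3
):
    print(score, doc.page_content)
```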