diff --git a/autovec_unstructured/__frontmatter__.md b/autovec_unstructured/__frontmatter__.md
new file mode 100644
index 00000000..a33b8fb8
--- /dev/null
+++ b/autovec_unstructured/__frontmatter__.md
@@ -0,0 +1,18 @@
+---
+# frontmatter
+path: "/tutorial-couchbase-autovectorization-langchain"
+title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services
+short_title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets
+description:
+  - Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your unstructured data into vector embeddings.
+  - This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain.
+content_type: tutorial
+filter: sdk
+technology:
+  - Artificial Intelligence
+tags:
+  - LangChain
+sdk_language:
+  - python
+length: 20 Mins
+---
diff --git a/autovec_unstructured/autovec_unstructured.ipynb b/autovec_unstructured/autovec_unstructured.ipynb
new file mode 100644
index 00000000..9e959bb7
--- /dev/null
+++ b/autovec_unstructured/autovec_unstructured.ipynb
@@ -0,0 +1,385 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f623039",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true
+   },
+   "source": [
+    "# Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services \n",
+    "This tutorial demonstrates how to use Couchbase Capella's AI Services Auto-Vectorization feature to import unstructured data from S3 buckets into Capella, automatically convert it into vector embeddings, and perform semantic search using LangChain."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4d47a8a",
+   "metadata": {},
+   "source": [
+    "# 1. Create and Deploy Your Operational Cluster on Capella\n",
+    "To get started with Couchbase Capella, create an account and use it to deploy a cluster. \n",
+    "\n",
+    "Make sure that you deploy a `Multi-node` cluster with the `data`, `index`, `query`, `search`, and `eventing` services enabled; the Search service is required for the vector searches performed later in this tutorial. To learn more, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n",
+    " \n",
+    "### Couchbase Capella Configuration\n",
+    "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n",
+    "- Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket you will be using for this tutorial (e.g., `Unstructured_data_bucket`) with Read and Write permissions.\n",
+    "- [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the cluster from the IP address on which the application is running."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a08bd871-e20d-4362-b5c1-765737894c65",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true
+   },
+   "source": [
+    "# 2. Deploying the Model\n",
+    "Before we create embeddings for the documents, we need to deploy a model that will generate the embeddings for us.\n",
+    "## 2.1: Selecting the Model \n",
+    "1. To select the model, navigate to the \"AI Services\" tab, then select \"Models\" and click on \"Deploy New Model\".\n",
+    " \n",
+    " \n",
+    "\n",
+    "2. Enter the model name and choose the model that you want to deploy. After selecting your model, choose the model infrastructure and the region where the model will be deployed.\n",
+    " \n",
+    " \n",
+    "\n",
+    "## 2.2: Access Control to the Model\n",
+    "\n",
+    "1. After deploying the model, go to the \"Models\" tab in AI Services and click on \"Setup Access\".\n",
+    "\n",
+    " \n",
+    "\n",
+    "2. Enter your API key name, expiration time, and the IP address from which you will access the model.\n",
+    "\n",
+    " \n",
+    "\n",
+    "3. Download your API key.\n",
+    "\n",
+    " "
+   ]
+  },
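+  {
+   "cell_type": "markdown",
+   "id": "2c3e0001",
+   "metadata": {},
+   "source": [
+    "## 2.3: Verify the Deployed Model (Optional)\n",
+    "\n",
+    "Before building the workflow, you can sanity-check the deployed model and API key. The cell below is a minimal sketch that anticipates the configuration used in section 4: it assumes the `langchain-openai` package is installed, and `CAPELLA_MODEL_KEY` / `CAPELLA_MODEL_ENDPOINT` are placeholders for the API key downloaded above and the endpoint shown in the Models section (note the trailing `/v1`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c3e0002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "# Placeholders: the API key downloaded in step 2.2 and the endpoint from the Models section\n",
+    "test_embedder = OpenAIEmbeddings(\n",
+    "    model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\",\n",
+    "    openai_api_key=\"CAPELLA_MODEL_KEY\",\n",
+    "    openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",  # append /v1 if it is not shown in the UI\n",
+    "    check_embedding_ctx_length=False,\n",
+    "    tiktoken_enabled=False,\n",
+    ")\n",
+    "\n",
+    "# A successful call returns one vector; its length is the model's embedding dimensionality\n",
+    "print(len(test_embedder.embed_query(\"hello\")))"
+   ]
+  },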
+  {
+   "cell_type": "markdown",
+   "id": "e7552113",
+   "metadata": {},
+   "source": [
+    "# 3. Data Upload from S3 Bucket to Couchbase (with Chunking and Vectorization)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc1b64dd-495b-4358-b732-d01856713b70",
+   "metadata": {},
+   "source": [
+    "To import unstructured data from an S3 bucket, you need to create a workflow that connects to your S3 bucket and chunks your unstructured data before importing it into the collections. To do so, follow the steps below:\n",
+    "1) Start by creating a new workflow: click on the `AI Services` tab, then `Workflows`, and then `Create New Workflow`.\n",
+    " \n",
+    " \n",
+    " \n",
+    "2) Start your workflow deployment by selecting where your data will be provided to the Auto-Vectorization service. There are currently three options: `pre-processed data (JSON format) from Capella`, `pre-processed data (JSON format) from external sources (S3 buckets)`, and `unstructured data from external sources (S3 buckets)`. For this tutorial, choose the third option, unstructured data from external sources (S3 buckets). After selecting the workflow type, enter the workflow name and click on `Start Workflow`.\n",
+    " \n",
+    " \n",
+    "\n",
+    "3) To proceed, Capella needs to connect to the S3 bucket that will be the source of the data. To set this up, click on `+ Add New S3 Bucket`.\n",
+    "\n",
+    " \n",
+    "\n",
+    "4) Upon clicking `+ Add New S3 Bucket`, a sidebar will appear asking for the credentials of your S3 bucket.\n",
+    "\n",
+    " \n",
+    " \n",
+    " - Enter the `Integration Name`, which will later be used to select your S3 bucket.\n",
+    " - Select the AWS Region where the bucket is deployed.\n",
+    " - Enter the name of the S3 bucket deployed in AWS.\n",
+    " - Enter the path where your unstructured data is located.\n",
+    " - Enter your S3 bucket credentials.\n",
+    " - Click on `Add Credentials`.\n",
+    "5) If the steps above are followed correctly, you should see a success pop-up as shown below, and the S3 bucket can then be selected from the drop-down menu.\n",
+    "\n",
+    " \n",
+    "\n",
+    "6) Once the S3 bucket is selected, the following options are displayed.\n",
+    "\n",
+    " \n",
+    "- `Index Configuration` allows the workflow to **automatically create a Search index** on the generated embeddings. This Search index is essential for performing vector similarity searches. \n",
+    " - If you enable this option (recommended), the workflow will create a properly configured Search index that includes vector field mappings for your embeddings.\n",
+    " - If you skip this step, you'll need to manually create a Search index later before you can perform vector searches. See the [Search Index Creation Guide](https://docs.couchbase.com/server/current/search/create-search-indexes.html) for manual setup instructions; an illustrative sketch of the vector field mapping follows this section.\n",
+    "- `Destination Cluster` lets you choose the cluster, bucket, scope, and collection into which the data will be imported.\n",
+    "- The `Estimated Cost` dialog (in blue, on the right) shows the cost of the operation per document.\n",
+    "- Click on `Next`.\n",
+    " \n",
+    "7) `Configure Data Preprocessing` lets you perform various operations on the data being imported from the S3 bucket, as described below.\n",
+    " \n",
+    " \n",
+    "- `Page Range selection` allows you to select a custom page range when working with PDFs. (Optional)\n",
+    "- `Layout Exclusions` allows you to skip various unnecessary objects in your unstructured data. (Optional)\n",
+    "- `Optical Character Recognition (OCR)` allows you to detect text in images/PDFs. (Optional)\n",
+    "- `Chunking Strategy` is an important step for importing data and creating embeddings (vectors) in Capella; its settings are described below.\n",
+    " - The `Strategy` dropdown selects how the data in the S3 bucket will be chunked; the best choice depends on the kind of data in the bucket.\n",
+    " - `Max Token in Chunk` sets the number of tokens in each chunk.\n",
+    " - `Chunk Overlap` sets the number of tokens shared between consecutive chunks, which helps preserve context across chunk boundaries.\n",
+    "- Click `Next` after the options above have been set according to your requirements.\n",
+    "\n",
+    "8) Select the model that will be used to create the embeddings. There are two options: a `Capella-based` model or an `external model`.\n",
+    "\n",
+    " \n",
+    " \n",
+    " - For this tutorial, a Capella-based embedding model is used, as shown in the image above. API credentials can be uploaded using the file downloaded in step 2.2, or entered manually.\n",
+    " - You can also choose between private and insecure networking.\n",
+    " - Clicking `Next` takes you to the final page of the workflow.\n",
+    " \n",
+    "9) `Workflow Summary` displays all the necessary details of the workflow, including `Data Source`, `Model Service`, `Unstructured Data Service`, and `Billing Overview`, as shown in the image below.\n",
+    "\n",
+    " \n",
+    "\n",
+    "10) `Workflow Deployed`: in the `Workflows` tab you can now see the deployed workflow and check the status of each workflow run.\n",
+    "\n",
+    " \n",
+    "\n",
+    "\n",
+    " After this step, your vector embeddings for the selected fields should be ready, and you can inspect them in the Capella UI. In the next section, we will demonstrate how to use the generated vectors to perform vector search."
+   ]
+  },
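+  {
+   "cell_type": "markdown",
+   "id": "3a1b0001",
+   "metadata": {},
+   "source": [
+    "If you skipped the automatic index creation in step 6, you will need to define a Search index with a vector field mapping yourself. The cell below is an illustrative sketch of the relevant portion of such an index definition, not a complete or authoritative one: the field names match this tutorial's workflow, the `dims` value must match your embedding model's output size, and the exact JSON should follow the [Search Index Creation Guide](https://docs.couchbase.com/server/current/search/create-search-indexes.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a1b0002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch only: the vector-field portion of a Search index definition\n",
+    "vector_field_mapping = {\n",
+    "    \"types\": {\n",
+    "        \"_default._default\": {  # the scope.collection chosen as the workflow destination\n",
+    "            \"dynamic\": False,\n",
+    "            \"properties\": {\n",
+    "                \"text-embedding\": {\n",
+    "                    \"fields\": [\n",
+    "                        {\n",
+    "                            \"name\": \"text-embedding\",     # field the workflow writes vectors to\n",
+    "                            \"type\": \"vector\",             # marks this as a vector field\n",
+    "                            \"dims\": 2048,                 # must match the embedding model's output size\n",
+    "                            \"similarity\": \"dot_product\",  # or \"l2_norm\", depending on the model\n",
+    "                            \"index\": True,\n",
+    "                        }\n",
+    "                    ]\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "}"
+   ]
+  },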
+  {
+   "cell_type": "markdown",
+   "id": "4f7321a7",
+   "metadata": {},
+   "source": [
+    "# 4. Vector Search Using Couchbase Search Service\n",
+    "\n",
+    "The following code cells implement semantic vector search against the embeddings generated by the Auto-Vectorization workflow. These searches are powered by **Couchbase's Search service**.\n",
+    "\n",
+    "Before you proceed, make sure the following packages are installed by running:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0298d27f-ee03-4de2-829d-b653c39746a9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install langchain-couchbase langchain-openai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ea920e0f-bd81-4a74-841a-86a11cb8aec4",
+   "metadata": {},
+   "source": [
+    "`langchain-couchbase - Version: 0.5.0` \\\n",
+    "`langchain-openai - Version: 0.3.34` \n",
+    "\n",
+    "Now execute the cells in order to run the vector similarity search.\n",
+    "\n",
+    "# Importing Required Packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5e8ba0fc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import timedelta  # used below for the wait_until_ready timeout\n",
+    "\n",
+    "from couchbase.cluster import Cluster\n",
+    "from couchbase.auth import PasswordAuthenticator\n",
+    "from couchbase.options import ClusterOptions\n",
+    "\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4f8428f2-f923-42df-bf7d-beacf5b38f16",
+   "metadata": {},
+   "source": [
+    "# Cluster Connection Setup\n",
+    " - Defines the secure connection string and user credentials, and creates a `Cluster` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "f44ea528-1ec1-41ce-90db-bdd0d87b5cff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"CLUSTER_CONNECTION_STRING\"  # Replace this with your connection string\n",
+    "username = \"YOUR_USERNAME\"  # Replace this with your username\n",
+    "password = \"YOUR_PASSWORD\"  # Replace this with your password\n",
+    "auth = PasswordAuthenticator(username, password)\n",
+    "\n",
+    "options = ClusterOptions(auth)\n",
+    "cluster = Cluster(endpoint, options)\n",
+    "\n",
+    "cluster.wait_until_ready(timedelta(seconds=5))"
+   ]
+  },
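+  {
+   "cell_type": "markdown",
+   "id": "4d5e0001",
+   "metadata": {},
+   "source": [
+    "# Optional: Spot-Check a Vectorized Document\n",
+    " - A quick way to confirm the workflow populated the collection before building the vector store.\n",
+    " - A minimal sketch using SQL++ (requires the Query service); the bucket, scope, collection, and field names assume the destination configured in step 3.6 and the fields used later in this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d5e0002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Adjust the names below if your workflow used a different destination or field names\n",
+    "rows = cluster.query(\n",
+    "    \"SELECT t.`text-to-embed` AS chunk, ARRAY_LENGTH(t.`text-embedding`) AS dims \"\n",
+    "    \"FROM `Unstructured_data_bucket`.`_default`.`_default` AS t LIMIT 1\"\n",
+    ")\n",
+    "for row in rows:\n",
+    "    print(row[\"dims\"], row[\"chunk\"][:80])  # embedding size and a preview of the chunk text"
+   ]
+  },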
+  {
+   "cell_type": "markdown",
+   "id": "c0874f89",
+   "metadata": {},
+   "source": [
+    "# Selection of Buckets / Scope / Collection / Index / Embedder\n",
+    " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n",
+    " - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 3.6) or manually as described in the same step. You can find this index name in the **Search** tab of your Capella cluster.\n",
+    " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n",
+    " - `openai_api_key` is the API key created in step 2.2.\n",
+    " - `openai_api_base` is the Capella Model Services endpoint found in the Models section.\n",
+    " - For more details, visit [OpenAIEmbeddings](https://docs.langchain.com/oss/python/integrations/text_embedding/openai).\n",
+    "\n",
+    "`Note that the Capella AI endpoint requires an additional /v1 appended to it if not already shown in the UI.`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1d77404b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bucket_name = \"Unstructured_data_bucket\"\n",
+    "scope_name = \"_default\"\n",
+    "collection_name = \"_default\"\n",
+    "index_name = \"hyperscale_autovec_workflow_text-to-embed\"  # Name of the Search index created in step 3.6; also visible in the Search tab of the cluster\n",
+    " \n",
+    "# Capella Model Services expose an OpenAI-compatible API, so they work with LangChain's OpenAIEmbeddings class\n",
+    "embedder = OpenAIEmbeddings(\n",
+    "    model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\",  # Model used to embed the query\n",
+    "    openai_api_key=\"CAPELLA_MODEL_KEY\",\n",
+    "    openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n",
+    "    check_embedding_ctx_length=False,\n",
+    "    tiktoken_enabled=False,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1b9ac43",
+   "metadata": {},
+   "source": [
+    "# VectorStore Construction\n",
+    " - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n",
+    " - The vector store:\n",
+    "   * Knows where to read documents (`bucket/scope/collection`).\n",
+    "   * References the Search index (`index_name`) that contains vector field mappings.\n",
+    "   * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n",
+    "   * Uses the provided embedder to embed queries on demand for similarity search.\n",
+    " - If your Auto-Vectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n",
+    " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "8efd0e80",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vector_store = CouchbaseSearchVectorStore(\n",
+    "    cluster=cluster,\n",
+    "    bucket_name=bucket_name,\n",
+    "    scope_name=scope_name,\n",
+    "    collection_name=collection_name,\n",
+    "    embedding=embedder,\n",
+    "    index_name=index_name,\n",
+    "    text_key=\"text-to-embed\",  # Your document's text field\n",
+    "    embedding_key=\"text-embedding\"  # Field in which the vector (embedding) is stored in the cluster\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "17adeeed",
+   "metadata": {},
+   "source": [
+    "# Performing a Similarity Search\n",
+    " - Defines a natural language query (e.g., \"How to setup java SDK?\").\n",
+    " - Calls `similarity_search_with_score(query, k=3)` to retrieve the top 3 most semantically similar documents using **Couchbase's Search service**.\n",
+    " - The Search service performs efficient vector similarity search using the index created earlier.\n",
+    " - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`).\n",
+    " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n",
+    " - Adjust `k` for more or fewer results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "eb87c6e6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. — Score: 0.8052 — Content: Section Title: Set Up the Java SDK\n",
+      "Content: Run the command mvn install to pull in all the dependencies and finish your SDK setup.\n",
+      "2. — Score: 0.7971 — Content: Section Title: Set Up the Java SDK\n",
+      "Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom. xml. Paste the following code block into your pom. xm1 file: Open a terminal window and navigate to your student directory.\n",
+      "3. — Score: 0.7745 — Content: Section Title: Prerequisites\n",
+      "Content: e You have installed the Java Software Development Kit (version 8, 11, 17, or 21). o The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.\n"
+     ]
+    }
+   ],
+   "source": [
+    "query = \"How to setup java SDK?\"\n",
+    "results = vector_store.similarity_search_with_score(query, k=3)\n",
+    "\n",
+    "for rank, (doc, score) in enumerate(results, start=1):\n",
+    "    text = getattr(doc, \"page_content\", None)\n",
+    "    print(f\"{rank}. — Score: {score:.4f} — Content: {text}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5ab91ee",
+   "metadata": {},
+   "source": [
+    "# Results and Interpretation\n",
+    "\n",
+    "As we can see, 3 (or `k`) ranked results are printed in the output.\n",
+    "\n",
+    "### What Each Part Means\n",
+    "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n",
+    "- Content text: This is the value of the field you configured as `text_key` (in this tutorial: `text-to-embed`). It represents the human-readable content we chose to display.\n",
+    "\n",
+    "### How the Ranking Works with Search Service\n",
+    "1. Your natural language query (e.g., `query = \"How to setup java SDK?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
+    "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"text-embedding\"`).\n",
+    "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
+    "\n",
+    "\n",
+    "> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query—even when exact keywords do not match. Feel free to experiment with more descriptive queries to observe the semantic power of the embeddings served by Couchbase's Search service."
+   ]
+  },
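+  {
+   "cell_type": "markdown",
+   "id": "6f7a0001",
+   "metadata": {},
+   "source": [
+    "# Next Step: Using the Vector Store as a LangChain Retriever\n",
+    "\n",
+    "The same vector store plugs directly into LangChain retrieval pipelines (for example, RAG chains) via `as_retriever`. Below is a minimal sketch reusing the `vector_store` built above; `k` in `search_kwargs` controls how many documents are returned."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f7a0002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Wrap the vector store in a LangChain retriever\n",
+    "retriever = vector_store.as_retriever(search_kwargs={\"k\": 3})\n",
+    "\n",
+    "docs = retriever.invoke(\"How to setup java SDK?\")\n",
+    "for doc in docs:\n",
+    "    print(doc.page_content[:100])  # preview each retrieved chunk"
+   ]
+  }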
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/autovec_unstructured/img/S3bucketsuccess.png b/autovec_unstructured/img/S3bucketsuccess.png
new file mode 100644
index 00000000..9aa75d6d
Binary files /dev/null and b/autovec_unstructured/img/S3bucketsuccess.png differ
diff --git a/autovec_unstructured/img/S3credentials.png b/autovec_unstructured/img/S3credentials.png
new file mode 100644
index 00000000..c9a74a41
Binary files /dev/null and b/autovec_unstructured/img/S3credentials.png differ
diff --git a/autovec_unstructured/img/Select_embedding_model.png b/autovec_unstructured/img/Select_embedding_model.png
new file mode 100644
index 00000000..a91e6862
Binary files /dev/null and b/autovec_unstructured/img/Select_embedding_model.png differ
diff --git a/autovec_unstructured/img/addS3bucket.png b/autovec_unstructured/img/addS3bucket.png
new file mode 100644
index 00000000..9566d59a
Binary files /dev/null and b/autovec_unstructured/img/addS3bucket.png differ
diff --git a/autovec_unstructured/img/configure_data_source.png b/autovec_unstructured/img/configure_data_source.png
new file mode 100644
index 00000000..d9ba090a
Binary files /dev/null and b/autovec_unstructured/img/configure_data_source.png differ
diff --git a/autovec_unstructured/img/data_processing.png b/autovec_unstructured/img/data_processing.png
new file mode 100644
index 00000000..46cf504e
Binary files /dev/null and b/autovec_unstructured/img/data_processing.png differ
diff --git a/autovec_unstructured/img/deploying_model.png b/autovec_unstructured/img/deploying_model.png
new file mode 100644
index 00000000..5b830341
Binary files /dev/null and b/autovec_unstructured/img/deploying_model.png differ
diff --git a/autovec_unstructured/img/download_api_key_details.png b/autovec_unstructured/img/download_api_key_details.png
new file mode 100644
index 00000000..8ee7dc82
Binary files /dev/null and b/autovec_unstructured/img/download_api_key_details.png differ
diff --git a/autovec_unstructured/img/importing_model.png b/autovec_unstructured/img/importing_model.png
new file mode 100644
index 00000000..41e80e92
Binary files /dev/null and b/autovec_unstructured/img/importing_model.png differ
diff --git a/autovec_unstructured/img/model_api_key_form.png b/autovec_unstructured/img/model_api_key_form.png
new file mode 100644
index 00000000..0713a53c
Binary files /dev/null and b/autovec_unstructured/img/model_api_key_form.png differ
diff --git a/autovec_unstructured/img/model_setup_access.png b/autovec_unstructured/img/model_setup_access.png
new file mode 100644
index 00000000..91dfae79
Binary files /dev/null and b/autovec_unstructured/img/model_setup_access.png differ
diff --git a/autovec_unstructured/img/start_workflow.png b/autovec_unstructured/img/start_workflow.png
new file mode 100644
index 00000000..1d025b7a
Binary files /dev/null and b/autovec_unstructured/img/start_workflow.png differ
diff --git a/autovec_unstructured/img/workflow.png b/autovec_unstructured/img/workflow.png
new file mode 100644
index 00000000..fcf8a0c6
Binary files /dev/null and b/autovec_unstructured/img/workflow.png differ
diff --git a/autovec_unstructured/img/workflow_deployed.png b/autovec_unstructured/img/workflow_deployed.png
new file mode 100644
index 00000000..73868796
Binary files /dev/null and b/autovec_unstructured/img/workflow_deployed.png differ
diff --git a/autovec_unstructured/img/workflow_summary.png b/autovec_unstructured/img/workflow_summary.png
new file mode 100644
index 00000000..1ff16489
Binary files /dev/null and b/autovec_unstructured/img/workflow_summary.png differ
diff --git a/autovec_unstructured/sample.pdf b/autovec_unstructured/sample.pdf
new file mode 100644
index 00000000..381c7740
Binary files /dev/null and b/autovec_unstructured/sample.pdf differ