diff --git a/cloud-service-providers/google-cloud/vertexai/python/README.md b/cloud-service-providers/google-cloud/vertexai/llm/README.md similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/README.md rename to cloud-service-providers/google-cloud/vertexai/llm/README.md diff --git a/cloud-service-providers/google-cloud/vertexai/python/imgs/vertexai_01.png b/cloud-service-providers/google-cloud/vertexai/llm/imgs/vertexai_01.png similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/imgs/vertexai_01.png rename to cloud-service-providers/google-cloud/vertexai/llm/imgs/vertexai_01.png diff --git a/cloud-service-providers/google-cloud/vertexai/python/imgs/vertexai_02.png b/cloud-service-providers/google-cloud/vertexai/llm/imgs/vertexai_02.png similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/imgs/vertexai_02.png rename to cloud-service-providers/google-cloud/vertexai/llm/imgs/vertexai_02.png diff --git a/cloud-service-providers/google-cloud/vertexai/python/nim-vertexai.ipynb b/cloud-service-providers/google-cloud/vertexai/llm/nim-vertexai.ipynb similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/nim-vertexai.ipynb rename to cloud-service-providers/google-cloud/vertexai/llm/nim-vertexai.ipynb diff --git a/cloud-service-providers/google-cloud/vertexai/python/requirements.txt b/cloud-service-providers/google-cloud/vertexai/llm/requirements.txt similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/requirements.txt rename to cloud-service-providers/google-cloud/vertexai/llm/requirements.txt diff --git a/cloud-service-providers/google-cloud/vertexai/python/samples/request.json b/cloud-service-providers/google-cloud/vertexai/llm/samples/request.json similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/samples/request.json rename to 
cloud-service-providers/google-cloud/vertexai/llm/samples/request.json diff --git a/cloud-service-providers/google-cloud/vertexai/python/samples/request_stream.json b/cloud-service-providers/google-cloud/vertexai/llm/samples/request_stream.json similarity index 100% rename from cloud-service-providers/google-cloud/vertexai/python/samples/request_stream.json rename to cloud-service-providers/google-cloud/vertexai/llm/samples/request_stream.json diff --git a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/README.md b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/README.md new file mode 100644 index 00000000..268e826f --- /dev/null +++ b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/README.md @@ -0,0 +1,188 @@ +# NVIDIA NeMo Retriever NIM on GCP Vertex AI + +**NVIDIA NeMo Retriever NIM** provides easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. Text Retriever NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration. + +Leveraging NVIDIA’s GPU acceleration on Google Cloud Platform, NeMo Retriever NIM offers an efficient and scalable path to inference with unparalleled performance. + +This repository demonstrates deploying and running inference with the NeMo Retriever Text Embedding NIM (NREM NIM) [nv-embedqa-e5-v5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5) on **GCP Vertex AI** with NVIDIA GPUs. 
+ +## Prerequisites +* [NGC API KEY](https://org.ngc.nvidia.com/setup/personal-keys) +* [NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli) +* [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench/introduction) +* [gcloud CLI](https://cloud.google.com/sdk/docs/install) + +## Run NREM NIM on Vertex AI Workbench Instance + +To deploy NREM NIM on Vertex AI, start by obtaining the required credentials and completing the prerequisite setup. + +Next, create a Vertex AI Workbench instance with GPU support. + +Once your instance is ready, follow the instructions provided in the Jupyter notebook to complete the deployment process. + +The steps are outlined below: + +* Pull NREM NIM container from NGC. +* Push NREM NIM container to Artifact Registry. +* Run NREM NIM container and make inference within the notebook interface. +* Upload NIM container as a Vertex AI Model Resource. +* Create a Vertex AI Endpoint Resource. +* Deploy the Model Resource to the Endpoint Resource. +* Generate prediction responses from Endpoint Resource. + +Finally, NREM NIM will be capable of performing inference both locally within the notebook interface and through the Vertex AI endpoint, which can be accessed via Vertex AI `Model Registry` and `Online prediction`. + +### 1. Create a Vertex AI Workbench Instance +Create a new Vertex AI Workbench instance and select `ADVANCED OPTIONS`. Choose NVIDIA GPUs (e.g. L4 for the G2 machine series) and the recommended [Disk Space](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/support-matrix.html) for the specific NREM NIM. + +[](HighLevelArch) + +### 2. Run NREM NIM on JupyterLab Notebook +Select `OPEN JUPYTERLAB` on the instance, and install the required packages per `requirements.txt`. + +Run the `nrem-nim-vertexai.ipynb` Jupyter notebook, which provides step-by-step guidance on how to deploy and run inference on the NREM NIM container, either within the notebook interface or via a Vertex AI endpoint resource. 
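The Vertex AI half of these steps can be sketched with the `google-cloud-aiplatform` Python SDK. This is a minimal outline rather than code from the notebook, and the image URI, resource names, and machine shape below are illustrative assumptions:

```python
# Sketch of the upload -> endpoint -> deploy flow on Vertex AI.
# All concrete values below are illustrative placeholders, not repo settings.

def deployment_spec(image_uri, model_name="nrem-embedqa-e5-v5"):
    """Collect the parameters the deployment steps above need."""
    return {
        "display_name": model_name,
        "endpoint_name": model_name + "_endpoint",
        "serving_container_image_uri": image_uri,  # NIM image pushed to Artifact Registry
        "serving_container_ports": [8000],         # NIM serves its HTTP API on port 8000
        "machine_type": "g2-standard-12",          # G2 series pairs with NVIDIA L4
        "accelerator_type": "NVIDIA_L4",
        "accelerator_count": 1,
    }

def deploy_nim(project, region, image_uri):
    # Imported lazily so the sketch can be read and tested without GCP credentials.
    from google.cloud import aiplatform

    spec = deployment_spec(image_uri)
    aiplatform.init(project=project, location=region)
    model = aiplatform.Model.upload(
        display_name=spec["display_name"],
        serving_container_image_uri=spec["serving_container_image_uri"],
        serving_container_ports=spec["serving_container_ports"],
    )
    endpoint = aiplatform.Endpoint.create(display_name=spec["endpoint_name"])
    model.deploy(
        endpoint=endpoint,
        machine_type=spec["machine_type"],
        accelerator_type=spec["accelerator_type"],
        accelerator_count=spec["accelerator_count"],
    )
    return endpoint
```

The notebook performs the same sequence cell by cell; the sketch only groups it into one function for readability.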
+ +If the NREM NIM container launched successfully, you will see the following output in the cell or deployment log: + +```shell +========================================= +== NVIDIA Retriever Text Embedding NIM == +========================================= + +NVIDIA Release 1.0.1 (build 0cf1c7ad8e51bdffc4e4d4226735dbb7f59d70d4) +Model: nvidia/nv-embedqa-e5-v5 + +Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +This NIM container is governed by the NVIDIA AI Product Agreement here: +https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/. +A copy of this license can be found under /opt/nim/LICENSE. + +The use of this model is governed by the AI Foundation Models Community License +here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf. +``` + +### 3. Inference in Online prediction +After deploying the NREM NIM container to the endpoint, check Vertex AI `Model Registry` and `Online prediction` for model/endpoint version details and event logs. + +[](HighLevelArch) + +Perform endpoint inference using the OpenAI Python API or CLI. + +> [!IMPORTANT] +> Please use `rawPredict` for endpoint inference, as the `predict` method requires additional request formatting. 
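To illustrate the difference between the two request bodies, here is a small sketch (not notebook code; the helper names are hypothetical): `predict` expects the Vertex AI `{"instances": [...]}` envelope, while `rawPredict` forwards the NIM's native OpenAI-style body unchanged.

```python
import json

# Hypothetical helpers contrasting the two Vertex AI request bodies;
# only the payload shape is the point here.
def predict_body(payload):
    # `predict` wraps each request in the Vertex AI "instances" envelope.
    return json.dumps({"instances": [payload]})

def raw_predict_body(payload):
    # `rawPredict` passes the NIM's native body through untouched.
    return json.dumps(payload)

payload = {
    "model": "nvidia/nv-embedqa-e5-v5",
    "input": ["Hello world"],
    "input_type": "query",
}
print(raw_predict_body(payload))
```

Because the NIM container already speaks the OpenAI-style embeddings API, `rawPredict` lets you reuse the same payload locally and on the endpoint.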
+ > +Sample request and response: + +* Payload +```python +import json + +payload_model = "nvidia/nv-embedqa-e5-v5" + +inputs = ["Hello world"] + +payload = { + "model": payload_model, + "input": inputs, + "input_type": "query" +} + +with open("request_nrem.json", "w") as outfile: + json.dump(payload, outfile) +``` + +* Python Inference +```python +import json +from pprint import pprint +from google.api import httpbody_pb2 +from google.cloud import aiplatform +from google.cloud import aiplatform_v1 + +http_body = httpbody_pb2.HttpBody( + data=json.dumps(payload).encode("utf-8"), + content_type="application/json", +) + +req = aiplatform_v1.RawPredictRequest( + http_body=http_body, endpoint=endpoint.resource_name +) + +print("Request") +print(req) +pprint(json.loads(req.http_body.data)) +print() + +API_ENDPOINT = "{}-aiplatform.googleapis.com".format(region) +client_options = {"api_endpoint": API_ENDPOINT} + +pred_client = aiplatform.gapic.PredictionServiceClient(client_options=client_options) + +response = pred_client.raw_predict(req) +print("--------------------------------------------------------------------------------------") +print("Response") +print("Length of Embeddings:", len(json.loads(response.data)['data'][0]['embedding'])) +pprint(json.loads(response.data)) +``` + + +```shell +Request +endpoint: "projects/$project_id/locations/us-central1/endpoints/$ENDPOINT_ID" +http_body { + content_type: "application/json" + data: "{\"model\": \"nvidia/nv-embedqa-e5-v5\", \"input\": [\"Hello world\"], \"input_type\": \"query\"}" +} + +{'input': ['Hello world'], + 'input_type': 'query', + 'model': 'nvidia/nv-embedqa-e5-v5'} + +-------------------------------------------------------------------------------------- +Response +Length of Embeddings: 4096 +{'data': [{'embedding': [0.0145416259765625, + 0.0167388916015625, + ... 
+ 0.0050506591796875], + 'index': 0, + 'object': 'embedding'}], + 'model': 'nvidia/nv-embedqa-e5-v5', + 'object': 'list', + 'usage': {'prompt_tokens': 5, 'total_tokens': 5}} + ``` + + * CLI Inference +```shell +! curl \ + --request POST \ + --header "Authorization: Bearer $(gcloud auth print-access-token)" \ + --header "Content-Type: application/json" \ + https://$region-prediction-aiplatform.googleapis.com/v1/projects/$project_id/locations/$region/endpoints/$ENDPOINT_ID:rawPredict \ + --data "@request_nrem.json" +``` +```shell +{ + "object": "list", + "data": [ + { + "index": 0, + "embedding": [ + 0.0145416259765625, + 0.0167388916015625, + ... + 0.0050506591796875 + ], + "object": "embedding" + } + ], + "model": "nvidia/nv-embedqa-e5-v5", + "usage": { + "prompt_tokens": 5, + "total_tokens": 5 + } +} +``` + + ## Reference + For more information about NIM, please refer to + * [NGC User Guide](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html) + * [NVIDIA Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/index.html) + * [NeMo Text Embedding NIM API](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/reference.html) diff --git a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_01.png b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_01.png new file mode 100644 index 00000000..3bfbd14a Binary files /dev/null and b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_01.png differ diff --git a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_02.png b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_02.png new file mode 100644 index 00000000..54eeec7f Binary files /dev/null and b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/imgs/vertexai_02.png differ diff --git a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/nrem-nim-vertexai.ipynb 
b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/nrem-nim-vertexai.ipynb new file mode 100644 index 00000000..777a51ff --- /dev/null +++ b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/nrem-nim-vertexai.ipynb @@ -0,0 +1,1369 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "f72c67a4", + "metadata": {}, + "outputs": [], + "source": [ + "# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", + "# SPDX-License-Identifier: Apache-2.0\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# http://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "id": "ab89d38c-6977-4c6f-9a0f-79676a820fa1", + "metadata": {}, + "source": [ + "## Deploy NVIDIA NeMo Retriever NIM to GCP Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "9bc36ce1-5671-470b-83b9-92a7450eefa9", + "metadata": {}, + "source": [ + "### Objective\n", + "\n", + "NVIDIA NeMo Text Retriever NIM APIs provide easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. 
Text Retriever NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.\n", + "\n", + "- NeMo Retriever Text Embedding NIM - Boosts text question-answering retrieval performance, providing high quality embeddings for many downstream NLP tasks.\n", + "- NeMo Retriever Text Reranking NIM - Enhances the retrieval performance further with a fine-tuned reranker, finding the most relevant passages to provide as context when querying an LLM.\n", + "\n", + "In this notebook, you learn how to run the NVIDIA NeMo Retriever Text Embedding NIM (NREM NIM) container on Google Cloud Vertex AI, run inference to get model responses, and deploy the model to a Vertex AI endpoint.\n", + "\n", + "This tutorial uses the following NVIDIA NREM NIM and Google Cloud services:\n", + "\n", + "- NVIDIA NREM NIM Container\n", + "- Vertex AI Model Resource\n", + "- Vertex AI Model Registry\n", + "- Vertex AI Endpoint Resource\n", + "- Vertex AI Prediction\n", + "- Artifact Registry\n", + "- Cloud Storage\n", + "\n", + "The steps performed include:\n", + "\n", + "- Pull NVIDIA NREM NIM container from NGC.\n", + "- Push NVIDIA NREM NIM container to Artifact Registry.\n", + "- Run NREM NIM container and make inference within the notebook interface.\n", + "- Upload NREM NIM container as a Vertex AI Model Resource.\n", + "- Create a Vertex AI Endpoint Resource.\n", + "- Deploy the Model Resource to the Endpoint Resource.\n", + "- Generate prediction responses from Endpoint Resource.\n" + ] + }, + { + "cell_type": "markdown", + "id": "925a8a7f-d92b-443e-984b-0902742cf5aa", + "metadata": { + "tags": [] + }, + "source": [ + "### Install and Import packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9765a47d-d7e9-4f00-acda-5c5aedfc9698", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! 
pip3 install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "6c5ed933-7d29-4a53-9a59-5d8bc1956153", + "metadata": {}, + "source": [ + "Restart kernel after installs so that the environment can access the new packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11544c12-8d8b-47a2-a32a-8129affab5a6", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "\n", + "app = IPython.Application.instance()\n", + "app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "oHkv14VRimlG", + "metadata": { + "executionInfo": { + "elapsed": 2680, + "status": "ok", + "timestamp": 1721243802293, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "oHkv14VRimlG", + "tags": [] + }, + "outputs": [], + "source": [ + "import google.cloud.aiplatform_v1beta1 as aip_beta\n", + "from google.cloud.aiplatform import Endpoint, Model\n", + "from google.api_core.exceptions import InvalidArgument\n", + "import requests" + ] + }, + { + "cell_type": "markdown", + "id": "84772f30-4f4c-4356-93be-4d60b17d0437", + "metadata": {}, + "source": [ + "### Authenticate to Google Cloud\n", + "Please run the following commands in a separate **Terminal** window." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df20441a-e684-4bf5-b6da-496f9664ede9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! gcloud auth login\n", + "! 
gcloud auth application-default login" + ] + }, + { + "cell_type": "markdown", + "id": "b8ed8ff4-2deb-45af-ad9f-ccbd7d801a8f", + "metadata": {}, + "source": [ + "### Set Up\n", + "\n", + "The example provided is NVIDIA Retrieval QA E5 Embedding v5 NIM (`nv-embedqa-e5-v5` container from NGC), on Vertex AI Workbench Notebook `g2-standard-12` instance with NVIDIA L4 GPU.\n", + "The solution is also applicable to other NeMo Retriever models, including `llama-3.2-nv-embedqa-1b-v2`, `nv-yolox-page-elements-v1`, and `llama-3.2-nv-rerankqa-1b-v2`.\n", + "\n", + "\n", + "IAM role requirements:\n", + "* Vertex AI User `(roles/aiplatform.user)` \n", + "* Artifact Registry Repository Administrator `(roles/artifactregistry.repoAdmin)` \n", + "* Storage Admin `(roles/storage.admin)`" + ] + }, + { + "cell_type": "markdown", + "id": "6ef1c574-90ae-44de-83c0-f03274959b0d", + "metadata": {}, + "source": [ + "Get account name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "vn6KU7wfjew5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 970, + "status": "ok", + "timestamp": 1721238645640, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "vn6KU7wfjew5", + "outputId": "ca54be92-1d89-43cf-96e8-25d28f7577b2", + "tags": [] + }, + "outputs": [], + "source": [ + "import requests\n", + "gcloud_token = !gcloud auth print-access-token\n", + "gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()\n", + "account_email = gcloud_tokeninfo['email']\n", + "account_name = gcloud_tokeninfo['email'].split('@')[0]\n", + "print(account_email)\n", + "print(account_name)" + ] + }, + { + "cell_type": "markdown", + "id": "300579f4-c262-4516-a1e5-a2ea0e3d7414", + "metadata": {}, + "source": [ + "Please set the value of the following variables" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "7d43b7fe-0d54-4c64-a87f-8a2ddbdb72ee", + "metadata": {}, + "outputs": [], + "source": [ + "region = None # please set here, e.g. us-central1\n", + "project_id = None # please set here\n", + "public_repository = None # please set here any value to name the public Artifact Registry" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "g7chyrzCF9yoWk9IFyfdOe4W", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "executionInfo": { + "elapsed": 196, + "status": "ok", + "timestamp": 1721251321866, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "g7chyrzCF9yoWk9IFyfdOe4W", + "outputId": "e2090fb2-bca8-46c9-c79b-d2369a53377f", + "tags": [] + }, + "outputs": [], + "source": [ + "private_repository = account_name\n", + "bucket_url = f\"gs://{account_name}\"\n", + "\n", + "nim_model = \"nrem:embedqa-e5-v5-1.1.1\"\n", + "# NIM in NGC\n", + "ngc_nim_image = \"nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.1\"\n", + "container_name = \"nv-embedqa-e5-v5\"\n", + "# NIM in Artifact Registry (AR)\n", + "public_nim_image = f\"{region}-docker.pkg.dev/{project_id}/{public_repository}/{nim_model}\"\n", + "private_nim_image = f\"{region}-docker.pkg.dev/{project_id}/{private_repository}/{nim_model}\"\n", + "\n", + "va_model_name = \"nrem-embedqa-e5-v5\"\n", + "\n", + "machine_type = \"g2-standard-12\"\n", + "accelerator_type = \"NVIDIA_L4\"\n", + "accelerator_count = 1\n", + "\n", + "endpoint_name = va_model_name+\"_endpoint\"\n", + "payload_model = \"nvidia/nv-embedqa-e5-v5\"" + ] + }, + { + "cell_type": "markdown", + "id": "53e0b9ab-8a24-4bbf-866e-fa7f9e77c70e", + "metadata": {}, + "source": [ + "Grant required IAM roles to the service account\n", + "\n", + "*Note: If \"Use default Compute Engine service account\" is selected when creating the workbench instance, Vertex AI service account is the same as Compute Engine, as example below.*" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "02232b9b-0735-4803-8d8a-64d5fbca9344", + "metadata": {}, + "outputs": [], + "source": [ + "project_number = !gcloud projects describe {project_id} --format=\"value(projectNumber)\"\n", + "service_account = \"serviceAccount:\" + project_number[0] + \"-compute@developer.gserviceaccount.com\"\n", + "role1 = \"roles/aiplatform.user\"\n", + "role2 = \"roles/artifactregistry.repoAdmin\"\n", + "role3 = \"roles/storage.admin\"\n", + "\n", + "! gcloud projects add-iam-policy-binding {project_id} --member={service_account} --role={role1}\n", + "! gcloud projects add-iam-policy-binding {project_id} --member={service_account} --role={role2}\n", + "! gcloud projects add-iam-policy-binding {project_id} --member={service_account} --role={role3}" + ] + }, + { + "cell_type": "markdown", + "id": "4563bc1e-4ccf-4210-ab57-bacf08bb204f", + "metadata": {}, + "source": [ + "If Cloud Storage Bucket or Artifact Registry repository doesn't already exist: Run the following cell to create your bucket or repository.\n", + "\n", + "- Private Artifact Registry is to securely store NIM containers with minimum user access, for testing, validation, and maintaining a version-controlled, auditable copy.\n", + "\n", + "- Public Artifact Registry is optional, enabling more selected users to access the NIM containers, while adhering to the Principle of Least Privilege." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c6463ab-cc0e-4222-8382-4aed34455831", + "metadata": {}, + "outputs": [], + "source": [ + "! gsutil mb -l {region} -p {project_id} {bucket_url}\n", + "! gcloud artifacts repositories create {public_repository} --repository-format=docker --location={region}\n", + "! 
gcloud artifacts repositories create {private_repository} --repository-format=docker --location={region}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5ff1ea3-6111-4e37-b5d5-daaa97f7339f", + "metadata": {}, + "outputs": [], + "source": [ + "# (Optional) Create public AR if needed\n", + "user = 'serviceAccount:test123@example.iam.gserviceaccount.com' # Please set member to grant AR read access to, e.g. user:test-user@gmail.com, group:admins@example.com, \n", + " # serviceAccount:test123@example.domain.com, or domain:example.domain.com\n", + "! gcloud artifacts repositories add-iam-policy-binding {public_repository} --location={region} --member={user} --role=roles/artifactregistry.reader" + ] + }, + { + "cell_type": "markdown", + "id": "73d3d2c3-92ad-49c5-85a5-229602c25e06", + "metadata": {}, + "source": [ + "Initialize Vertex AI SDK for Python" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "uYMNxGozkZyj", + "metadata": { + "executionInfo": { + "elapsed": 1, + "status": "ok", + "timestamp": 1721251252748, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "uYMNxGozkZyj", + "tags": [] + }, + "outputs": [], + "source": [ + "from google.cloud import aiplatform\n", + "\n", + "aiplatform.init(project=project_id, location=region, staging_bucket=bucket_url)" + ] + }, + { + "cell_type": "markdown", + "id": "43c98fc3-d8a2-4abd-bd86-b1e120f07c4a", + "metadata": {}, + "source": [ + "GCP Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "455fad6a-e213-4e7e-baab-84f78b17fbe4", + "metadata": {}, + "outputs": [], + "source": [ + "def run_bash_cmd(cmd):\n", + " import subprocess\n", + "\n", + " if isinstance(cmd, str):\n", + " process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True, text=True)\n", + " elif isinstance(cmd, list):\n", + " process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, 
shell=False, text=True)\n", + " \n", + " output, error = process.communicate()\n", + " if error:\n", + " raise Exception(error)\n", + " else:\n", + " print(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d570257-69fd-48e3-a558-8d861db11818", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "bash_cmd = f\"\"\"\n", + " export region={region}\n", + " gcloud config set ai_platform/region {region}\n", + " gcloud config set project {project_id}\n", + " gcloud auth configure-docker {region}-docker.pkg.dev\n", + " \"\"\"\n", + "run_bash_cmd(bash_cmd)" + ] + }, + { + "cell_type": "markdown", + "id": "4cb66464-949b-4998-b9fc-620acba7fd8a", + "metadata": { + "tags": [] + }, + "source": [ + "### NIM Container\n", + "\n", + "* **NGC_API_KEY**\n", + "\n", + "To access the NIM container from the NGC catalog, `NGC_API_KEY` is required.\n", + "\n", + "The credential will be used in Vertex AI as an environment variable during model uploading, and will show on the Model Registry Version Details UI. **Attention: the credential will be visible to all Vertex AI users in the same project.**\n", + "\n", + "Please upload a json file to Cloud Storage Bucket to use the `read_key()` function below, format `{\"NGC_API_KEY\": \"Your Key\"}`.\n", + "\n", + "Reference: [NGC User Guide](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html)\n", + "\n", + "* **Artifact Registry**\n", + "\n", + "We will pull NIM container from NGC, then push to a private AR repository. \n", + "\n", + "(Optional) Then we pull NIM container from the private AR and push to a public AR repository, which allows more users in the project to access NIM. 
\n" + ] + }, + { + "cell_type": "markdown", + "id": "e3d023de-30f9-4922-80ba-d9ffce1d74d4", + "metadata": {}, + "source": [ + "#### Set NGC API KEY" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "X0s70ghC-gY_", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 383, + "status": "ok", + "timestamp": 1721238662144, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "X0s70ghC-gY_", + "outputId": "7ef6b24a-ca64-44b9-ecdb-8c3c5af7c47f", + "tags": [] + }, + "outputs": [], + "source": [ + "NGC_API_KEY = None # please set here" + ] + }, + { + "cell_type": "markdown", + "id": "6b460cda-e7f7-4e14-83c5-4707e2b3770c", + "metadata": {}, + "source": [ + "#### Pull NIM from NGC and Push to GCP AR" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "kI-V4mFQF0DW", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 3277, + "status": "ok", + "timestamp": 1721243311334, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "kI-V4mFQF0DW", + "outputId": "ce78f943-0702-482c-bc83-a53ef531920e", + "tags": [] + }, + "outputs": [], + "source": [ + "# Login to NGC\n", + "from pathlib import Path\n", + "local_nim_cache=str(Path(\".cache/nim\").absolute())\n", + "\n", + "bash_cmd = f\"\"\"\n", + " sudo apt-get install -y nvidia-docker2\n", + " export NGC_API_KEY={NGC_API_KEY}\n", + " echo \"export NGC_API_KEY={NGC_API_KEY}\" >> ~/.bashrc\n", + " echo \"$NGC_API_KEY\" | docker login nvcr.io --username '$oauthtoken' --password-stdin\n", + "\n", + " export LOCAL_NIM_CACHE={local_nim_cache}\n", + " mkdir -p \"$LOCAL_NIM_CACHE\"\n", + " echo \"Local NIM cache created\"\n", + " \"\"\"\n", + "\n", + "run_bash_cmd(bash_cmd)\n", + "\n", + "# Pull NIM container from NGC and run container\n", + "docker_cmd = [\n", + " \"docker\", \"run\", \"-d\", \"--rm\",\n", + " 
f\"--name={container_name}\",\n", + " \"--gpus\", \"all\",\n", + " \"-e\", f\"NGC_API_KEY={NGC_API_KEY}\",\n", + " \"-v\", f\"{local_nim_cache}:/opt/nim/.cache\",\n", + " \"-p\", \"8000:8000\",\n", + " ngc_nim_image\n", + "]\n", + "\n", + "print(f\"NIM container {ngc_nim_image} pulled from NGC successfully, running container is\")\n", + "run_bash_cmd(docker_cmd)\n", + "\n", + "# Push NIM container to private AR repository\n", + "bash_cmd = f\"\"\"\n", + " docker tag {ngc_nim_image} {private_nim_image}\n", + "\n", + " docker push {private_nim_image}\n", + " \"\"\"\n", + "\n", + "run_bash_cmd(bash_cmd)\n", + "print(f\"NIM container {ngc_nim_image} pushed to Artifact Registry {private_nim_image} successfully\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c8afe765-f66a-44eb-83a0-30c27b6403c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Optional\n", + "# Push NIM container to public AR repository\n", + "bash_cmd = f\"\"\"\n", + " docker tag {private_nim_image} {public_nim_image}\n", + "\n", + " docker push {public_nim_image}\n", + " \"\"\"\n", + "\n", + "run_bash_cmd(bash_cmd)\n", + "print(f\"NIM container {private_nim_image} pushed to Artifact Registry {public_nim_image} successfully\")" + ] + }, + { + "cell_type": "markdown", + "id": "5b3a6114-4105-4c86-88e9-246d6b43ed09", + "metadata": {}, + "source": [ + "### Run NIM Container Within Interface" + ] + }, + { + "cell_type": "markdown", + "id": "e2932919-6f68-449c-8184-571036a6798e", + "metadata": {}, + "source": [ + "Run NREM NIM container locally in **Terminal** or **Another notebook**, keep the container active, then run inference with the Python OpenAI API or CLI commands to get model responses in the Notebook interface." 
+ ] + }, + { + "cell_type": "markdown", + "id": "a62fcae6-e470-4fc4-9409-d4cef67a1cf1", + "metadata": {}, + "source": [ + "Terminal" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b3ab906-9047-4af0-8016-a29302950df1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Run this command here, used for the following variable definition\n", + "print(private_nim_image)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d35fb809-3f7f-44ba-a97f-4121564c646d", + "metadata": {}, + "outputs": [], + "source": [ + "# If Terminal, use $variable_name, add export commands\n", + "export container_name=nv-embedqa-e5-v5\n", + "export NGC_API_KEY=None # please set here\n", + "export local_nim_cache=~/.cache/nim \n", + "export private_nim_image=None # please set here\n", + "\n", + "docker run -it --rm --name=$container_name \\\n", + " --runtime=nvidia \\\n", + " --gpus all \\\n", + " --shm-size=16GB \\\n", + " -e NGC_API_KEY=$NGC_API_KEY \\\n", + " -v $local_nim_cache\":/opt/nim/.cache\" \\\n", + " -u $(id -u) \\\n", + " -p 8000:8000 \\\n", + " $private_nim_image" + ] + }, + { + "cell_type": "markdown", + "id": "92e77562-277e-41a6-8d44-5d178d3922a5", + "metadata": {}, + "source": [ + "Notebook" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29a3c57f-6388-491b-b42e-a773a32017aa", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# If Notebook, use {variable_name}, add variables definition\n", + "container_name = \"nv-embedqa-e5-v5\"\n", + "NGC_API_KEY = None # please set here\n", + "local_nim_cache = \"~/.cache/nim\" \n", + "private_nim_image = None # please set here\n", + "\n", + "! 
docker run -it --rm --name={container_name} \\\n", + " --runtime=nvidia \\\n", + " --gpus all \\\n", + " --shm-size=16GB \\\n", + " -e NGC_API_KEY={NGC_API_KEY} \\\n", + " -v {local_nim_cache}\":/opt/nim/.cache\" \\\n", + " -u $(id -u) \\\n", + " -p 8000:8000 \\\n", + " {private_nim_image}" + ] + }, + { + "cell_type": "markdown", + "id": "d9a971c7-be58-4fad-8739-1b58515a0489", + "metadata": {}, + "source": [ + "Run the commands below in the current notebook interface." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce051b0f-5d8d-4183-a601-b51781459db3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! docker images | grep nrem" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94a0672d-5c48-4a6b-a47c-9f97c6f30908", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! docker ps \n", + "! echo \"\"\n", + "CONTAINER_ID = !docker ps | awk 'NR>1 {print $1}'\n", + "CONTAINER_ID = CONTAINER_ID[0]\n", + "! echo 'Running Container is' $CONTAINER_ID\n", + "! echo 'IP Address'\n", + "# ! docker inspect $CONTAINER_ID\n", + "IPAddress= !docker exec $CONTAINER_ID sh -c \"hostname --ip-address\" \n", + "IPAddress=IPAddress[0]\n", + "! echo $IPAddress\n", + "! echo \"\"\n", + "! echo \"NIM Model and Profile\"\n", + "! docker inspect $CONTAINER_ID |grep -i model" + ] + }, + { + "cell_type": "markdown", + "id": "0ca20056-db4e-44aa-902a-ce76da8f38fe", + "metadata": {}, + "source": [ + "#### Make Inference within Interface\n", + "After running the NREM NIM container and keeping it active, we can make inference requests to the model and get responses. NREM NIM on Vertex AI Workbench supports both the OpenAI Python API and the CLI.\n", + "\n", + "With the `embeddings` endpoint, `input` could be set as input text to be transformed into vectors by the model. 
`input_type` selects the mode for embedding models such as NV-Embed-QA and E5, which operate in `passage` or `query` mode: `passage` is used when generating embeddings during indexing, and `query` when generating embeddings at query time.\n", + "\n", + "Since the OpenAI API does not accept `input_type` as a parameter, you can instead append a `-query` or `-passage` suffix to the model name (for example, `nv-embedqa-e5-v5-query`) and omit the `input_type` field entirely for OpenAI API compliance.\n", + "\n", + "*Note: You may need to change the IP address in the URL when making a request (e.g. http://172.18.0.2:8000/v1/embeddings).*\n", + "\n", + "Reference: [NVIDIA Embedding API](https://docs.api.nvidia.com/nim/reference/nvidia-nv-embedqa-e5-v5), [Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/reference.html)" + ] + }, + { + "cell_type": "markdown", + "id": "14cb0f8b-cb49-4c9d-816c-f1fd6cff5a5c", + "metadata": {}, + "source": [ + "CLI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b992c69-c4df-47ef-b6ab-d5e0c41889ae", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Confirm the service is ready to handle inference requests\n", + "! curl -X 'GET' 'http://localhost:8000/v1/health/ready'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41d68c90-d684-4c18-8265-bf56b3e28925", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Generate Embeddings\n", + "! 
curl -X \"POST\" \\\n", + " \"http://localhost:8000/v1/embeddings\" \\\n", + " -H 'accept: application/json' \\\n", + " -H 'Content-Type: application/json' \\\n", + " -d '{ \"input\": [\"Hello world\"],\n", + " \"model\": \"nvidia/nv-embedqa-e5-v5\",\n", + " \"input_type\": \"query\"\n", + " }'" + ] + }, + { + "cell_type": "markdown", + "id": "4b22514c-e7ab-4147-af03-59a2e70d101a", + "metadata": {}, + "source": [ + "Python" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f63ae01-1f27-4205-be64-e1d05f81662a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Generate Embeddings\n", + "from openai import OpenAI\n", + "client = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-used\")\n", + "inputs = [\"Hello World\", \"Once upon a time\"]\n", + "input_type = \"-query\" # or \"-passage\"\n", + "\n", + "# payload_model is defined earlier in this notebook\n", + "response = client.embeddings.create(\n", + " model=payload_model + input_type,\n", + " input=inputs,\n", + ")\n", + "embeddings = response.data[0].embedding\n", + "print(len(embeddings))" + ] + }, + { + "cell_type": "markdown", + "id": "817da080-24b6-46a5-a6c5-4e1ee9aa72cd", + "metadata": {}, + "source": [ + "Stop NIM container" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d45275c-e271-4c45-ae3c-e5904547a83b", + "metadata": {}, + "outputs": [], + "source": [ + "! docker stop $CONTAINER_ID" + ] + }, + { + "cell_type": "markdown", + "id": "086c88a7-94b3-407d-ada1-4c049efb92f4", + "metadata": {}, + "source": [ + "### Endpoint Deployment\n", + "\n", + "Next, we can proceed to endpoint deployment, which makes the model endpoint available on Vertex AI Online Prediction.\n", + "\n", + "The steps are as follows:\n", + "\n", + "* Upload the NIM container as a Vertex AI Model resource.\n", + "* Create a Vertex AI Endpoint resource.\n", + "* Deploy the Model resource to the Endpoint resource.\n", + "* Generate raw prediction requests and get responses." 
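, + "\n", + "As a rough pseudocode sketch (the names `va_model_name`, `endpoint_name`, and the machine settings are defined earlier in this notebook, and the `...` elisions stand for the full argument lists shown in the cells that follow), the four steps map onto the Vertex AI SDK like this:\n", + "\n", + "```python\n", + "model = aiplatform.Model.upload(display_name=va_model_name, ...)    # 1. upload the NIM container\n", + "endpoint = aiplatform.Endpoint.create(display_name=endpoint_name)   # 2. create an endpoint\n", + "model.deploy(endpoint=endpoint, machine_type=machine_type, ...)     # 3. deploy model to endpoint\n", + "response = endpoint.raw_predict(body=..., headers=...)              # 4. raw prediction request\n", + "```"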
+ ] + }, + { + "cell_type": "markdown", + "id": "VADQagSOfSgN", + "metadata": { + "id": "VADQagSOfSgN" + }, + "source": [ + "#### Upload NIM as a Vertex AI Model Resource" + ] + }, + { + "cell_type": "markdown", + "id": "ZR0xiIeGfQmY", + "metadata": { + "id": "ZR0xiIeGfQmY" + }, + "source": [ + "First, we upload the NREM NIM image as a Vertex AI model resource using the `upload()` method, with the following parameters:\n", + "\n", + "* `display_name`: The human-readable name for the Model resource.\n", + "* `artifact_uri`: The Cloud Storage location of the model artifacts. If the container image includes the model artifacts that you need to serve predictions, there is no need to load files from Cloud Storage.\n", + "* `parent_model`: The parent resource name of an existing model.\n", + "* `model_version_aliases`: The aliases of the model version to create.\n", + "* `model_version_description`: The description of the model version.\n", + "* `is_default_version`: Whether the model version is the default version.\n", + "\n", + "* `serving_container_image_uri`: The serving container image to use when the model is deployed to a Vertex AI endpoint.\n", + "\n", + "* `serving_container_command`: The serving binary (HTTP server) to start up.\n", + "\n", + "* `serving_container_shared_memory_size_mb`: Shared memory is an inter-process communication (IPC) mechanism that allows multiple processes to access and manipulate a common block of memory. The default shared memory size is 64MB. Model servers such as vLLM or NVIDIA Triton use shared memory to cache internal data during model inference. Also, because shared memory can be used for cross-GPU communication, using more shared memory can improve performance for accelerators without NVLink capabilities (for example, L4) if the model container requires communication across GPUs. NIM generally requires a larger shared memory size than the default.
\n", + "\n", + "* `serving_container_environment_variables`: Environment variables that specify required container settings, such as the authentication key.\n", + "\n", + "* `serving_container_args`: The arguments to pass to the serving binary. For example:\n", + "\n", + " -- `model_name`: The human-readable name to assign to the model.\n", + "\n", + " -- `model_base_name`: Where to store the model artifacts in the container. The Vertex service sets the variable `AIP_STORAGE_URI` to where the service installed the model artifacts in the container.\n", + "\n", + " -- `rest_api_port`: The port to which to send REST-based prediction requests. NREM NIM uses `8000`.\n", + "\n", + " -- `port`: The port to which to send gRPC-based prediction requests. NREM NIM uses `8000`.\n", + "\n", + "* `serving_container_health_route`: The URL for the service to periodically ping for a response to verify that the serving binary is running. For NREM NIM, this is `/v1/health/ready`.\n", + "\n", + "* `serving_container_predict_route`: The URL for the service to route REST-based prediction requests to. For NREM NIM, this is `/v1/embeddings`.\n", + "\n", + "* `serving_container_ports`: A list of ports for the HTTP server to listen for requests.\n", + "\n", + "* `sync`: Whether to wait for the process to complete, or return immediately (async).\n", + "\n", + "Uploading a model into a Vertex AI Model resource may take a few moments. After completion, the model will appear in the Vertex AI Model Registry." 
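, + "\n", + "For illustration only (the deployment cell below does not need them), the `serving_container_args` described above would be passed as a list of strings; the exact flag names and values here are hypothetical:\n", + "\n", + "```python\n", + "serving_container_args=[\"--model_name=nv-embedqa-e5-v5\", \"--rest_api_port=8000\"]  # hypothetical values\n", + "```"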
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "HVFcXzZZs1Gp", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 374, + "status": "ok", + "timestamp": 1721244274689, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "HVFcXzZZs1Gp", + "outputId": "064db104-e272-456e-d377-39ebbce664f7", + "tags": [] + }, + "outputs": [], + "source": [ + "from google.api_core.future.polling import DEFAULT_POLLING\n", + "from google.cloud.aiplatform import Endpoint, Model\n", + "DEFAULT_POLLING._timeout = 360000\n", + "\n", + "print(\"NREM NIM Container:\",private_nim_image)\n", + "\n", + "models = Model.list(filter=f'displayName=\"{va_model_name}\"')\n", + "\n", + "if models:\n", + " model = models[0]\n", + "else:\n", + " model = aiplatform.Model.upload(\n", + " display_name=va_model_name,\n", + " # parent_model=\"3585596478619385856\",\n", + " is_default_version=True,\n", + " # version_aliases=[\"v2\"], \n", + " # version_description=\"This is the second version of the model\",\n", + " serving_container_image_uri=private_nim_image,\n", + " serving_container_predict_route=\"/v1/embeddings\",\n", + " serving_container_health_route=\"/v1/health/ready\",\n", + " serving_container_environment_variables={\"NGC_API_KEY\": NGC_API_KEY, \"PORT\": \"8000\", \"shm-size\":\"16GB\"},\n", + " serving_container_shared_memory_size_mb=16000,\n", + " serving_container_ports=[8000],\n", + " sync=True,\n", + " )\n", + "model.wait()\n", + "\n", + "print(\"Model:\")\n", + "print(f\"\\tDisplay name: {model.display_name}\")\n", + "print(f\"\\tResource name: {model.resource_name}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43a64dfb-010c-4967-8a10-561e166d5553", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! 
gcloud ai models list --region=$region --filter=\"DISPLAY_NAME ~ .*nrem.*\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b5214c6-db87-4736-a076-94d6919ef5ae", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "MODEL_ID = !gcloud ai models list --region=$region --filter=\"DISPLAY_NAME ~ .*nrem.*\" | awk 'NR>1 {print $1}'\n", + "MODEL_ID = MODEL_ID[1]\n", + "MODEL_ID" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6ed15ce-05f1-4813-b9b2-0264926ce581", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def list_model_version(model_id: str, project: str, location: str):\n", + " \"\"\"\n", + " List all model versions of a model.\n", + " Args:\n", + " model_id: The ID of the model to list. Parent resource name of the model is also accepted.\n", + " project: The project ID.\n", + " location: The region name.\n", + " Returns:\n", + " versions: List of model versions.\n", + " \"\"\"\n", + " # Initialize the client.\n", + " aiplatform.init(project=project, location=location)\n", + "\n", + " # Initialize the Model Registry resource with the ID 'model_id'.The parent_name of Model resource can be also\n", + " # 'projects//locations//models/'\n", + " model_registry = aiplatform.models.ModelRegistry(model=model_id)\n", + "\n", + " # List all model versions of the model.\n", + " versions = model_registry.list_versions()\n", + "\n", + " return versions\n", + "\n", + "list_model_version(MODEL_ID, project_id, region)" + ] + }, + { + "cell_type": "markdown", + "id": "d98b2276-e633-4d03-8fc0-6256083e9ccc", + "metadata": {}, + "source": [ + "#### Create a Vertex AI Endpoint Resource" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e6c60f1-5741-4d72-ba15-b38440ea5d90", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "endpoints = Endpoint.list(filter=f'displayName=\"{endpoint_name}\"')\n", + "if endpoints:\n", + " endpoint = endpoints[0]\n", + "else:\n", + " 
print(f\"Endpoint {endpoint_name} doesn't exist, creating...\")\n", + " endpoint = aiplatform.Endpoint.create(display_name=endpoint_name)\n", + "print(\"Endpoint:\")\n", + "print(f\"\\tDisplay name: {endpoint.display_name}\")\n", + "print(f\"\\tResource name: {endpoint.resource_name}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8162ce0f-4aa7-4117-b880-d712ba803546", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! gcloud ai endpoints list --region=$region --filter=\"DISPLAY_NAME ~ .*nrem.*\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "682ab40a-3b35-4380-ad1e-5aa123bdbe84", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ENDPOINT_ID = !gcloud ai endpoints list --region=$region --filter=\"DISPLAY_NAME ~ .*nrem.*\" | awk 'NR>1 {print $1}'\n", + "ENDPOINT_ID = ENDPOINT_ID[1]\n", + "ENDPOINT_ID" + ] + }, + { + "cell_type": "markdown", + "id": "a45fc29c-6940-4cec-930e-f3303e5bc804", + "metadata": {}, + "source": [ + "#### Deploy Model Resource to Endpoint Resource" + ] + }, + { + "cell_type": "markdown", + "id": "82e8a803-a784-471a-86a0-d989df1ef85e", + "metadata": {}, + "source": [ + "Next, deploy the Vertex AI model resource to the endpoint resource with the following parameters:\n", + "\n", + "* `deployed_model_display_name`: The human-readable name for the deployed model.\n", + "\n", + "* `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs.\n", + " * If only one model, then specify `{ \"0\": 100 }`, where \"0\" refers to this model being uploaded and 100 means 100% of the traffic.\n", + " * If there are existing models on the endpoint, for which the traffic is split, then use model_id to specify `{ \"0\": percent, model_id: percent, ... }`, where model_id is the ID of an existing deployed model on the endpoint. 
The percentages must add up to 100.\n", + "\n", + "* `machine_type`: The machine type for each VM node instance.\n", + "\n", + "* `min_replica_count`: The minimum number of nodes to provision for auto-scaling.\n", + "\n", + "* `max_replica_count`: The maximum number of nodes to provision for auto-scaling.\n", + "\n", + "* `accelerator_type`: The type, if any, of GPU accelerators per provisioned node.\n", + "\n", + "* `accelerator_count`: The number, if any, of GPU accelerators per provisioned node.\n", + "\n", + "After successful deployment, the endpoint and associated deployed model will be available on Vertex AI Online Prediction." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7j5UICB4DSXZ", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 461 + }, + "executionInfo": { + "elapsed": 1364148, + "status": "error", + "timestamp": 1721245643103, + "user": { + "displayName": "", + "userId": "" + }, + "user_tz": 240 + }, + "id": "7j5UICB4DSXZ", + "outputId": "470f4ccc-2fb6-4c74-9141-21da5c2f56d8", + "tags": [] + }, + "outputs": [], + "source": [ + "model.deploy(\n", + " endpoint=endpoint,\n", + " deployed_model_display_name=va_model_name,\n", + " traffic_percentage=100,\n", + " machine_type=machine_type,\n", + " min_replica_count=1,\n", + " max_replica_count=1,\n", + " accelerator_type=accelerator_type,\n", + " accelerator_count=accelerator_count,\n", + " enable_access_logging=True,\n", + " sync=True,\n", + ")\n", + "\n", + "print(f\"Model {model.display_name} deployed at endpoint {endpoint.display_name}.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8rLOVGXHIVbF", + "metadata": { + "id": "8rLOVGXHIVbF" + }, + "outputs": [], + "source": [ + "print(endpoint.gca_resource)\n", + "endpoint_name = endpoint.resource_name\n", + "print(endpoint_name)\n", + "print(endpoint.list_models())" + ] + }, + { + "cell_type": "markdown", + "id": "7ac1e9d1-ddf0-4fdd-b0f0-66f67fa543fb", + "metadata": {}, +
"source": [ + "#### Endpoint Inference" + ] + }, + { + "cell_type": "markdown", + "id": "969ed41d-fa6a-4919-8625-a1a45fff2131", + "metadata": {}, + "source": [ + "Use the Endpoint object's `rawPredict` function to get responses from the deployed model; it accepts a request that directly matches the model's input format.\n", + "\n", + "If you use the alternative `Predict` function, it takes the following parameters:\n", + "\n", + "* `instances`: A list of message or prompt instances. Each instance should be an array of strings.\n", + "* `parameters`: A list of model parameters, e.g. temperature, max_tokens, top_p, stream.\n", + "\n", + "Endpoint inference is shown below with both the Python SDK and the CLI." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea11e507-54f9-4204-9422-590117ca6ec8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Create payload request\n", + "import json\n", + "\n", + "inputs = [\"Hello world\"]\n", + "\n", + "payload = {\n", + " \"model\": payload_model,\n", + " \"input\": inputs,\n", + " \"input_type\": \"query\"\n", + "}\n", + "\n", + "with open(\"request_nrem.json\", \"w\") as outfile:\n", + " json.dump(payload, outfile)" + ] + }, + { + "cell_type": "markdown", + "id": "2be7abd1-1ee0-47a1-8a09-27ac071d7511", + "metadata": {}, + "source": [ + "Python SDK" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e331a4fc-988a-4907-9213-fffc4236e1a7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import json\n", + "from pprint import pprint\n", + "from google.api import httpbody_pb2\n", + "from google.cloud import aiplatform_v1\n", + "\n", + "http_body = httpbody_pb2.HttpBody(\n", + " data=json.dumps(payload).encode(\"utf-8\"),\n", + " content_type=\"application/json\",\n", + ")\n", + "\n", + "req = aiplatform_v1.RawPredictRequest(\n", + " http_body=http_body, endpoint=endpoint.resource_name\n", + ")\n", + "\n", + "print(\"Request\")\n", + "print(req)\n", +
"pprint(json.loads(req.http_body.data))\n", + "print()\n", + "\n", + "API_ENDPOINT = \"{}-aiplatform.googleapis.com\".format(region)\n", + "client_options = {\"api_endpoint\": API_ENDPOINT}\n", + "\n", + "pred_client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)\n", + "\n", + "response = pred_client.raw_predict(req)\n", + "print(\"--------------------------------------------------------------------------------------\")\n", + "print(\"Response\")\n", + "print(\"Length of Embeddings:\", len(json.loads(response.data)['data'][0]['embedding']))\n", + "pprint(json.loads(response.data))" + ] + }, + { + "cell_type": "markdown", + "id": "2784d83f-6daa-4b69-8602-5fe829960a3f", + "metadata": {}, + "source": [ + "CLI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57ea7b8c-286f-4e1b-b810-431982e57ebb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! curl \\\n", + " --request POST \\\n", + " --header \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n", + " --header \"Content-Type: application/json\" \\\n", + " https://$region-prediction-aiplatform.googleapis.com/v1/projects/$project_id/locations/$region/endpoints/$ENDPOINT_ID:rawPredict \\\n", + " --data \"@request_nrem.json\"" + ] + }, + { + "cell_type": "markdown", + "id": "iMk921f1QKeK", + "metadata": { + "id": "iMk921f1QKeK" + }, + "source": [ + "### Clean Up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "Y_YbwDwxQJ1G", + "metadata": { + "id": "Y_YbwDwxQJ1G" + }, + "outputs": [], + "source": [ + "delete_endpoint = True\n", + "delete_model = True\n", + "delete_image = True\n", + "delete_art_repo = False\n", + "delete_bucket = False\n", + "\n", + "# Undeploy model and delete endpoint\n", + "try:\n", + " if delete_endpoint:\n", + " endpoint.undeploy_all(sync=True)\n", + " endpoint.delete()\n", + " print(f\"Deleted endpoint {endpoint.display_name}\")\n", + "except Exception as e:\n", + " print(e)\n", + "\n", + "# 
Delete the model resource\n", + "try:\n", + " if delete_model:\n", + " model.delete()\n", + " print(f\"Deleted model {model.display_name}\")\n", + "except Exception as e:\n", + " print(e)\n", + "\n", + "# Delete the container image from Artifact Registry\n", + "if delete_image:\n", + " !gcloud artifacts docker images delete --quiet --delete-tags {private_nim_image}\n", + "\n", + "# Delete the Artifact Repository\n", + "if delete_art_repo:\n", + " ! gcloud artifacts repositories delete {private_repository} --location={region} -q\n", + "\n", + "# Delete the Cloud Storage bucket\n", + "if delete_bucket:\n", + " ! gsutil rm -rf {bucket_url}" + ] + } + ], + "metadata": { + "colab": { + "name": "VertexAI NIM Deployment", + "provenance": [] + }, + "environment": { + "kernel": "conda-base-py", + "name": "workbench-notebooks.m123", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m123" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "conda-base-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/requirements.txt b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/requirements.txt new file mode 100644 index 00000000..80aa6635 --- /dev/null +++ b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/requirements.txt @@ -0,0 +1,8 @@ +google-api-core==2.23.0 +google-api-python-client==2.154.0 +google-auth==2.36.0 +google-cloud-aiplatform==1.73.0 +google-cloud-artifact-registry==1.13.1 +google-cloud-storage==2.18.2 +openai==1.55.2 +requests diff --git 
a/cloud-service-providers/google-cloud/vertexai/nemo-retriever/samples/request_nrem.json b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/samples/request_nrem.json new file mode 100644 index 00000000..a79b4c36 --- /dev/null +++ b/cloud-service-providers/google-cloud/vertexai/nemo-retriever/samples/request_nrem.json @@ -0,0 +1 @@ +{"model": "nvidia/nv-embedqa-e5-v5", "input": ["Hello world"], "input_type": "query"} \ No newline at end of file