# How to run the scripts in this tutorial

The scripts in this tutorial will help you:

* Create a fleet of 3 AKS (Azure Kubernetes Service) clusters for running LLM inference workloads and routing LLM queries.
* Put the 3 clusters under the management of KubeFleet, a CNCF sandbox project for multi-cluster management, with an additional KubeFleet hub cluster (also an AKS cluster) serving as the management portal.
* Set up KAITO, a CNCF sandbox project that simplifies running LLMs on Kubernetes, on the clusters to facilitate the LLM workloads.
* Connect the 3 clusters with an Istio service mesh.
* Use the Kubernetes Gateway API with the Inference Extension to serve LLM queries.

> Note that even though the scripts are set up to use AKS clusters and related resources for simplicity, the tutorial itself is not necessarily Azure specific: it can run on any Kubernetes environment, as long as inter-cluster connectivity can be established.

## Before you begin

* This tutorial assumes that you are familiar with basic Azure/AKS usage and Kubernetes usage.
* If you don't have an Azure account, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account) before you begin.
* Make sure that you have the following tools installed in your environment (a quick check script follows this list):
  * The Azure CLI (`az`)
  * The Kubernetes CLI (`kubectl`)
  * Helm
  * Docker
  * The Istio CLI (`istioctl`)
  * The Go runtime (>=1.24)
  * `git`
  * `base64`
  * `make`
  * `curl`
* The setup in this tutorial requires GPU-enabled nodes (with NVIDIA A100 GPUs or similar specs).
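
If you would like to confirm the CLI prerequisites before starting, the minimal sketch below checks that each tool is on your `PATH`. It is a convenience only and not part of the tutorial scripts:

```sh
# Verify that the required CLI tools are available on the PATH.
for tool in az kubectl helm docker istioctl go git base64 make curl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
  fi
done

# Check the Go version (this tutorial expects >= 1.24).
go version
```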

## Run the scripts

Switch to the directory that contains this tutorial, then run the setup script:

```sh
chmod +x setup.sh
./setup.sh
```

It may take a while for the setup to complete.

The script includes some configurable parameters; in most cases, the default values should work as-is. See the list of parameters in `setup.sh` and, if needed, set environment variables to override the defaults, as shown in the example below.
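
For example, a run that overrides some of the defaults might look like the following sketch. The variable values here are illustrative; `setup.sh` holds the authoritative list of parameters and their defaults:

```sh
# Hypothetical example: override defaults before running setup.sh.
# RG and MEMBER_1 mirror names used elsewhere in this tutorial;
# check setup.sh for the full set of configurable parameters.
export RG="my-kubefleet-kaito-demo"
export MEMBER_1="my-model-serving-cluster-1"
./setup.sh
```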

## Verify the setup

After the setup script completes, follow the steps below to verify the setup:

* Switch to one of the clusters that is running an inference workload:

  ```sh
  MEMBER_1="${MEMBER_1:-model-serving-cluster-1}"
  MEMBER_2="${MEMBER_2:-model-serving-cluster-2}"
  MEMBER_3="${MEMBER_3:-query-routing-cluster}"
  MEMBER_1_CTX=$MEMBER_1-admin
  MEMBER_2_CTX=$MEMBER_2-admin
  MEMBER_3_CTX=$MEMBER_3-admin

  kubectl config use-context $MEMBER_1_CTX
  kubectl get workspace
  ```

  You should see that the KAITO workspace with the DeepSeek model is up and running. Note that it may take
  a while for a GPU node to become ready and for the model to be downloaded and set up; the sketch below shows one way to watch the progress.
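
  While waiting, a minimal way to watch the workspace (plain `kubectl`, nothing tutorial-specific) is:

  ```sh
  # Watch the workspace status until it reports ready; press Ctrl-C to stop.
  kubectl get workspace -w

  # If it appears stuck (e.g., waiting for a GPU node), check recent events.
  kubectl get events --sort-by=.lastTimestamp | tail -n 20
  ```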

* Similarly, switch to the other cluster that is running an inference workload and make sure that the Phi model
  is up and running:

  ```sh
  kubectl config use-context $MEMBER_2_CTX
  kubectl get workspace
  ```

* Now, switch to the query routing cluster and send some queries to the inference gateway:

  ```sh
  kubectl config use-context $MEMBER_3_CTX

  # Run the port-forward in a separate shell window; it blocks while active.
  kubectl port-forward svc/inference-gateway-istio 10000:80

  # Back in the original window, send a query.
  curl -X POST http://localhost:10000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "auto",
      "messages": [{"role": "user", "content": "Prove the Pythagorean theorem step by step"}],
      "max_tokens": 100
    }'
  ```

  You should see from the response that the query is being served by the DeepSeek model.

  ```sh
  curl -X POST http://localhost:10000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "auto",
      "messages": [{"role": "user", "content": "What is the color of the sky?"}],
      "max_tokens": 100
    }'
  ```

  You should see from the response that the query is being served by the Phi model. A quick way to confirm which model served a response is shown in the snippet after the note below.

  > Note: the tutorial features a semantic router that classifies queries by category and sends each query to the LLM best equipped to handle that category. The process is partly non-deterministic due to the nature of LLMs. If you believe that a query belongs to a specific category but is not served by the expected LLM, tweak the query text a bit and give it another try.
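
  Assuming the gateway returns an OpenAI-compatible chat completions response (which includes a `model` field), you can extract the serving model's name with `jq` (not in the prerequisite list, so install it first):

  ```sh
  # Print only the name of the model that served the query (requires jq).
  curl -s -X POST http://localhost:10000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "auto",
      "messages": [{"role": "user", "content": "What is the color of the sky?"}],
      "max_tokens": 100
    }' | jq -r '.model'
  ```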

## Additional steps

You can set up the LiteLLM proxy to interact with the models using a web UI. Follow the steps in the [LiteLLM setup README](./litellm/README.md) to complete the setup.

## Clean things up

To clean things up, delete the Azure resource group that contains all the resources:

```sh
export RG="${RG:-kubefleet-kaito-demo-2025}"
az group delete -n $RG
```
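
By default, `az group delete` prompts for confirmation and waits for the deletion to finish. For a non-interactive teardown, add the `--yes` and `--no-wait` flags:

```sh
# Skip the confirmation prompt and return without waiting for completion.
az group delete -n $RG --yes --no-wait
```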

## Questions or comments?

If you have any questions or comments, please use our [Q&A Discussions](https://github.com/kubefleet-dev/kubefleet/discussions/categories/q-a).

If you find a bug or the solution doesn't work, please open an [Issue](https://github.com/kubefleet-dev/kubefleet/issues/new) so we can take a look. We welcome submissions too, so if you find a fix, please open a PR!

Also, consider coming to a [Community Meeting](https://bit.ly/kubefleet-cm-meeting)!