Skip to content

Commit f8df0a7

Browse files
authored
Bedrock fix (#5642)
* fix(bedrock): Poll for model Active status before creating deployment Add _wait_for_model_active() to poll get_custom_model until the model reaches Active status before calling create_custom_model_deployment. This fixes ValidationException when the custom model is not yet ready for deployment after create_custom_model returns. * feat(bedrock): Harden BedrockModelBuilder for production readiness Extract _is_nova_model() helper to eliminate duplicated Nova detection logic across deploy() and _get_s3_artifacts(). Uses getattr with safe defaults instead of fragile hasattr chains. Add input validation to deploy() and create_deployment(): - Raise ValueError when model_package is not set - Raise ValueError when custom_model_name or role_arn missing for Nova deployments - Raise ValueError when model_arn is empty in create_deployment Move json and urlparse imports to module level (were previously imported inside _get_checkpoint_uri_from_manifest). Replace f-string logging with lazy %s formatting throughout. Initialize status=None before the polling loop in _wait_for_model_active to avoid UnboundLocalError if the loop body never executes. Rewrite unit tests (43 tests) with full coverage: - _is_nova_model: recipe_name, hub_content_name, case insensitivity, missing base_model, None fields - __init__: None model, TrainingJob, ModelPackage - Client singletons: caching, injection - _fetch_model_package: ModelPackage, TrainingJob, ModelTrainer, unknown type - _get_s3_artifacts: None package, non-Nova, Nova delegation, Nova fallback - _get_checkpoint_uri_from_manifest: success, missing key, NoSuchKey, not TrainingJob, no artifacts, invalid JSON - _wait_for_model_active: immediate, polling, Failed, timeout - create_deployment: polling chain, extra kwargs, empty/None ARN - deploy: non-Nova, Nova full chain, hub_content_name detection, default deployment name, tags, missing params, None stripping Add integration tests for Nova E2E deployment: - Training job existence and status verification - Builder creation and Nova detection via _is_nova_model - S3 artifacts checkpoint validation - Full deploy-with-polling flow (marked @pytest.mark.slow) - Timeout behavior on bogus ARN - Validation error paths (no model_package, empty model_arn) - Resource cleanup fixture for deployments and custom models * feat(bedrock): Add deployment status polling after CreateCustomModelDeployment Previously create_deployment() only polled for the custom model to reach Active status before calling CreateCustomModelDeployment, but did not wait for the deployment itself to become Active. This caused callers to receive a deployment ARN that was still in Creating state, requiring manual polling in user code. Add _wait_for_deployment_active() that polls get_custom_model_deployment until status reaches Active, raises RuntimeError on Failed, and times out after max_wait seconds (default 3600s, poll interval 30s). Wire it into create_deployment() so the full flow is now: 1. _wait_for_model_active (poll model creation) 2. create_custom_model_deployment (API call) 3. _wait_for_deployment_active (poll deployment creation) Gracefully skips deployment polling if the API response does not contain a customModelDeploymentArn. Unit tests (48 passing): - _wait_for_deployment_active: immediate Active, polling, Failed status, timeout - create_deployment: full model+deployment polling chain, skip polling when no ARN in response - deploy Nova chain: updated to verify deployment polling * fix(integ): Fix region handling and add get-or-create Nova training job The TestModelCustomizationDeployment integ tests were failing with DescribeTrainingJob 'Requested resource not found' because the SageMaker SDK caches the first session's region internally. The session-scoped cleanup_e2e_endpoints fixture (autouse) was creating a session in us-east-1 (default) before the class fixtures could set us-west-2, causing all subsequent TrainingJob.get calls to hit the wrong region. Fix by setting AWS_DEFAULT_REGION=us-west-2 in the cleanup_e2e_endpoints fixture before any SageMaker session is created. Add tests/integ/conftest.py with a session-scoped nova_training_job_name fixture that implements get-or-create: - Checks if sdk-integ-nova-micro-sft exists and is Completed - If InProgress, waits for completion - If not found, uploads minimal training data to S3 and launches a Nova Micro SFT training job via SFTTrainer - Reused across test_bedrock_nova_e2e.py and TestBedrockNovaDeployment in test_model_customization_deployment Update both Nova test files to use the shared fixture instead of hardcoded training job names. * fix(integ): Use SAGEMAKER_REGION for cross-region training job lookup The SageMaker SDK's SageMakerClient reads SAGEMAKER_REGION env var at init time and caches the region for all subsequent API calls. The cleanup_e2e_endpoints session fixture was the first to create a SageMakerClient (in the default region), which then poisoned all subsequent TrainingJob.get calls regardless of the region parameter. Fix by setting SAGEMAKER_REGION=us-west-2 in cleanup_e2e_endpoints before any SDK session is created, since all resources in this test file live in us-west-2. The env var is restored after cleanup. In CodeBuild (us-west-2) this is a no-op since the default region already matches. The other test files (triton, tei, tgi) are not affected since they have their own fixtures and don't import from this file. * refactor(integ): Replace Nova integ tests with example notebook Remove us-east-1 Nova integration tests that cannot run in the us-west-2 CodeBuild environment: - Delete tests/integ/test_bedrock_nova_e2e.py - Delete tests/integ/conftest.py (Nova get-or-create fixture) - Remove TestBedrockNovaDeployment class from test_model_customization_deployment.py Add example_notebooks/bedrock_nova_deployment.ipynb covering the full Nova workflow: SFTTrainer fine-tuning, BedrockModelBuilder deploy with model+deployment polling, inference, and cleanup. The BedrockModelBuilder source code and unit tests (48 passing) are unchanged. The us-west-2 integ tests for non-Nova Bedrock deployment (TestModelCustomizationDeployment) remain. * docs: Add Bedrock model builder example notebooks Add notebooks demonstrating Bedrock deployment workflows: - bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment via BedrockModelBuilder with SFTTrainer fine-tuning - boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment - model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment - 07-ml-model-development(1).ipynb: ML model development workflow - sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb: Clean Nova deployment example with polling, inference, and cleanup * docs: Add Bedrock model builder example notebooks Add notebooks demonstrating Bedrock deployment workflows: - bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment via BedrockModelBuilder with SFTTrainer fine-tuning - boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment - model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment - 07-ml-model-development(1).ipynb: ML model development workflow - sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb: Clean Nova deployment example with polling, inference, and cleanup * fix(serve): Update Nova Bedrock deployment notebook with working e2e flow Simplify notebook to use existing completed training job with BedrockModelBuilder deploy flow. Fix Nova inference content format to use array of {text: ...} objects. Remove broken SFTTrainer cells that fail due to botocore service model mismatch.
1 parent 3e1aef8 commit f8df0a7

File tree

8 files changed

+5575
-369
lines changed

8 files changed

+5575
-369
lines changed

07-ml-model-development(1).ipynb

Lines changed: 1260 additions & 0 deletions
Large diffs are not rendered by default.

bedrock-modelbuilder-deployment-nova.ipynb

Lines changed: 568 additions & 0 deletions
Large diffs are not rendered by default.

boto3_deployment_notebook.ipynb

Lines changed: 1202 additions & 0 deletions
Large diffs are not rendered by default.

model_builder_deployment_notebook(1).ipynb

Lines changed: 1492 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 210 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,210 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Deploy a Fine-Tuned Nova Model to Amazon Bedrock\n",
8+
"\n",
9+
"This notebook demonstrates how to deploy a fine-tuned Amazon Nova model to\n",
10+
"Amazon Bedrock using `BedrockModelBuilder`.\n",
11+
"\n",
12+
"The workflow:\n",
13+
"1. Retrieve a completed Nova SFT training job\n",
14+
"2. Create a `BedrockModelBuilder` from the training job\n",
15+
"3. Deploy to Bedrock — the builder automatically:\n",
16+
" - Detects the model as Nova\n",
17+
" - Reads the checkpoint URI from the training job manifest\n",
18+
" - Calls `CreateCustomModel` and polls until Active\n",
19+
" - Calls `CreateCustomModelDeployment` and polls until Active\n",
20+
"4. Test inference\n",
21+
"5. Clean up resources\n",
22+
"\n",
23+
"**Prerequisites:**\n",
24+
"- AWS credentials with SageMaker and Bedrock access\n",
25+
"- `sagemaker-serve` package installed\n",
26+
"- A completed Nova SFT training job\n",
27+
"- An IAM role with Bedrock and SageMaker permissions"
28+
]
29+
},
30+
{
31+
"cell_type": "markdown",
32+
"metadata": {},
33+
"source": [
34+
"## Setup"
35+
]
36+
},
37+
{
38+
"cell_type": "code",
39+
"execution_count": null,
40+
"metadata": {},
41+
"outputs": [],
42+
"source": [
43+
"import os, json, time, random, boto3\n",
44+
"\n",
45+
"REGION = \"us-east-1\"\n",
46+
"os.environ[\"AWS_DEFAULT_REGION\"] = REGION\n",
47+
"os.environ[\"SAGEMAKER_REGION\"] = REGION\n",
48+
"\n",
49+
"from sagemaker.core.helper.session_helper import get_execution_role\n",
50+
"role_arn = get_execution_role()\n",
51+
"print(f\"Role: {role_arn}\")"
52+
]
53+
},
54+
{
55+
"cell_type": "markdown",
56+
"metadata": {},
57+
"source": [
58+
"## Step 1: Retrieve the completed training job\n",
59+
"\n",
60+
"Use an existing completed Nova SFT training job. Replace the job name with your own."
61+
]
62+
},
63+
{
64+
"cell_type": "code",
65+
"execution_count": null,
66+
"metadata": {},
67+
"outputs": [],
68+
"source": [
69+
"from sagemaker.core.resources import TrainingJob\n",
70+
"\n",
71+
"training_job = TrainingJob.get(training_job_name=\"nova-textgeneration-micro-sft-20251208154822\")\n",
72+
"print(f\"Training job: {training_job.training_job_name}\")\n",
73+
"print(f\"Status: {training_job.training_job_status}\")"
74+
]
75+
},
76+
{
77+
"cell_type": "markdown",
78+
"metadata": {},
79+
"source": [
80+
"## Step 2: Deploy to Bedrock with BedrockModelBuilder\n",
81+
"\n",
82+
"The builder handles the full deployment flow:\n",
83+
"- Fetches the model package from the training job\n",
84+
"- Detects it as a Nova model\n",
85+
"- Reads the checkpoint URI from the training output manifest\n",
86+
"- Creates a Bedrock custom model and polls until Active\n",
87+
"- Creates a deployment and polls until Active"
88+
]
89+
},
90+
{
91+
"cell_type": "code",
92+
"execution_count": null,
93+
"metadata": {},
94+
"outputs": [],
95+
"source": [
96+
"from sagemaker.serve.bedrock_model_builder import BedrockModelBuilder\n",
97+
"\n",
98+
"builder = BedrockModelBuilder(model=training_job)\n",
99+
"print(f\"Model package: {builder.model_package}\")\n",
100+
"print(f\"S3 artifacts: {builder.s3_model_artifacts}\")"
101+
]
102+
},
103+
{
104+
"cell_type": "code",
105+
"execution_count": null,
106+
"metadata": {},
107+
"outputs": [],
108+
"source": [
109+
"rand = random.randint(1000, 9999)\n",
110+
"custom_model_name = f\"nova-e2e-{rand}-{int(time.time())}\"\n",
111+
"deployment_name = f\"{custom_model_name}-dep\"\n",
112+
"\n",
113+
"print(f\"Deploying as: {custom_model_name}\")\n",
114+
"print(\"This will poll for model creation and deployment — may take several minutes...\")\n",
115+
"\n",
116+
"response = builder.deploy(\n",
117+
" custom_model_name=custom_model_name,\n",
118+
" role_arn=role_arn,\n",
119+
" deployment_name=deployment_name,\n",
120+
")\n",
121+
"\n",
122+
"deployment_arn = response.get(\"customModelDeploymentArn\")\n",
123+
"print(f\"\\nDeployment ARN: {deployment_arn}\")\n",
124+
"print(\"Deployment is Active and ready for inference.\")"
125+
]
126+
},
127+
{
128+
"cell_type": "markdown",
129+
"metadata": {},
130+
"source": [
131+
"## Step 3: Test inference\n",
132+
"\n",
133+
"Once the deployment is Active, invoke it via the Bedrock Runtime API.\n",
134+
"Nova expects `content` as an array of objects with a `text` key."
135+
]
136+
},
137+
{
138+
"cell_type": "code",
139+
"execution_count": null,
140+
"metadata": {},
141+
"outputs": [],
142+
"source": [
143+
"bedrock_runtime = boto3.client(\"bedrock-runtime\", region_name=REGION)\n",
144+
"\n",
145+
"resp = bedrock_runtime.invoke_model(\n",
146+
" modelId=deployment_arn,\n",
147+
" contentType=\"application/json\",\n",
148+
" body=json.dumps({\n",
149+
" \"messages\": [{\"role\": \"user\", \"content\": [{\"text\": \"What is 7 + 7?\"}]}]\n",
150+
" }),\n",
151+
")\n",
152+
"\n",
153+
"result = json.loads(resp[\"body\"].read())\n",
154+
"print(f\"Response: {json.dumps(result, indent=2)}\")"
155+
]
156+
},
157+
{
158+
"cell_type": "markdown",
159+
"metadata": {},
160+
"source": [
161+
"## Step 4: Cleanup\n",
162+
"\n",
163+
"Delete the deployment and custom model to avoid ongoing charges."
164+
]
165+
},
166+
{
167+
"cell_type": "code",
168+
"execution_count": null,
169+
"metadata": {},
170+
"outputs": [],
171+
"source": [
172+
"bedrock = boto3.client(\"bedrock\", region_name=REGION)\n",
173+
"\n",
174+
"dep_info = bedrock.get_custom_model_deployment(\n",
175+
" customModelDeploymentIdentifier=deployment_arn\n",
176+
")\n",
177+
"model_arn = dep_info.get(\"modelArn\")\n",
178+
"\n",
179+
"# Delete deployment first\n",
180+
"try:\n",
181+
" bedrock.delete_custom_model_deployment(\n",
182+
" customModelDeploymentIdentifier=deployment_arn\n",
183+
" )\n",
184+
" print(f\"Deleted deployment: {deployment_arn}\")\n",
185+
"except Exception as e:\n",
186+
" print(f\"Failed to delete deployment: {e}\")\n",
187+
"\n",
188+
"# Then delete the custom model\n",
189+
"try:\n",
190+
" bedrock.delete_custom_model(modelIdentifier=model_arn)\n",
191+
" print(f\"Deleted custom model: {model_arn}\")\n",
192+
"except Exception as e:\n",
193+
" print(f\"Failed to delete custom model: {e}\")"
194+
]
195+
}
196+
],
197+
"metadata": {
198+
"kernelspec": {
199+
"display_name": "Python 3",
200+
"language": "python",
201+
"name": "python3"
202+
},
203+
"language_info": {
204+
"name": "python",
205+
"version": "3.12.0"
206+
}
207+
},
208+
"nbformat": 4,
209+
"nbformat_minor": 4
210+
}

0 commit comments

Comments
 (0)