
Bedrock fix #5642

Merged

nargokul merged 10 commits into aws:master from nargokul:bedrock-fix on Mar 19, 2026

Conversation

@nargokul
Contributor

Source (bedrock_model_builder.py):

- Extract a module-level _is_nova_model() helper using getattr with safe defaults, replacing the duplicated hasattr chains
- Add ValueError guards in deploy() for a missing model_package, custom_model_name, or role_arn
- Add a ValueError guard in create_deployment() for an empty/None model_arn
- Initialize status = None before the polling loop to prevent UnboundLocalError
- Move the json and urlparse imports to module level
- Replace f-string logging with lazy %s formatting throughout

Add _wait_for_model_active() to poll get_custom_model until the model
reaches Active status before calling create_custom_model_deployment.
This fixes ValidationException when the custom model is not yet ready
for deployment after create_custom_model returns.
Extract _is_nova_model() helper to eliminate duplicated Nova
detection logic across deploy() and _get_s3_artifacts(). Uses
getattr with safe defaults instead of fragile hasattr chains.

Add input validation to deploy() and create_deployment():
- Raise ValueError when model_package is not set
- Raise ValueError when custom_model_name or role_arn missing
  for Nova deployments
- Raise ValueError when model_arn is empty in create_deployment

Move json and urlparse imports to module level (were previously
imported inside _get_checkpoint_uri_from_manifest). Replace
f-string logging with lazy %s formatting throughout.
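The logging change amounts to the pattern below (the ARN is a placeholder value). With lazy %s formatting, interpolation is deferred until a handler actually emits the record, so log calls at disabled levels cost almost nothing.

```python
import logging

logger = logging.getLogger(__name__)
model_arn = "arn:aws:bedrock:us-west-2:123456789012:custom-model/example"  # placeholder

# Eager f-string: the message string is built even if INFO is disabled.
# logger.info(f"Created custom model {model_arn}")

# Lazy %s formatting: the logging module interpolates the argument only
# when the record is actually handled.
logger.info("Created custom model %s", model_arn)
```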

Initialize status=None before the polling loop in
_wait_for_model_active to avoid UnboundLocalError if the loop
body never executes.

Rewrite unit tests (43 tests) with full coverage:
- _is_nova_model: recipe_name, hub_content_name, case
  insensitivity, missing base_model, None fields
- __init__: None model, TrainingJob, ModelPackage
- Client singletons: caching, injection
- _fetch_model_package: ModelPackage, TrainingJob, ModelTrainer,
  unknown type
- _get_s3_artifacts: None package, non-Nova, Nova delegation,
  Nova fallback
- _get_checkpoint_uri_from_manifest: success, missing key,
  NoSuchKey, not TrainingJob, no artifacts, invalid JSON
- _wait_for_model_active: immediate, polling, Failed, timeout
- create_deployment: polling chain, extra kwargs, empty/None ARN
- deploy: non-Nova, Nova full chain, hub_content_name detection,
  default deployment name, tags, missing params, None stripping

Add integration tests for Nova E2E deployment:
- Training job existence and status verification
- Builder creation and Nova detection via _is_nova_model
- S3 artifacts checkpoint validation
- Full deploy-with-polling flow (marked @pytest.mark.slow)
- Timeout behavior on bogus ARN
- Validation error paths (no model_package, empty model_arn)
- Resource cleanup fixture for deployments and custom models
…eployment


Previously create_deployment() only polled for the custom model
to reach Active status before calling CreateCustomModelDeployment,
but did not wait for the deployment itself to become Active. This
caused callers to receive a deployment ARN that was still in
Creating state, requiring manual polling in user code.

Add _wait_for_deployment_active() that polls
get_custom_model_deployment until status reaches Active, raises
RuntimeError on Failed, and times out after max_wait seconds
(default 3600s, poll interval 30s).

Wire it into create_deployment() so the full flow is now:
1. _wait_for_model_active (poll model creation)
2. create_custom_model_deployment (API call)
3. _wait_for_deployment_active (poll deployment creation)

Gracefully skips deployment polling if the API response does not
contain a customModelDeploymentArn.

Unit tests (48 passing):
- _wait_for_deployment_active: immediate Active, polling,
  Failed status, timeout
- create_deployment: full model+deployment polling chain,
  skip polling when no ARN in response
- deploy Nova chain: updated to verify deployment polling
The TestModelCustomizationDeployment integ tests were failing with
DescribeTrainingJob 'Requested resource not found' because the
SageMaker SDK caches the first session's region internally. The
session-scoped cleanup_e2e_endpoints fixture (autouse) was creating
a session in us-east-1 (default) before the class fixtures could
set us-west-2, causing all subsequent TrainingJob.get calls to hit
the wrong region.

Fix by setting AWS_DEFAULT_REGION=us-west-2 in the
cleanup_e2e_endpoints fixture before any SageMaker session is
created.

Add tests/integ/conftest.py with a session-scoped
nova_training_job_name fixture that implements get-or-create:
- Checks if sdk-integ-nova-micro-sft exists and is Completed
- If InProgress, waits for completion
- If not found, uploads minimal training data to S3 and launches
  a Nova Micro SFT training job via SFTTrainer
- Reused across test_bedrock_nova_e2e.py and
  TestBedrockNovaDeployment in test_model_customization_deployment

Update both Nova test files to use the shared fixture instead of
hardcoded training job names.
The SageMaker SDK's SageMakerClient reads SAGEMAKER_REGION env var
at init time and caches the region for all subsequent API calls.
The cleanup_e2e_endpoints session fixture was the first to create
a SageMakerClient (in the default region), which then poisoned all
subsequent TrainingJob.get calls regardless of the region parameter.

Fix by setting SAGEMAKER_REGION=us-west-2 in cleanup_e2e_endpoints
before any SDK session is created, since all resources in this test
file live in us-west-2. The env var is restored after cleanup.
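The set-then-restore pattern inside the fixture could look like the sketch below, wrapped as a context manager so it is runnable without a pytest session. In the actual fixture this would be the body of the session-scoped, autouse cleanup_e2e_endpoints generator.

```python
import os
from contextlib import contextmanager

# Hypothetical sketch of the region pinning done in cleanup_e2e_endpoints;
# SAGEMAKER_REGION and us-west-2 come from the PR description.
@contextmanager
def pinned_sagemaker_region(region="us-west-2"):
    previous = os.environ.get("SAGEMAKER_REGION")
    # Pin before any SageMakerClient (and hence any session) is created,
    # so the cached region matches where the test resources live.
    os.environ["SAGEMAKER_REGION"] = region
    try:
        yield region
    finally:
        # Restore the caller's environment after cleanup, as the PR notes.
        if previous is None:
            os.environ.pop("SAGEMAKER_REGION", None)
        else:
            os.environ["SAGEMAKER_REGION"] = previous
```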

In CodeBuild (us-west-2) this is a no-op since the default region
already matches. The other test files (triton, tei, tgi) are not
affected since they have their own fixtures and don't import from
this file.
Remove us-east-1 Nova integration tests that cannot run in the
us-west-2 CodeBuild environment:
- Delete tests/integ/test_bedrock_nova_e2e.py
- Delete tests/integ/conftest.py (Nova get-or-create fixture)
- Remove TestBedrockNovaDeployment class from
  test_model_customization_deployment.py

Add example_notebooks/bedrock_nova_deployment.ipynb covering the
full Nova workflow: SFTTrainer fine-tuning, BedrockModelBuilder
deploy with model+deployment polling, inference, and cleanup.

The BedrockModelBuilder source code and unit tests (48 passing)
are unchanged. The us-west-2 integ tests for non-Nova Bedrock
deployment (TestModelCustomizationDeployment) remain.
Add notebooks demonstrating Bedrock deployment workflows:
- bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment
  via BedrockModelBuilder with SFTTrainer fine-tuning
- boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment
- model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment
- 07-ml-model-development(1).ipynb: ML model development workflow
- sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb:
  Clean Nova deployment example with polling, inference, and cleanup
rsareddy0329 previously approved these changes Mar 18, 2026
logger = logging.getLogger(__name__)


def _is_nova_model(container) -> bool:

Not a blocker, but I think this can be moved to common utils, as ModelBuilder also has this method.

…flow

Simplify notebook to use existing completed training job with
BedrockModelBuilder deploy flow. Fix Nova inference content format
to use array of {text: ...} objects. Remove broken SFTTrainer cells
that fail due to botocore service model mismatch.
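The content-format fix mentioned above amounts to sending "content" as an array of {"text": ...} objects rather than a plain string. A minimal sketch of such a request body, with the surrounding message envelope assumed:

```python
import json

# Sketch of the Nova content format referenced in the notebook fix:
# "content" is a list of {"text": ...} objects, not a bare string.
body = {
    "messages": [
        {"role": "user", "content": [{"text": "Summarize this document."}]}
    ]
}
payload = json.dumps(body)  # serialized request body for the invoke call
```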
@nargokul nargokul deployed to auto-approve March 19, 2026 01:57 — with GitHub Actions Active
@nargokul nargokul merged commit f8df0a7 into aws:master Mar 19, 2026
17 of 23 checks passed
