
Bedrock fix #5642

Merged

nargokul merged 10 commits into aws:master from nargokul:bedrock-fix on Mar 19, 2026

Conversation

@nargokul
Contributor

Source (bedrock_model_builder.py):

- Extract a module-level _is_nova_model() helper using getattr with safe defaults, replacing the duplicated hasattr chains
- Add ValueError guards in deploy() for a missing model_package, custom_model_name, or role_arn
- Add a ValueError guard in create_deployment() for an empty/None model_arn
- Initialize status = None before the polling loop to prevent UnboundLocalError
- Move the json and urlparse imports to module level
- Replace f-string logging with lazy %s formatting throughout

Add _wait_for_model_active() to poll get_custom_model until the model
reaches Active status before calling create_custom_model_deployment.
This fixes ValidationException when the custom model is not yet ready
for deployment after create_custom_model returns.
Extract _is_nova_model() helper to eliminate duplicated Nova
detection logic across deploy() and _get_s3_artifacts(). Uses
getattr with safe defaults instead of fragile hasattr chains.

Add input validation to deploy() and create_deployment():
- Raise ValueError when model_package is not set
- Raise ValueError when custom_model_name or role_arn missing
  for Nova deployments
- Raise ValueError when model_arn is empty in create_deployment

Move json and urlparse imports to module level (were previously
imported inside _get_checkpoint_uri_from_manifest). Replace
f-string logging with lazy %s formatting throughout.
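The logging change amounts to the pattern below (the ARN is a placeholder value). With lazy %s formatting, interpolation is deferred until a handler actually emits the record, so log calls at disabled levels cost almost nothing.

```python
import logging

logger = logging.getLogger(__name__)
model_arn = "arn:aws:bedrock:us-west-2:123456789012:custom-model/example"  # placeholder

# Eager f-string: the message string is built even if INFO is disabled.
# logger.info(f"Created custom model {model_arn}")

# Lazy %s formatting: the logging module interpolates the argument only
# when the record is actually handled.
logger.info("Created custom model %s", model_arn)
```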

Initialize status=None before the polling loop in
_wait_for_model_active to avoid UnboundLocalError if the loop
body never executes.

Rewrite unit tests (43 tests) with full coverage:
- _is_nova_model: recipe_name, hub_content_name, case
  insensitivity, missing base_model, None fields
- __init__: None model, TrainingJob, ModelPackage
- Client singletons: caching, injection
- _fetch_model_package: ModelPackage, TrainingJob, ModelTrainer,
  unknown type
- _get_s3_artifacts: None package, non-Nova, Nova delegation,
  Nova fallback
- _get_checkpoint_uri_from_manifest: success, missing key,
  NoSuchKey, not TrainingJob, no artifacts, invalid JSON
- _wait_for_model_active: immediate, polling, Failed, timeout
- create_deployment: polling chain, extra kwargs, empty/None ARN
- deploy: non-Nova, Nova full chain, hub_content_name detection,
  default deployment name, tags, missing params, None stripping

Add integration tests for Nova E2E deployment:
- Training job existence and status verification
- Builder creation and Nova detection via _is_nova_model
- S3 artifacts checkpoint validation
- Full deploy-with-polling flow (marked @pytest.mark.slow)
- Timeout behavior on bogus ARN
- Validation error paths (no model_package, empty model_arn)
- Resource cleanup fixture for deployments and custom models
…eployment


Previously create_deployment() only polled for the custom model
to reach Active status before calling CreateCustomModelDeployment,
but did not wait for the deployment itself to become Active. This
caused callers to receive a deployment ARN that was still in
Creating state, requiring manual polling in user code.

Add _wait_for_deployment_active() that polls
get_custom_model_deployment until status reaches Active, raises
RuntimeError on Failed, and times out after max_wait seconds
(default 3600s, poll interval 30s).

Wire it into create_deployment() so the full flow is now:
1. _wait_for_model_active (poll model creation)
2. create_custom_model_deployment (API call)
3. _wait_for_deployment_active (poll deployment creation)

Gracefully skips deployment polling if the API response does not
contain a customModelDeploymentArn.

Unit tests (48 passing):
- _wait_for_deployment_active: immediate Active, polling,
  Failed status, timeout
- create_deployment: full model+deployment polling chain,
  skip polling when no ARN in response
- deploy Nova chain: updated to verify deployment polling
The TestModelCustomizationDeployment integ tests were failing with
DescribeTrainingJob 'Requested resource not found' because the
SageMaker SDK caches the first session's region internally. The
session-scoped cleanup_e2e_endpoints fixture (autouse) was creating
a session in us-east-1 (default) before the class fixtures could
set us-west-2, causing all subsequent TrainingJob.get calls to hit
the wrong region.

Fix by setting AWS_DEFAULT_REGION=us-west-2 in the
cleanup_e2e_endpoints fixture before any SageMaker session is
created.

Add tests/integ/conftest.py with a session-scoped
nova_training_job_name fixture that implements get-or-create:
- Checks if sdk-integ-nova-micro-sft exists and is Completed
- If InProgress, waits for completion
- If not found, uploads minimal training data to S3 and launches
  a Nova Micro SFT training job via SFTTrainer
- Reused across test_bedrock_nova_e2e.py and
  TestBedrockNovaDeployment in test_model_customization_deployment

Update both Nova test files to use the shared fixture instead of
hardcoded training job names.
The SageMaker SDK's SageMakerClient reads SAGEMAKER_REGION env var
at init time and caches the region for all subsequent API calls.
The cleanup_e2e_endpoints session fixture was the first to create
a SageMakerClient (in the default region), which then poisoned all
subsequent TrainingJob.get calls regardless of the region parameter.

Fix by setting SAGEMAKER_REGION=us-west-2 in cleanup_e2e_endpoints
before any SDK session is created, since all resources in this test
file live in us-west-2. The env var is restored after cleanup.
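The set-then-restore pattern inside the fixture could look like the sketch below, wrapped as a context manager so it is runnable without a pytest session. In the actual fixture this would be the body of the session-scoped, autouse cleanup_e2e_endpoints generator.

```python
import os
from contextlib import contextmanager

# Hypothetical sketch of the region pinning done in cleanup_e2e_endpoints;
# SAGEMAKER_REGION and us-west-2 come from the PR description.
@contextmanager
def pinned_sagemaker_region(region="us-west-2"):
    previous = os.environ.get("SAGEMAKER_REGION")
    # Pin before any SageMakerClient (and hence any session) is created,
    # so the cached region matches where the test resources live.
    os.environ["SAGEMAKER_REGION"] = region
    try:
        yield region
    finally:
        # Restore the caller's environment after cleanup, as the PR notes.
        if previous is None:
            os.environ.pop("SAGEMAKER_REGION", None)
        else:
            os.environ["SAGEMAKER_REGION"] = previous
```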

In CodeBuild (us-west-2) this is a no-op since the default region
already matches. The other test files (triton, tei, tgi) are not
affected since they have their own fixtures and don't import from
this file.
Remove us-east-1 Nova integration tests that cannot run in the
us-west-2 CodeBuild environment:
- Delete tests/integ/test_bedrock_nova_e2e.py
- Delete tests/integ/conftest.py (Nova get-or-create fixture)
- Remove TestBedrockNovaDeployment class from
  test_model_customization_deployment.py

Add example_notebooks/bedrock_nova_deployment.ipynb covering the
full Nova workflow: SFTTrainer fine-tuning, BedrockModelBuilder
deploy with model+deployment polling, inference, and cleanup.

The BedrockModelBuilder source code and unit tests (48 passing)
are unchanged. The us-west-2 integ tests for non-Nova Bedrock
deployment (TestModelCustomizationDeployment) remain.
Add notebooks demonstrating Bedrock deployment workflows:
- bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment
  via BedrockModelBuilder with SFTTrainer fine-tuning
- boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment
- model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment
- 07-ml-model-development(1).ipynb: ML model development workflow
- sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb:
  Clean Nova deployment example with polling, inference, and cleanup
rsareddy0329 previously approved these changes Mar 18, 2026
logger = logging.getLogger(__name__)


def _is_nova_model(container) -> bool:

Not a blocker, but I think this can be moved to common utils, as ModelBuilder also has this method.

…flow

Simplify notebook to use existing completed training job with
BedrockModelBuilder deploy flow. Fix Nova inference content format
to use array of {text: ...} objects. Remove broken SFTTrainer cells
that fail due to botocore service model mismatch.
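The content-format fix mentioned above amounts to sending "content" as an array of {"text": ...} objects rather than a plain string. A minimal sketch of such a request body, with the surrounding message envelope assumed:

```python
import json

# Sketch of the Nova content format referenced in the notebook fix:
# "content" is a list of {"text": ...} objects, not a bare string.
body = {
    "messages": [
        {"role": "user", "content": [{"text": "Summarize this document."}]}
    ]
}
payload = json.dumps(body)  # serialized request body for the invoke call
```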
@nargokul nargokul deployed to auto-approve March 19, 2026 01:57 — with GitHub Actions Active
@nargokul nargokul merged commit f8df0a7 into aws:master Mar 19, 2026
17 of 23 checks passed
