
Add foundation provider infrastructure for data and compute providers#48

Merged
punit-naik-amp merged 1 commit into CHUCK-10-redshift from CHUCK-10-provider-infrastructure-foundation
Dec 15, 2025

Conversation

@punit-naik-amp
Contributor

This PR establishes the base provider architecture for accessing data from different platforms and running Stitch jobs on different compute backends.

Changes:

  • Add DataProvider protocol defining the interface for data sources (see the sketch after this list)
  • Add DatabricksProviderAdapter stub (implementation in PR 2)
  • Add RedshiftProviderAdapter stub with required AWS credentials, IAM role, and EMR cluster ID (implementation in PR 2)
  • Add DataProviderFactory for creating data providers
  • Add ComputeProvider protocol defining the interface for compute backends
  • Add DatabricksComputeProvider stub (implementation in PR 3)
  • Add EMRComputeProvider stub (implementation in PR 4)
  • Add ProviderFactory with unified interface for both provider types
  • Add comprehensive unit tests (52 tests, all passing)
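
The data-provider side could be sketched roughly as below. This is a minimal illustration only, assuming hypothetical method names (read_table/write_table) and registry keys; the real protocol surface arrives with the implementations in PR 2.

```python
# Hypothetical sketch: method names and registry keys are assumptions for
# illustration, not the actual interface introduced by this PR.
from typing import Any, Dict, Protocol, Type, runtime_checkable


@runtime_checkable
class DataProvider(Protocol):
    """Interface every data source (Databricks, Redshift, ...) implements."""

    def read_table(self, table_name: str) -> Any:
        """Load a table from the underlying platform."""
        ...

    def write_table(self, table_name: str, data: Any) -> None:
        """Write results back to the underlying platform."""
        ...


class DataProviderFactory:
    """Creates a concrete DataProvider from a provider-type key."""

    _registry: Dict[str, Type[Any]] = {}

    @classmethod
    def register(cls, provider_type: str, provider_cls: Type[Any]) -> None:
        cls._registry[provider_type] = provider_cls

    @classmethod
    def create(cls, provider_type: str, **config: Any) -> DataProvider:
        try:
            return cls._registry[provider_type](**config)
        except KeyError:
            raise ValueError(f"Unknown data provider type: {provider_type}") from None
```

Registering the Databricks and Redshift adapters under keys such as "databricks" and "redshift" (names assumed here) keeps callers agnostic of which backend they read from.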

Key design decisions:

  • Data providers handle storage operations (no separate abstraction)
  • EMR uses boto3 credential discovery (aws_profile, IAM roles, env vars)
  • RedshiftProviderAdapter requires AWS credentials and accepts redshift_iam_role for COPY/UNLOAD operations
  • ComputeProvider.prepare_stitch_job() receives a data_provider parameter (see the compute-side sketch after this list)
  • Pure additive changes (no modifications to existing code)
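
On the compute side, the decisions above mean prepare_stitch_job() takes the data provider as an argument and the EMR backend relies on boto3's normal credential chain rather than hard-coded keys. A rough sketch, with StitchJob and the exact signatures assumed purely for illustration:

```python
# Rough sketch; StitchJob and these signatures are assumptions for illustration,
# not the real code from this PR.
from dataclasses import dataclass
from typing import Any, Dict, Optional, Protocol

import boto3


@dataclass
class StitchJob:
    """Placeholder for a prepared Stitch job specification."""

    spec: Dict[str, Any]


class ComputeProvider(Protocol):
    def prepare_stitch_job(self, data_provider: Any, **options: Any) -> StitchJob:
        """Build a job spec, using the data provider for storage access."""
        ...

    def run_stitch_job(self, job: StitchJob) -> str:
        """Submit the job and return a backend-specific run identifier."""
        ...


class EMRComputeProvider:
    """Compute backend stub that would submit Stitch jobs as EMR steps."""

    def __init__(
        self,
        cluster_id: str,
        aws_profile: Optional[str] = None,
        region: Optional[str] = None,
    ) -> None:
        # boto3's standard credential discovery: an explicit profile if given,
        # otherwise environment variables, shared config files, or IAM roles.
        session = boto3.Session(profile_name=aws_profile, region_name=region)
        self._emr = session.client("emr")
        self._cluster_id = cluster_id

    def prepare_stitch_job(self, data_provider: Any, **options: Any) -> StitchJob:
        # Actual EMR step construction lands in PR 4; this is a stub.
        raise NotImplementedError
```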

Jira: CHUCK-10

This is just the scaffolding (purely additive changes); no existing code is modified. I'm doing it in stages so that review stays manageable, and will fold in the actual Databricks and Redshift implementations in later PRs.

@punit-naik-amp force-pushed the CHUCK-10-provider-infrastructure-foundation branch from b537298 to ec5a0a6 on December 12, 2025 16:28
@punit-naik-amp changed the base branch from main to CHUCK-10-redshift on December 13, 2025 06:57
@punit-naik-amp
Contributor Author

@pragyan-amp Changed the base branch from main to CHUCK-10-redshift so that main stays clean while we merge reviewed and tested PRs into CHUCK-10-redshift in stages (this feature involves too many code changes to review in one single, huge PR). At the end I will create one final PR from CHUCK-10-redshift to main.

Contributor

@pragyan-amp left a comment

Looks Good.. :shipit:

@punit-naik-amp merged commit a7ffd33 into CHUCK-10-redshift on Dec 15, 2025
2 checks passed
@punit-naik-amp deleted the CHUCK-10-provider-infrastructure-foundation branch on December 15, 2025 03:50
punit-naik-amp added a commit that referenced this pull request on Jan 13, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 13, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 16, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 16, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 16, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 16, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 16, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 23, 2026
punit-naik-amp added a commit that referenced this pull request on Jan 28, 2026