Enabling AWS (Redshift + EMR) in chuck-data #70
Open
punit-naik-amp wants to merge 24 commits into main from
Conversation
…#48)

This PR establishes the base provider architecture for accessing data from different platforms and running Stitch jobs on different compute backends.

Changes:
- Add DataProvider protocol defining the interface for data sources
- Add DatabricksProviderAdapter stub (implementation in PR 2)
- Add RedshiftProviderAdapter stub with required AWS credentials, IAM role, and EMR cluster ID (implementation in PR 2)
- Add DataProviderFactory for creating data providers
- Add ComputeProvider protocol defining the interface for compute backends
- Add DatabricksComputeProvider stub (implementation in PR 3)
- Add EMRComputeProvider stub (implementation in PR 4)
- Add ProviderFactory with unified interface for both provider types
- Add comprehensive unit tests (52 tests, all passing)

Key design decisions:
- Data providers handle storage operations (no separate abstraction)
- EMR uses boto3 credential discovery (aws_profile, IAM roles, env vars)
- RedshiftProviderAdapter requires AWS credentials and accepts redshift_iam_role for COPY/UNLOAD operations
- ComputeProvider.prepare_stitch_job() receives a data_provider parameter
- Purely additive changes (no modifications to existing code)

Jira: CHUCK-10

This is just the scaffolding/additive changes; no existing code is modified. Splitting the work into stages keeps reviews manageable. The actual Databricks and Redshift implementations will be folded in by later PRs.
This PR implements the actual API clients that provider adapters use to communicate with Databricks and AWS Redshift, replacing NotImplementedError stubs with fully functional implementations.

Key Changes:

1. RedshiftAPIClient Implementation (chuck_data/clients/redshift.py)
   - Full Redshift Data API integration using boto3
   - Connection validation via list_databases()
   - Async SQL execution with polling support
   - Database/schema/table listing and metadata operations
   - S3 upload/list operations for manifest files
   - Supports both provisioned clusters and serverless workgroups
   - Optional AWS credentials (falls back to IAM roles/profiles)

2. Updated Provider Adapters (chuck_data/data_providers/adapters.py)
   - DatabricksProviderAdapter now uses DatabricksAPIClient
   - RedshiftProviderAdapter now uses RedshiftAPIClient
   - Removed all NotImplementedError stubs
   - All protocol methods delegate to underlying clients
   - RedshiftProviderAdapter stores additional config (IAM role, EMR cluster ID)

3. Client Module Exports (chuck_data/clients/__init__.py)
   - Export both DatabricksAPIClient and RedshiftAPIClient
   - Makes clients available to provider adapters

4. Comprehensive Testing (tests/unit/clients/test_redshift.py)
   - 47 unit tests for RedshiftAPIClient covering:
     * Client initialization (cluster and workgroup modes)
     * Connection validation
     * List operations (databases, schemas, tables)
     * Table metadata retrieval
     * SQL execution (sync and async modes)
     * S3 operations (upload and list)
     * Error handling for all operations

5. Updated Adapter Tests
   - tests/unit/data_providers/test_adapters.py: Verify real client instantiation
   - tests/unit/data_providers/test_factory.py: Check client creation via factory
   - tests/unit/test_provider_factory.py: Test ProviderFactory data provider creation
#50) …support

Add complete DatabricksComputeProvider implementation as part of the provider pattern refactoring (PR 3 of the 8-PR plan). This introduces a separation between data providers (where data lives) and compute providers (where jobs run), enabling future support for cross-platform scenarios such as Redshift data on Databricks compute.

Key Changes:
- Implement DatabricksComputeProvider class with full lifecycle methods:
  * prepare_stitch_job(): PII scanning, config generation, init script upload
  * launch_stitch_job(): Config upload, job submission, notebook creation
  * get_job_status(): Job monitoring via Databricks Jobs API
- Add dual code paths in setup_stitch.py with a USE_COMPUTE_PROVIDER env var for A/B testing between the new provider and the legacy stitch_tools (see the sketch below)
- Support multi-target Stitch jobs across multiple catalog.schema targets
- Filter unsupported column types (ARRAY, MAP, STRUCT, BINARY, INTERVAL)
- Auto-create Stitch report notebooks after job launch
- Add versioned cluster init script uploads to prevent concurrent conflicts

Testing:
- Add 19 comprehensive unit tests covering all provider methods
- All tests passing with proper mock isolation
- Verified multi-target functionality with 3 catalogs (punit, punit_02, punit_local)

Implementation Details:
- Uses DatabricksAPIClient directly (PR 6 will integrate DatabricksProviderAdapter)
- Maintains backward compatibility via the environment variable toggle
- Adds metrics tracking with a code_path tag for new-vs-legacy comparison
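For illustration, the dual-path toggle described above could be wired roughly like this. This is a minimal sketch, assuming a simple truthy environment flag; the function names and the real dispatch logic in setup_stitch.py may differ.

```python
import os


def use_compute_provider() -> bool:
    """Temporary A/B toggle sketch; actual flag handling in setup_stitch.py may differ."""
    return os.environ.get("USE_COMPUTE_PROVIDER", "").lower() in ("1", "true", "yes")


def run_stitch_setup(config, new_path, legacy_path):
    """Dispatch between the new provider-based path and the legacy stitch_tools path."""
    if use_compute_provider():
        return new_path(config)   # e.g. DatabricksComputeProvider-based setup
    return legacy_path(config)    # e.g. existing stitch_tools logic


# Usage sketch with stand-in callables:
result = run_stitch_setup(
    {"targets": ["catalog.schema"]},
    new_path=lambda cfg: ("provider", cfg),
    legacy_path=lambda cfg: ("legacy", cfg),
)
```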
This PR implements detailed scaffolding for the EMRComputeProvider as outlined in the refactoring guide, preparing the foundation for future EMR integration while maintaining complete backward compatibility.

Key Changes:

EMR Compute Provider (chuck_data/compute_providers/emr.py):
- Expanded class docstring with architecture overview and implementation notes
- Added detailed future implementation plans for all four core methods:
  * prepare_stitch_job: PII scanning, config generation, S3 uploads, EMR steps
  * launch_stitch_job: Cluster validation, step submission, job registration
  * get_job_status: EMR API polling with unified status mapping
  * cancel_job: Step cancellation via EMR CancelSteps API
- Added s3_bucket parameter to __init__ for artifact storage
- Enhanced method docstrings with step-by-step implementation workflows
- Added usage examples and AWS credential discovery documentation
- Included logging for provider initialization
- Clear NotImplementedError messages directing to future PRs

Enhanced Test Coverage (tests/unit/compute_providers/test_emr.py):
- Expanded from 7 to 18 comprehensive test cases
- Added initialization tests for various configuration scenarios
- Added tests for EMR-specific parameters (IAM roles, EC2 keys, Spark config)
- Added interface compatibility tests with DatabricksComputeProvider
- Verified all methods raise NotImplementedError with informative messages
- Added method signature validation tests

Provider Factory Updates (chuck_data/provider_factory.py):
- Added explicit s3_bucket parameter handling for EMR provider creation
- Ensures s3_bucket is properly passed and not absorbed by kwargs

Validation Results:
- All 18 EMR provider tests passing
- All 57 provider tests (data + compute) passing
- Factory successfully creates EMR providers
- No behavior changes to existing workflows
- Complete interface compatibility with DatabricksComputeProvider

This scaffolding validates the factory pattern's extensibility and provides clear documentation for the future EMR implementation without any risk to existing Databricks-based workflows.
Implement PR 5 of the refactoring plan: create a storage provider abstraction to separate artifact upload logic from compute providers.

## Changes

### Storage Providers (new)
- Add StorageProvider protocol defining upload_file() interface
- Implement DatabricksVolumeStorage for Unity Catalog Volumes
  * Wraps DatabricksAPIClient for file uploads
  * Supports client reuse to avoid duplicate connections
- Implement S3Storage for Amazon S3
  * Uses boto3 with full AWS credential chain support
  * Supports profiles, explicit credentials, IAM roles
  * Parses s3:// URLs and validates paths

### Compute Provider Integration
- Update DatabricksComputeProvider to use storage_provider.upload_file()
  * Automatically creates DatabricksVolumeStorage if not provided
  * Maintains backward compatibility
- Update EMRComputeProvider to create S3Storage by default
  * Supports optional storage_provider injection

### Factory Updates
- Add ProviderFactory.create_storage_provider() method
  * Supports "databricks" and "s3" provider types
  * Environment variable fallbacks (DATABRICKS_*, AWS_*)
  * Optional client/session reuse

### Testing
- Add 30 storage provider unit tests (100% coverage)
  * 11 DatabricksVolumeStorage tests
  * 19 S3Storage tests
- Update compute provider tests to mock storage providers
- Add 8 factory tests for storage provider creation
- Add pytest autouse fixture for boto3 mocking in EMR tests

## Benefits
- Clean separation of concerns (storage vs compute vs data)
- Easy to add new storage backends in the future
- Reusable storage providers across different compute providers
- Backward compatible - no breaking changes

Files: 8 created, 6 modified (~1,100 lines)
Tests: 84/84 passing
… 1) (#53)

Create reusable validation module to eliminate inline validation code and provide consistent error handling across Stitch setup commands.

New files:
- chuck_data/commands/validation.py (214 lines)
  * validate_single_target_params() - single catalog/schema validation
  * validate_multi_target_params() - multi-target format validation
  * validate_stitch_config_structure() - Stitch config structure validation
  * validate_provider_required() - provider presence validation
  * validate_amperity_token() - Amperity token validation
- tests/unit/commands/test_validation.py (516 lines, 43 tests)
  * Comprehensive test coverage for all validation functions
  * Edge case handling (empty strings, invalid formats, None values)

Modified files:
- chuck_data/commands/stitch_tools.py
  * Integrated validation module at 4 key points
  * Replaced inline validation with centralized functions
  * Added validation imports and a TODO for future provider abstraction
- chuck_data/commands/setup_stitch.py
  * Integrated validation module at 6 key points
  * Consistent validation across interactive and auto-confirm modes
  * Supports both single-target and multi-target Stitch configurations
- tests/unit/commands/test_stitch_tools.py
  * Updated test assertions for the new validation error messages

Benefits:
- Eliminates 6+ instances of duplicated inline validation code
- Provides consistent, user-friendly error messages
- Foundation for future provider abstraction work (PR 6 Phase 2)
- All 524 tests passing (43 validation + 481 existing)

This is Phase 1 of PR 6 (Command Handler Cleanup). Phase 2 will refactor function signatures to use the DataProvider and ComputeProvider abstractions instead of raw API clients.
#54)

This commit completes Phase 2 of PR 6 by eliminating the temporary dual code paths and feature flag used during the compute provider abstraction rollout. All Stitch setup operations now exclusively use the compute provider abstraction.

Changes:
- Remove USE_COMPUTE_PROVIDER environment variable flag
- Consolidate _handle_compute_provider_setup and _handle_legacy_setup into a single _handle_auto_confirm_setup function
- Delete deprecated _helper_setup_stitch_logic wrapper (24 lines)
- Remove _handle_legacy_setup function entirely (120 lines)
- Update metrics tracking to use consistent event context names
- Remove "code_path" field from metrics events

Test fixes:
- Fix mock patch paths in test_service.py and test_setup_stitch.py
- Change from patching at the definition site to the import site
- Add workspace_url and token attributes to ConnectionStubMixin
- All 27 stitch and policy-related tests passing

Impact:
- setup_stitch.py: 1,001 → 861 lines (-140 lines, -14%)
- stitch_tools.py: 843 → 819 lines (-24 lines, -2.8%)
- Total: 1,844 → 1,680 lines (-164 lines, -8.9%)
- Cleaner single code path using the provider abstraction throughout
- No functional changes - all existing workflows preserved
### Overview
Adds comprehensive AWS Redshift support to chuck-data through a provider
abstraction layer, enabling seamless multi-platform operations while
preserving native platform terminology.
### Key Changes
#### 1. Provider Abstraction Layer
- **DataProvider protocol** for unified operations across platforms
- **Adapters** wrapping DatabricksAPIClient and RedshiftAPIClient
- Factory pattern for clean provider instantiation
- Supports Databricks Unity Catalog and AWS Redshift
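As an illustration of the pattern, a minimal sketch of a data-provider protocol, adapter, and factory is shown below. All names and method signatures here are simplified assumptions; the actual DataProvider interface and adapters in chuck_data may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DataProvider(Protocol):
    """Unified interface over platform-specific clients (illustrative only)."""

    def list_tables(self, namespace: str) -> list[str]: ...
    def tag_columns(self, table: str, tags: dict[str, str]) -> None: ...


class RedshiftProviderAdapterSketch:
    """Wraps a Redshift client behind the DataProvider interface (sketch)."""

    def __init__(self, client):
        self._client = client

    def list_tables(self, namespace: str) -> list[str]:
        # Delegate to the underlying client; the real adapter adds error handling.
        return self._client.list_tables(namespace)

    def tag_columns(self, table: str, tags: dict[str, str]) -> None:
        self._client.store_semantic_tags(table, tags)


def create_data_provider(kind: str, client) -> DataProvider:
    """Factory sketch: map a provider name to its adapter class."""
    adapters = {"aws_redshift": RedshiftProviderAdapterSketch}
    return adapters[kind](client)
```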
#### 2. Redshift Integration
- **Complete Redshift Data API client** with query execution, table
operations, and metadata management
- **Redshift-specific commands**: `list_databases`, `select_database`
- **Semantic tagging**: Stores tags in `chuck_metadata.semantic_tags`
table
#### 3. Provider-Aware Command System
- Commands marked with `provider` field ("databricks", "aws_redshift",
or null)
- Dynamic filtering based on active provider
- **Databricks commands**: `list_catalogs`, `select_catalog`
- **Redshift commands**: `list_databases`, `select_database`
- Preserves native terminology (catalogs vs databases)
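A simplified sketch of how the `provider` field could drive this filtering (the real command registry in chuck_data uses its own structures; names here are assumptions):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandDef:
    name: str
    provider: Optional[str] = None  # "databricks", "aws_redshift", or None (agnostic)


COMMANDS = [
    CommandDef("list_catalogs", provider="databricks"),
    CommandDef("list_databases", provider="aws_redshift"),
    CommandDef("help"),  # provider-agnostic
]


def commands_for(active_provider: str) -> list[str]:
    """Return the commands visible for the active data provider."""
    return [c.name for c in COMMANDS if c.provider in (None, active_provider)]


print(commands_for("aws_redshift"))  # ['list_databases', 'help']
```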
#### 4. Intelligent Agent Adaptation
- Provider-specific system prompts with correct terminology
- Dynamic prompt selection based on active provider
- Context-aware guidance for platform-specific operations
#### 5. Bulk PII Tagging Refactor
- Unified `DataProvider.tag_columns()` interface
- **Databricks**: ALTER TABLE SET TAGS SQL
- **Redshift**: chuck_metadata.semantic_tags table
- Atomic batch operations with error handling
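For illustration, the two backends might render a tag roughly as follows. This is a sketch only: the Databricks statement mirrors standard Unity Catalog column-tag SQL, while the Redshift side-table column names are assumptions about the chuck_metadata.semantic_tags schema.

```python
def databricks_tag_sql(table: str, column: str, tags: dict[str, str]) -> str:
    """Sketch of the ALTER TABLE ... SET TAGS statement the Databricks path could emit."""
    rendered = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS ({rendered})"


def redshift_tag_sql(table: str, column: str, tags: dict[str, str]) -> list[str]:
    """Sketch of inserts into the chuck_metadata.semantic_tags side table (column names assumed)."""
    return [
        f"INSERT INTO chuck_metadata.semantic_tags (table_name, column_name, tag, value) "
        f"VALUES ('{table}', '{column}', '{k}', '{v}')"
        for k, v in tags.items()
    ]
```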
#### 6. Additional Infrastructure
- Compute provider scaffolding (Databricks, EMR)
- Storage provider abstraction (S3, DBFS)
- Extracted Stitch validation logic into reusable module
- Message sanitization for LLM API compatibility
### Testing
- 780+ lines of Redshift client tests
- 281 lines of adapter tests
- 597 lines of Databricks compute provider tests
- 499 lines of validation tests
- All existing tests passing
### Configuration
New fields in `ChuckConfig`:
- `data_provider`: "databricks" or "aws_redshift"
- `redshift_workgroup_name` / `redshift_cluster_id`: Redshift connection
- `redshift_database`: Active database
- `aws_region`: AWS region
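A rough sketch of how these fields might be supplied for a Redshift setup (field names come from this PR; the values and the update call are illustrative):

```python
# Illustrative values only; actual handling lives in chuck_data/config.py.
redshift_settings = {
    "data_provider": "aws_redshift",
    "redshift_workgroup_name": "my-workgroup",  # or redshift_cluster_id for a provisioned cluster
    "redshift_database": "dev",
    "aws_region": "us-west-2",
}

# Hypothetical call; the real ConfigManager API may differ.
# config_manager.update(**redshift_settings)
```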
### Impact
**48 files changed, 7,383 insertions(+), 324 deletions(-)**
Existing Databricks workflows unchanged. Commands automatically filtered
based on active provider. Agent adapts prompts dynamically.
# Setup Wizard Provider Selection & Redshift Integration Enhancements
## Overview
This PR introduces comprehensive improvements to the setup wizard,
enabling flexible provider combinations (Databricks + Redshift, Redshift
+ Databricks compute, etc.) and fixing critical bugs in multi-provider
configurations. The changes support the full Redshift + Databricks
integration flow end-to-end.
**Branch**: `CHUCK-10-pr7-setup-wizard-provider-selection`
**Base**: `CHUCK-10-redshift`
**Changes**: 53 files changed, 6,594 insertions(+), 997 deletions(-)
**Commits**: 29
---
## 🎯 Key Features
### 1. **Enhanced Setup Wizard with Provider Selection**
The wizard now supports explicit provider selection for data, compute,
and LLM:
```
Setup Flow:
┌─────────────────────────────────────────────────────────────┐
│ 1. Amperity Auth │
├─────────────────────────────────────────────────────────────┤
│ 2. Data Provider Selection │
│ ├─ Databricks (Unity Catalog) │
│ └─ AWS Redshift │
├─────────────────────────────────────────────────────────────┤
│ 3a. IF Databricks → Workspace URL + Token │
│ 3b. IF Redshift → AWS Profile + Region + Account ID + │
│ Cluster/Workgroup + S3 + IAM Role │
├─────────────────────────────────────────────────────────────┤
│ 4. Compute Provider Selection │
│ └─ Databricks (required for both data providers) │
├─────────────────────────────────────────────────────────────┤
│ 5. LLM Provider Selection │
│ ├─ Databricks │
│ └─ AWS Bedrock │
├─────────────────────────────────────────────────────────────┤
│ 6. Model Selection (based on LLM provider) │
├─────────────────────────────────────────────────────────────┤
│ 7. Usage Tracking Consent │
└─────────────────────────────────────────────────────────────┘
```
**New wizard steps added:**
- `DataProviderSelectionStep` - Choose between Databricks or Redshift
- `ComputeProviderSelectionStep` - Choose compute backend
- `AWSProfileInputStep` - Configure AWS profile for Redshift
- `AWSRegionInputStep` - Configure AWS region
- `AWSAccountIdInputStep` - Configure AWS account ID (required for
Redshift manifests)
- `RedshiftClusterSelectionStep` - Select Redshift cluster or serverless
workgroup
- `S3BucketInputStep` - Configure S3 for Spark-Redshift connector
- `IAMRoleInputStep` - Configure IAM role for Redshift access
### 2. **Provider Abstraction & Dependency Injection**
Introduced clean separation between data providers, compute providers,
and storage providers:
```python
# Before: Tightly coupled, hard-coded Databricks assumptions
client = DatabricksAPIClient(...)
compute = DatabricksComputeProvider(...)
# After: Flexible provider composition with dependency injection
data_provider = ProviderFactory.create_data_provider("redshift", config)
storage_provider = ProviderFactory.create_storage_provider("s3", config)
compute_provider = ProviderFactory.create_compute_provider(
    "databricks",
    config,
    storage_provider=storage_provider,  # Injected dependency
)
```
**Key abstractions:**
- `IStorageProvider` protocol - Abstract storage (S3, DBFS, Volumes)
- Storage provider injection into compute providers
- Provider detection utilities for automatic routing
- Runtime-checkable protocols for proper type safety
### 3. **Redshift-Specific Commands & Configuration**
Added dedicated Redshift commands and configuration management:
**New Commands:**
- `/list-redshift-schemas` - List Redshift schemas with database context
- `/select-redshift-schema` - Select active Redshift database and schema
- `/redshift-status` - Show current Redshift configuration
**Configuration Management:**
- Automatic cleanup of incompatible config on provider switch
- Proper persistence of AWS account ID, region, cluster info
- Redshift-specific config fields (`redshift_workgroup_name`,
`redshift_iam_role`, etc.)
### 4. **Enhanced Stitch Integration for Redshift**
Complete end-to-end Stitch support for Redshift data sources:
```
Redshift Stitch Flow:
┌──────────────────────────────────────────────────────────────┐
│ 1. Scan Redshift tables for PII (chuck-data) │
│ - Uses LLM to detect semantic tags │
│ - Stores tags in chuck_metadata.semantic_tags │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Generate manifest JSON with semantic tags │
│ - Includes redshift_config with all connection details │
│ - Includes aws_account_id for JDBC URL construction │
│ - Uploads to S3 for Databricks job access │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Submit Stitch job to Databricks │
│ - Fetches init script from Amperity API │
│ - Uploads init script to S3 │
│ - Submits job with manifest and init script paths │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Stitch job processes Redshift data │
│ - Reads manifest from S3 │
│ - Connects to Redshift using Spark-Redshift connector │
│ - Attaches semantic tags to DataFrame metadata │
│ - Runs identity resolution │
│ - Writes results back to Redshift │
└──────────────────────────────────────────────────────────────┘
```
**Manifest generation improvements:**
- Proper `redshift_config` with all required fields
- `aws_account_id` included for JDBC URL construction
- Explicit `data_provider` and `compute_provider` fields
- Support for both provisioned clusters and serverless workgroups
### 5. **Storage Provider Abstraction**
New storage abstraction for managing artifacts across backends:
```python
from typing import Protocol


class IStorageProvider(Protocol):
    """Protocol for storage backends (S3, DBFS, Volumes)"""

    def upload_file(self, local_path: str, remote_path: str) -> bool: ...
    def download_file(self, remote_path: str, local_path: str) -> bool: ...
    def exists(self, remote_path: str) -> bool: ...
    def delete(self, remote_path: str) -> bool: ...
```
**Implementations:**
- `S3StorageProvider` - AWS S3 backend (for Redshift)
- `DBFSStorageProvider` - Databricks DBFS (legacy)
- `VolumesStorageProvider` - Unity Catalog Volumes (preferred for
Databricks)
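For illustration, an S3-backed implementation of the upload path might look like the sketch below. It is a minimal sketch, not the actual `S3StorageProvider`: the class name, path parsing, and error handling are simplified assumptions, and only standard boto3 calls are used.

```python
from typing import Optional

import boto3
from botocore.exceptions import ClientError


class S3StorageProviderSketch:
    """Minimal illustration of an S3 backend satisfying part of IStorageProvider."""

    def __init__(self, session: Optional[boto3.session.Session] = None):
        self._s3 = (session or boto3.session.Session()).client("s3")

    @staticmethod
    def _split(remote_path: str) -> tuple:
        # "s3://bucket/prefix/key" -> ("bucket", "prefix/key")
        without_scheme = remote_path.removeprefix("s3://")
        bucket, _, key = without_scheme.partition("/")
        return bucket, key

    def upload_file(self, local_path: str, remote_path: str) -> bool:
        bucket, key = self._split(remote_path)
        self._s3.upload_file(local_path, bucket, key)  # standard boto3 S3 client call
        return True

    def exists(self, remote_path: str) -> bool:
        bucket, key = self._split(remote_path)
        try:
            self._s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False
```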
### 6. **Provider-Aware Command Routing**
Commands now automatically detect the active provider and route
appropriately:
```python
# Before: Commands assumed Databricks
def handle_command(client, **kwargs):
    catalogs = client.list_catalogs()  # Always Databricks


# After: Commands detect provider and route correctly
def handle_command(client, **kwargs):
    if is_redshift_client(client):
        databases = client.list_databases()  # Redshift
    else:
        catalogs = client.list_catalogs()  # Databricks
```
**Provider detection:**
- `is_redshift_client()` - Check if client is RedshiftAPIClient
- `is_databricks_client()` - Check if client is DatabricksAPIClient
- Automatic routing in agent tool executor
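These helpers are presumably thin isinstance checks over the two client classes named in this PR; a sketch (the real utilities in chuck_data/data_providers/utils.py may handle additional wrapper types):

```python
from chuck_data.clients.databricks import DatabricksAPIClient
from chuck_data.clients.redshift import RedshiftAPIClient


def is_redshift_client(client) -> bool:
    """True when the active client talks to Redshift."""
    return isinstance(client, RedshiftAPIClient)


def is_databricks_client(client) -> bool:
    """True when the active client talks to Databricks."""
    return isinstance(client, DatabricksAPIClient)
```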
---
## 🐛 Critical Bug Fixes
### Bug #1: LLM Provider Selection with Redshift
**Problem:**
When using Redshift as data provider and Databricks as LLM provider, the
wizard crashed with:
```
AttributeError: 'RedshiftAPIClient' object has no attribute 'list_models'
```
**Root cause:**
The wizard was passing `service.client` (RedshiftAPIClient) to
`DatabricksProvider`, which expected a DatabricksAPIClient or None.
**Fix:**
Added type checking to only use service.client if it's a
DatabricksAPIClient:
```python
# Added in LLMProviderSelectionStep and ModelSelectionStep
service = get_chuck_service()
databricks_client = None
if service and service.client and isinstance(service.client, DatabricksAPIClient):
    databricks_client = service.client  # Only use if correct type

databricks_provider = DatabricksProvider(
    workspace_url=state.workspace_url,
    token=state.token,
    client=databricks_client,  # None if data provider is Redshift
)
```
### Bug #2: ConfigManager Not Saving Dynamic Fields
**Problem:**
`aws_account_id` and other dynamic fields were silently dropped when
saving config, even though `ChuckConfig` has `extra="allow"`.
**Root cause:**
`ConfigManager.update()` had a `hasattr()` check that prevented
non-schema fields from being set:
```python
# Old buggy code
for key, value in kwargs.items():
    if hasattr(config, key):  # Prevented dynamic fields!
        setattr(config, key, value)
```
**Fix:**
Removed the `hasattr()` check to allow all fields:
```python
# Fixed code
for key, value in kwargs.items():
    setattr(config, key, value)  # Now accepts all fields
```
**Impact:**
- `aws_account_id` now properly saved to config
- Included in generated Redshift manifests
- All other dynamic config fields also work correctly
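Under the hood this relies on the config model actually keeping undeclared fields. Assuming `ChuckConfig` is a Pydantic model (the `extra="allow"` wording above suggests so), a minimal stand-in illustrates the behavior; the class below is purely hypothetical:

```python
from pydantic import BaseModel, ConfigDict


class ChuckConfigSketch(BaseModel):
    """Hypothetical stand-in for ChuckConfig; extra='allow' keeps undeclared fields."""
    model_config = ConfigDict(extra="allow")

    data_provider: str = "databricks"


# Dynamic fields such as aws_account_id survive because of extra="allow";
# the ConfigManager fix simply stopped filtering them out before setattr().
cfg = ChuckConfigSketch(aws_account_id="123456789012")
print(cfg.model_dump())  # {'data_provider': 'databricks', 'aws_account_id': '123456789012'}
```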
---
## 📊 Test Coverage
**New test files:**
- `tests/unit/commands/wizard/test_state.py` - 401 lines, comprehensive
wizard state tests
- `tests/unit/commands/wizard/test_steps.py` - 589 lines, wizard step
validation
- `tests/unit/test_workspace_and_init_scripts.py` - 446 lines, workspace
APIs and protocols
**Updated test coverage:**
- Setup wizard tests updated for new flow
- Stitch integration tests updated for Redshift support
- Compute provider tests updated for dependency injection
- Service tests updated for provider-aware routing
**Test isolation improvements:**
- Tests now use temporary config files to avoid modifying user's
`~/.chuck_config.json`
- Proper cleanup of test artifacts
- Mock providers for unit testing
---
## 🔄 Migration Impact
### Breaking Changes
None - all changes are backward compatible.
### New Required Fields for Redshift Manifests
Generated manifests now include:
```json
{
"settings": {
"data_provider": "redshift",
"compute_provider": "databricks",
"redshift_config": {
"database": "dev",
"schema": "public",
"workgroup_name": "my-workgroup",
"region": "us-west-2",
"aws_account_id": "123456789012" // NEW - required
},
"s3_temp_dir": "s3://bucket/temp/",
"redshift_iam_role": "arn:aws:iam::123456789012:role/Role"
}
}
```
### Config File Changes
New config fields (all optional, added only when Redshift is selected):
- `aws_account_id` - AWS account ID for Redshift
- `aws_region` - AWS region
- `aws_profile` - AWS profile name
- `redshift_workgroup_name` - Serverless workgroup name
- `redshift_cluster_identifier` - Provisioned cluster identifier
- `redshift_iam_role` - IAM role ARN
- `redshift_s3_temp_dir` - S3 temp directory for Spark-Redshift
- `s3_bucket` - S3 bucket for artifacts
---
## 📁 Key File Changes
### Core Setup & Configuration (10 files)
- `chuck_data/commands/setup_wizard.py` - Orchestrator for new wizard
flow
- `chuck_data/commands/wizard/steps.py` - All wizard step
implementations (+666 lines)
- `chuck_data/commands/wizard/state.py` - Wizard state management (+139
lines)
- `chuck_data/config.py` - Config manager with dynamic field support
(+101 lines)
- `chuck_data/service.py` - Provider-aware service initialization (+116
lines)
### Provider Abstraction (8 files)
- `chuck_data/provider_factory.py` - Factory for creating providers (+43
lines)
- `chuck_data/compute_providers/databricks.py` - Databricks with storage
injection
- `chuck_data/compute_providers/emr.py` - EMR with storage support
- `chuck_data/data_providers/utils.py` - Provider detection utilities
(NEW, 172 lines)
- `chuck_data/storage/manifest.py` - Manifest generation for Redshift
(NEW, 378 lines)
### Redshift Integration (5 files)
- `chuck_data/clients/redshift.py` - Enhanced Redshift client (+114
lines)
- `chuck_data/commands/list_redshift_schemas.py` - NEW command (118
lines)
- `chuck_data/commands/redshift_schema_selection.py` - NEW command (183
lines)
- `chuck_data/commands/redshift_status.py` - NEW command (98 lines)
- `chuck_data/commands/setup_stitch.py` - Full Redshift support (+1317
lines)
### Client Enhancements (3 files)
- `chuck_data/clients/databricks.py` - Workspace and init script APIs
(+147 lines)
- `chuck_data/clients/amperity.py` - Moved init script fetch here (+55
lines)
- `chuck_data/ui/tui.py` - Provider-aware UI updates (+163 lines)
---
## ✅ Testing Checklist
- [x] Setup wizard completes successfully for Databricks data provider
- [x] Setup wizard completes successfully for Redshift data provider
- [x] Setup wizard properly saves `aws_account_id` to config
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Databricks LLM)
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Bedrock LLM)
- [x] Generated manifests include all required fields for Redshift
- [x] Stitch setup works end-to-end with Redshift
- [x] Provider detection correctly routes commands
- [x] Config cleanup happens when switching providers
- [x] All unit tests pass
- [x] Test isolation prevents modifying user config
---
## 🎬 Demo Flow
### Complete Redshift + Databricks Setup
```bash
# 1. Run setup wizard
chuck> /setup
# Wizard flow:
# ✓ Amperity Auth (browser-based OAuth)
# ✓ Select Data Provider: AWS Redshift
# ✓ Enter AWS Profile: default
# ✓ Enter AWS Region: us-west-2
# ✓ Enter AWS Account ID: 123456789012
# ✓ Enter Redshift Workgroup: my-workgroup
# ✓ Enter S3 Bucket: my-bucket
# ✓ Enter IAM Role: arn:aws:iam::123456789012:role/RedshiftRole
# ✓ Select Compute Provider: Databricks
# ✓ Enter Workspace URL: https://my-workspace.databricks.com
# ✓ Enter Databricks Token: dapi***
# ✓ Select LLM Provider: Databricks
# ✓ Select Model: databricks-meta-llama-3-1-70b-instruct
# ✓ Usage Consent: yes
# 2. Check configuration
chuck> /redshift-status
✓ Data Provider: AWS Redshift
✓ Region: us-west-2
✓ Workgroup: my-workgroup
✓ Account ID: 123456789012
# 3. Select database and schema
chuck> /select-redshift-schema
# Lists databases, then schemas
# 4. Run Stitch setup
chuck> /setup-stitch
# Generates manifest, uploads to S3, submits Databricks job
✓ Manifest: s3://my-bucket/chuck/manifests/redshift_dev_public_20241218.json
✓ Job submitted: run-id 12345
```
---
## 📝 Commit History Summary
**Provider Abstraction & Architecture** (9 commits)
- Add storage provider abstraction
- Integrate storage providers into compute providers
- Make commands provider-aware
- Add provider detection utilities
- Make protocols runtime-checkable
**Redshift Integration** (8 commits)
- Add Redshift-specific commands
- Update setup_stitch for Redshift support
- Add manifest generation with semantic tags
- Flow AWS credentials through wizard
- Add AWS account ID to config and manifests
**Setup Wizard Enhancements** (7 commits)
- Add data provider selection step
- Add AWS configuration steps (profile, region, account ID)
- Add compute provider selection
- Update wizard orchestration
- Add comprehensive wizard tests
**Bug Fixes & Quality** (5 commits)
- Fix ConfigManager dynamic field saving
- Fix LLM provider selection with Redshift
- Fix test isolation issues
- Update all affected tests
- Add explicit provider fields to manifests
---
## 🚀 Next Steps
After merge, the following work can proceed:
1. End-to-end integration testing with real Redshift cluster
2. Performance validation on large Redshift tables
3. User documentation and guides
4. EMR compute provider support (architecture is ready)
---
## 📚 Related Documentation
- [Redshift Integration
Guide](../app/service/stitch/stitch-standalone/doc/redshift-integration.md)
- [Backend
Abstraction](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/backend_abstraction.clj)
- [Generic Main Entry
Point](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/generic_main.clj)
Extract common logic for manifest preparation and job launch into reusable helper functions, reducing duplication between interactive and auto-confirm execution paths.

Changes:
- Add _redshift_prepare_manifest() helper for steps 1-5 (read tags, schemas, generate/validate/upload manifest)
- Add _redshift_execute_job_launch() helper for steps 6-8 (fetch init script, submit job, create notebook)
- Update _redshift_phase_2_confirm() to use the new helpers and remove unused client/compute_provider parameters
- Update the auto-confirm path in _handle_redshift_stitch_setup() to use the helpers
- Add 12 new unit tests covering all helper functions and edge cases
- Fix test signature mismatches in test_workspace_and_init_scripts.py

Impact:
- Eliminates ~280 lines of duplicate code (50% reduction in Redshift setup)
- Improves maintainability with a single source of truth for manifest prep and job launch logic
- Maintains full backward compatibility and test coverage (38/38 tests passing)
…ons in setup wizard (#60)

This commit introduces dynamic step numbering for the setup wizard and enforces valid data provider + compute provider combinations to match chuck-api backend constraints.

## Dynamic Step Numbering
- Add `step_number` and `visited_steps` fields to WizardState for tracking progression through the wizard
- Implement dynamic step numbering that adapts to different setup paths:
  * Databricks-only path: 8 steps
  * Redshift + Databricks compute path: 15 steps
  * Databricks + EMR compute path: 11 steps
- Step numbers increment only when moving forward, not on retries or errors
- Renderer now uses state-based step numbers instead of hardcoded values
- Context persistence updated to save/restore step_number and visited_steps

## Provider Combination Validation
- Define `VALID_PROVIDER_COMBINATIONS` constant enforcing:
  * Databricks data → Databricks compute only
  * Redshift data → Databricks or EMR compute
- Dynamically filter compute provider options based on the selected data provider
- Add validation in ComputeProviderSelectionStep to reject invalid combinations
- Update prompts to show only valid options for each data provider

## Setup Stitch Wizard Enhancements
- Add comprehensive interactive setup wizard for configuring Stitch jobs
- Support for both Databricks and Redshift data sources
- Guided workflow through data provider, compute provider, and LLM selection
- Interactive catalog/schema/table selection with data preview
- Field semantic tagging with LLM-powered PII detection
- Column type validation and unsupported type handling
- Integration with EMR and Databricks compute providers

## Test Coverage
- Fix 16 test failures across EMR client, provider, and adapter tests
- Add comprehensive test suites:
  * test_provider_combinations.py: 13 tests for validation logic
  * test_llm_provider_filtering.py: dynamic LLM provider filtering tests
  * test_setup_wizard_step_numbering.py: step numbering integration tests
  * test_renderer.py: renderer step numbering tests
- Update existing tests to handle the new wizard flow (Databricks credentials collected before compute provider selection)
- All 1027 tests passing (added 13 new tests, fixed 16 existing)

## Bug Fixes
- Fix EMR client test expectations for spark-submit command structure
- Fix EMR provider mock paths to correctly patch imported modules
- Fix Redshift adapter tests to return proper COUNT query results
- Update wizard flow tests to remove the obsolete compute provider selection step

This implementation ensures users can only configure valid combinations supported by the chuck-api backend, preventing configuration errors and improving the overall setup experience with clear step-by-step guidance.
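A sketch of how such a combination table might be expressed and checked during wizard option filtering; the data shape and function names below are assumptions, only the allowed pairings come from the commit above:

```python
# Illustrative only; the actual VALID_PROVIDER_COMBINATIONS constant may be shaped differently.
VALID_PROVIDER_COMBINATIONS = {
    "databricks": {"databricks"},
    "aws_redshift": {"databricks", "emr"},
}


def compute_options(data_provider: str) -> list:
    """Compute providers offered in the wizard for a given data provider."""
    return sorted(VALID_PROVIDER_COMBINATIONS.get(data_provider, set()))


def validate_combination(data_provider: str, compute_provider: str) -> None:
    """Reject pairings the chuck-api backend does not support."""
    if compute_provider not in VALID_PROVIDER_COMBINATIONS.get(data_provider, set()):
        raise ValueError(
            f"{compute_provider!r} compute is not supported with {data_provider!r} data"
        )


print(compute_options("aws_redshift"))  # ['databricks', 'emr']
```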
Add full EMR compute provider support to the /job-status, /jobs, and /monitor-job commands, with automatic provider detection and backward compatibility.

Changes:
- Add get_aws_region() function to the config module for EMR configuration
- Implement EMR step ID detection and auto-provider selection in /job-status
  - Auto-detect EMR (s-*) vs Databricks (numeric) from the run/step ID format
  - Add --step-id parameter for direct EMR step queries
  - Fetch live EMR data via EMRAPIClient when the --live flag is used
  - Display an EMR-specific section with step details and console URL
- Update /monitor-job to be compute-provider-agnostic
  - Add --step-id parameter (treated the same as --run-id internally)
  - Make step_id take precedence over run_id when both are provided
  - Update error messages to be provider-neutral
- Add EMR job caching in EMRComputeProvider.launch_stitch_job()
  - Cache job_id, step_id, cluster_id, region for status lookups
- Default to Databricks for non-EMR IDs to maintain compatibility
- /jobs command requires no changes (already provider-agnostic)

Testing:
- Add 15 comprehensive tests for EMR job-status scenarios
- Add 12 comprehensive tests for EMR monitor-job scenarios
- Update 2 existing tests to match the new provider-agnostic messages
- All 107 tests passing (71 job-status + 36 monitor-job)

Backward compatibility:
- All existing Databricks workflows unchanged
- Existing parameters work as before
- The backend databricks-run-id field is reused for EMR step IDs
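A minimal sketch of the ID-format heuristic described above (the exact rules in /job-status are assumptions beyond the s-* prefix and the Databricks default):

```python
def detect_compute_provider(job_id: str) -> str:
    """Guess the compute backend from the ID shape: EMR step IDs start with 's-';
    anything else (e.g. numeric Databricks run IDs) defaults to Databricks."""
    if job_id.startswith("s-"):
        return "emr"
    return "databricks"


assert detect_compute_provider("s-3ABCDEFG12345") == "emr"
assert detect_compute_provider("12345") == "databricks"
```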
Implement provider tags on Databricks-specific commands to enable
automatic filtering based on the active data provider (Databricks vs
Redshift). This prevents irrelevant commands from appearing in /help
and the agent tool registry.
Changes:
- Tag 11 Databricks commands with provider='databricks':
* Warehouse commands: list_warehouses, select_warehouse,
create_warehouse, warehouse
* Volume commands: list_volumes, create_volume, upload_file
* Catalog/schema: catalog, schema
* SQL execution: run_sql
* Jobs: launch_job
* Stitch: add_stitch_report
- Add comprehensive test suite (21 tests):
* Provider-specific command inclusion/exclusion tests
* Agent command and tool schema filtering tests
* TUI alias resolution with provider filtering
* Command lookup with provider parameter tests
* Data integrity and consistency tests
Testing:
- All 21 tests pass
- Validates correct filtering for Databricks, Redshift, and agnostic commands
- Ensures 19 Databricks commands appear only for Databricks provider
- Ensures 5 Redshift commands appear only for Redshift provider
- Ensures 20 provider-agnostic commands appear for all providers
Impact:
- Databricks provider: 39 user commands, 33 agent commands
- Redshift provider: 25 user commands, 19 agent commands
- Cleaner UX with only relevant commands shown per provider
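For illustration, the provider tagging described in this commit might look roughly like the sketch below. The registry shape and filter function are assumptions (they mirror what the real registry and get_user_commands presumably do), not the actual chuck_data implementation.

```python
# Hypothetical registry entry shape; the real command registry differs in detail.
COMMANDS = {
    "run_sql": {"handler": "chuck_data.commands.run_sql", "provider": "databricks"},
    "list_databases": {"handler": "chuck_data.commands.list_databases", "provider": "aws_redshift"},
    "help": {"handler": "chuck_data.commands.help", "provider": None},  # agnostic
}


def filter_commands_for_provider(active_provider: str) -> dict:
    """Keep only commands valid for the active data provider."""
    return {
        name: spec
        for name, spec in COMMANDS.items()
        if spec["provider"] in (None, active_provider)
    }


print(sorted(filter_commands_for_provider("databricks")))  # ['help', 'run_sql']
```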
Extend provider-based command filtering to autocomplete suggestions.

Previously, autocomplete would suggest all commands regardless of the active data provider, leading to confusion when irrelevant commands appeared (e.g., Databricks commands when using Redshift).

Changes:
- Update TUI._get_available_commands() to use get_user_commands() with provider filtering instead of directly accessing TUI_COMMAND_MAP
- Extract TUI aliases from the provider-filtered commands for autocomplete
- Add comprehensive test suite for autocomplete filtering (7 tests)

Testing:
- All 28 tests pass (21 command registry + 7 autocomplete tests)
- Validates that autocomplete respects the Databricks/Redshift/agnostic filters
- Ensures all TUI aliases are included for available commands
- Tests error handling, deduplication, and sorting

Impact:
- Databricks users only see Databricks commands in autocomplete
- Redshift users only see Redshift commands in autocomplete
- Built-in commands (/help, /exit, /quit, /debug) always available
- Provider-agnostic commands available for all providers

Files modified:
- chuck_data/ui/tui.py: Updated autocomplete logic
- tests/unit/ui/test_tui_autocomplete.py: New test suite (138 lines)
The /jobs command was not appearing in /help output because it was missing from the help formatter's Job Management category list.

Changes:
- Add "jobs" to the Job Management category in help_formatter.py
- Add comprehensive help formatter test suite (5 tests)

Testing:
- All 33 tests pass (21 registry + 7 autocomplete + 5 help formatter)
- Validates that /jobs appears in help output
- Validates provider filtering in help text
- Validates consistent command formatting

Impact:
- /jobs command is now visible in /help output under Job Management
- Users can discover the command to list recent jobs from the cache
pragyan-amp approved these changes on Feb 5, 2026.