diff --git a/databricks-skills/README.md b/databricks-skills/README.md index 29a79ae..c4ce74a 100644 --- a/databricks-skills/README.md +++ b/databricks-skills/README.md @@ -52,6 +52,7 @@ cp -r ai-dev-kit/databricks-skills/databricks-agent-bricks .claude/skills/ ### 📊 Analytics & Dashboards - **databricks-aibi-dashboards** - Databricks AI/BI dashboards (with SQL validation workflow) +- **databricks-powerbi-migration** - Power BI to Databricks migration (metric views, DAX-to-SQL, ERD generation, schema mapping) - **databricks-unity-catalog** - System tables for lineage, audit, billing ### 🔧 Data Engineering diff --git a/databricks-skills/databricks-powerbi-migration/1-input-scanning.md b/databricks-skills/databricks-powerbi-migration/1-input-scanning.md new file mode 100644 index 0000000..1513571 --- /dev/null +++ b/databricks-skills/databricks-powerbi-migration/1-input-scanning.md @@ -0,0 +1,168 @@ +# Input Scanning & Model Parsing (Steps 1–2) + +Steps 1 and 2 of the migration workflow: classify all input files and parse Power BI models. + +--- + +## Step 1: Scan, Classify, and Confirm All Inputs + +**Before doing anything else**, read every file in `input/`. Classify each file by content — not extension. + +```bash +python scripts/scan_inputs.py input/ -o reference/input_manifest.json +``` + +Detects: `pbi_model`, `csv_schema_dump`, `mapping_json`, `dbx_schema`, `sql_ddl`, `sql_query_output`, `csv_data`, `sample_report`, `databricks_config`, `unknown`. + +**Present classification to the user and ask:** +1. "I found these files. Here is what each appears to be: [list]. Is this correct?" +2. "How should I use each file?" +3. If no Databricks schema info found: offer schema suggestion queries (Step 5 in [2-catalog-resolution.md](2-catalog-resolution.md)). 
+
+**Do not proceed until the user confirms.**
+
+### Input File Types
+
+| Type | Format | Description |
+|------|--------|-------------|
+| `pbi_model` | `.pbit`, `.pbix`, `.bim`, TMDL directory, or JSON | Exported semantic model — detected by content, not extension |
+| `csv_schema_dump` | CSV with `table_name`, `column_name`, `data_type` headers | Schema metadata exported from INFORMATION_SCHEMA |
+| `mapping_json` | JSON with `mappings` array | Column-level mappings (Scenario C or D) |
+| `dbx_schema` | JSON, SQL DDL, or query output | Schema information from Databricks |
+| `sample_report` | `.docx`, `.pdf`, `.png`, `.jpg`, `.xlsx`, `.pptx` | Sample report for KPI reverse-engineering |
+| `databricks_config` | YAML with `host`/`token` keys | Workspace URL, PAT, warehouse, catalog, schema |
+| `csv_data` | CSV | Headers can inform schema |
+
+### CSV Schema Dump Detection
+
+A CSV file is classified as `csv_schema_dump` when its header row contains columns matching these patterns (case-insensitive):
+
+- `table_name` / `tableName` / `TABLE_NAME`
+- `column_name` / `columnName` / `COLUMN_NAME`
+- `data_type` / `dataType` / `DATA_TYPE`
+
+At least `table_name` and `column_name` must be present. When a `csv_schema_dump` is detected:
+
+1. Parse the CSV to extract table names, column names, and data types.
+2. Build a schema representation equivalent to `extract_dbx_schema.py` output.
+3. Use this schema for comparison in Step 6.
+
+---
+
+## Step 2: Parse Power BI Models
+
+```bash
+python scripts/parse_pbi_model.py input/ -o reference/pbi_model.json
+```
+
+The parser handles any file extension — it tries ZIP, JSON, and TMDL detection in sequence. Pointing it at `input/` processes every model file found there (batch mode).
+
+**Content detection order:** ZIP archive → JSON structure → TMDL text → TMDL directory
+
+### How to Export Power BI Models
+
+**Option 1: PBIT file (recommended)**
+1. Open your report in Power BI Desktop.
+2. File > Export > Power BI Template (`.pbit`).
+3. Place in `input/`.
+
+**Option 2: PBIX file**
+The parser extracts the DataModelSchema from `.pbix` files directly. Place in `input/`.
+
+**Option 3: BIM file**
+1. Open the model in [Tabular Editor](https://tabulareditor.com/).
+2. File > Save As > `model.bim`.
+3. Place in `input/`.
+
+**Option 4: TMDL directory**
+1. Enable TMDL in Power BI Desktop (Options > Preview Features).
+2. File > Save As > TMDL format.
+3. Place the directory in `input/`.
+
+**Option 5: Manual description**
+Provide table names, column names with types, DAX measures, and relationships.
+
+---
+
+## After Parsing: Immediate Catalog Validation
+
+**Immediately after parsing (before proceeding to ERD or KPI steps)**, extract all data source references from the PBI model — server names, database/catalog names, and schema names found in `partitions[].source` (connection strings, M expressions, or `Sql.Database` calls).
+
+Cross-reference these against:
+1. Schema files provided in `input/` (DDL, CSV schema dump, JSON schema)
+2. Databricks config in `input/` (host/token/catalog)
+3. Live MCP access (test with `execute_sql`)
+
+**If a referenced catalog is inaccessible AND no schema dump was provided**, raise a warning immediately:
+
+> "The PBI model references data from `<catalog>.<schema>`, but I have no schema information and cannot access this catalog. Please provide one of:
+> 1. A schema dump (CSV, DDL, or JSON) in the `input/` folder
+> 2. Databricks credentials with access to this catalog
+> 3. Run this query and paste the output: `SELECT table_name, column_name, data_type FROM <catalog>.information_schema.columns WHERE table_schema = '<schema>'`"
+
+**Do not proceed past Step 5 without resolving all catalog gaps.** See [2-catalog-resolution.md](2-catalog-resolution.md) for the full catalog resolution workflow.
+
+### Extracting Data Sources from the Parsed Model
+
+Data source references are in:
+
+1. **Partition source expressions** (`partitions[].source.expression`):
+   Look for `Sql.Database("server", "database")` in M code.
+
+   ```
+   let Source = Sql.Database("myserver.database.windows.net", "my_catalog"),
+   gold = Source{[Schema="gold"]}[Data], ...
+   ```
+
+2. **Connection string annotations** (`model.annotations` or `model.dataSources`):
+   Some models store explicit connection strings with server, database, catalog, and schema.
+
+3. **Table source metadata** (`tables[].partitions[].source`):
+   For DirectQuery tables, the `source` object may contain `schema` and `entity` (table) names.
+
+```python
+import re
+
+def extract_data_sources(model: dict) -> list[dict]:
+    sources = []
+    for table in model.get("tables", []):
+        for partition in table.get("partitions", []):
+            src = partition.get("source", {})
+            expr = src.get("expression", "")
+            match = re.search(r'Sql\.Database\("([^"]+)",\s*"([^"]+)"\)', expr)
+            if match:
+                entry = {"server": match.group(1), "catalog": match.group(2), "table": table.get("name")}
+                # Attach the schema found in this same expression, not one from a previous partition
+                schema_match = re.search(r'\[Schema="([^"]+)"\]', expr)
+                if schema_match:
+                    entry["schema"] = schema_match.group(1)
+                sources.append(entry)
+    return sources
+```
+
+---
+
+## Project Structure Reference
+
+```
+project-root/
+├── input/                  # ALL user-provided files
+│   ├── model.pbix
+│   ├── mapping.json        # Optional
+│   ├── schema_dump.sql     # Optional
+│   ├── schema_export.csv   # Optional
+│   ├── sample_report.pdf   # Optional
+│   └── databricks.yml      # Optional
+├── reference/
+│   ├── input_manifest.json # Output of scan_inputs.py
+│   └── pbi_model.json      # Output of parse_pbi_model.py
+└── temp/                   # Working/throwaway files
+```
+
+Initialize with:
+
+```bash
+bash scripts/init_project.sh
+# With all folders:
+bash scripts/init_project.sh --all
+```
diff --git a/databricks-skills/databricks-powerbi-migration/2-catalog-resolution.md b/databricks-skills/databricks-powerbi-migration/2-catalog-resolution.md
new file mode 100644
index 0000000..9f13e70
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/2-catalog-resolution.md
@@ -0,0 +1,200 @@
+# Catalog Resolution & Schema Extraction (Steps 3–5)
+
+Steps 3, 4, and 5 of the migration workflow: validate catalog accessibility, resolve catalog names, and extract Databricks schema.
+
+---
+
+## Step 3: Validate Catalog Accessibility
+
+**Immediately after parsing**, extract all data source references from the PBI model and cross-reference against:
+1. Schema files provided in `input/` (DDL, CSV schema dump, JSON schema)
+2. Databricks config in `input/` (host/token/catalog)
+3. Live MCP access (test with `execute_sql`)
+
+**If MCP is available**, test accessibility for each referenced catalog:
+
+```sql
+SELECT 1 FROM <catalog>.information_schema.tables LIMIT 1;
+```
+
+Launch parallel subagents when multiple catalogs need testing — see [9-subagent-patterns.md](9-subagent-patterns.md).
+
+### Warning Message Template
+
+If a catalog is referenced but neither accessible nor covered by input files:
+
+```
+⚠ Missing catalog access: The PBI model references `<catalog>.<schema>` (used by tables: <table list>),
+but I have no schema information and cannot access this catalog.
+
+Please provide one of:
+1. A schema dump (CSV, DDL, or JSON) for `<catalog>.<schema>` in the input/ folder
+2. Databricks credentials with access to this catalog
+3. Run this query and paste the output:
+   SELECT table_name, column_name, data_type, is_nullable, comment
+   FROM <catalog>.information_schema.columns
+   WHERE table_schema = '<schema>'
+   ORDER BY table_name, ordinal_position;
+```
+
+Update `reference/catalog_resolution.md` with an accessibility status section:
+
+```markdown
+## Catalog Accessibility Status
+
+| Catalog | Schema | Status | Source |
+|---------|--------|--------|--------|
+| my_catalog | gold | ✅ Accessible | Live MCP query |
+| other_catalog | silver | ✅ Covered | input/other_catalog_schema.csv |
+| missing_catalog | dbo | ❌ Inaccessible | No schema info — user action required |
+```
+
+**Do not proceed past Step 5 without resolving all catalog gaps.** ERD/domain generation (Step 8) can proceed with PBI-only data, but schema comparison and metric view creation require catalog access or schema dumps.
+
+---
+
+## Step 4: Resolve Catalog
+
+First, list all available catalogs:
+
+```sql
+SELECT catalog_name FROM system.information_schema.catalogs ORDER BY catalog_name;
+```
+
+Then probe schemas within the target catalog:
+
+```sql
+SELECT schema_name FROM <catalog>.information_schema.schemata;
+```
+
+Then verify table existence:
+
+```sql
+SELECT table_name
+FROM <catalog>.information_schema.tables
+WHERE table_schema = '<schema>';
+```
+
+**When multiple candidate catalogs exist** (e.g., `analytics`, `fc_analytics`), launch parallel subagents — one per catalog — to probe concurrently. See [9-subagent-patterns.md](9-subagent-patterns.md) Pattern A.
+
+### Handling fc_ Prefix
+
+Some environments prefix catalog names with `fc_`. The agent should:
+
+1. Try the catalog name as provided
+2. If not found, try with `fc_` prefix
+3. If not found, try without `fc_` prefix
+4. Document both primary and fallback catalog in `reference/catalog_resolution.md`
+
+### Output: catalog_resolution.md
+
+```markdown
+## Catalog Resolution
+
+- **Primary catalog**: `my_catalog`
+- **Fallback catalog**: `fc_my_catalog` (if applicable)
+- **Target schema**: `gold`
+- **Tables found**: 15 (listed below)
+- **Tables missing**: 2 (listed below)
+
+### Table Inventory
+| Table | Catalog | Schema | Row Count (est.) |
+|-------|---------|--------|------------------|
+| sales_fact | my_catalog | gold | ~10M |
+| product_dim | fc_my_catalog | gold | ~50K |
+```
+
+---
+
+## Step 5: Extract or Ingest Databricks Schema
+
+If no schema was found in `input/`, suggest these queries:
+
+```sql
+-- Full column schema (recommended)
+SELECT table_name, column_name, data_type, is_nullable, comment
+FROM <catalog>.information_schema.columns
+WHERE table_schema = '<schema>'
+ORDER BY table_name, ordinal_position;
+
+-- Cross-schema comparison
+SELECT table_schema, table_name, column_name, data_type
+FROM <catalog>.information_schema.columns
+WHERE table_schema IN ('<schema_a>', '<schema_b>')
+ORDER BY table_schema, table_name, ordinal_position;
+
+-- Per-table detail
+DESCRIBE TABLE EXTENDED <catalog>.<schema>.<table>;
+```
+
+Tell the user: *"Paste output in chat, save to input/, or provide catalog.schema for programmatic extraction."*
+
+**Programmatic extraction** via MCP:
+
+```
+CallMcpTool:
+  server: "user-databricks"
+  toolName: "get_table_details"
+  arguments: {"catalog": "<catalog>", "schema": "<schema>", "table_stat_level": "SIMPLE"}
+```
+
+When extracting from **multiple catalogs or schemas**, launch parallel subagents — one per catalog/schema pair. See [9-subagent-patterns.md](9-subagent-patterns.md) Pattern B.
+
+Also accept DDL files and CSV schema dumps as schema sources. 
+
+### CSV Schema Dump as Schema Source
+
+A CSV with INFORMATION_SCHEMA-style headers is treated as equivalent to `extract_dbx_schema.py` output:
+
+```csv
+table_name,column_name,data_type,is_nullable,comment
+sales_fact,sale_id,BIGINT,NO,Primary key
+sales_fact,total_amount,"DECIMAL(18,2)",YES,Order total
+customer_dim,customer_id,BIGINT,NO,Primary key
+```
+
+Detection criteria: headers must contain `table_name` + `column_name` (at minimum). `data_type` is strongly expected but not strictly required.
+
+---
+
+## Cross-Schema and Multi-Catalog Environments
+
+### Discover All Schemas in a Catalog
+
+```sql
+SELECT schema_name FROM <catalog>.information_schema.schemata;
+```
+
+### Discover All Catalogs
+
+```sql
+SELECT catalog_name FROM system.information_schema.catalogs;
+```
+
+### Cross-Schema Column Comparison
+
+```sql
+SELECT table_schema, table_name, column_name, data_type
+FROM <catalog>.information_schema.columns
+WHERE table_schema IN ('<schema_a>', '<schema_b>')
+ORDER BY table_schema, table_name, ordinal_position;
+```
+
+### When to Use Cross-Schema Probing
+
+- PBI model references tables from multiple schemas
+- Table names exist in multiple schemas (need to disambiguate)
+- Migration involves consolidating schemas
+
+---
+
+## Scripts
+
+### extract_dbx_schema.py
+
+```bash
+python scripts/extract_dbx_schema.py \
+  my_catalog my_schema -o reference/dbx_schema.json [--profile PROD]
+```
+
+**Dependencies:** `databricks-sdk`
diff --git a/databricks-skills/databricks-powerbi-migration/3-mapping-layer.md b/databricks-skills/databricks-powerbi-migration/3-mapping-layer.md
new file mode 100644
index 0000000..212a5c5
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/3-mapping-layer.md
@@ -0,0 +1,162 @@
+# Mapping Layer & Column Gap Analysis (Steps 6–7)
+
+Steps 6 and 7 of the migration workflow: choose the mapping approach and analyze column gaps. 
+ +--- + +## Step 6: Build Mapping Layer + +Choose the appropriate mapping approach based on schema comparison: + +- **Direct (A)**: Names match — no mapping needed +- **View layer (B)**: Create aliasing views +- **Mapping document (C)**: Generate JSON mapping from comparison output +- **Intermediate mapping (D)**: Extract Power Query M renames, build two-layer mapping (`pbi_column -> dbx_column`), use three-layer only where applicable (`pbi_column -> m_query_column -> dbx_column`) + +```bash +python scripts/compare_schemas.py \ + reference/pbi_model.json reference/dbx_schema.json \ + -o reference/schema_comparison.md --json --mapping reference/intermediate_mapping.json \ + --gap-analysis reference/column_gap_analysis.md +``` + +--- + +## Scenario D: Intermediate Mapping Layer + +When Power Query M expressions rename columns between the PBI semantic layer and the physical database, a direct name comparison fails. Scenario D handles this by extracting the renames and building a mapping. + +### Detection + +Look in the PBI model's `partitions[].source.expression` for M code containing: + +- `Table.RenameColumns` — explicit column renames +- `Table.SelectColumns` — column selection (implies name preservation) +- Schema parameter patterns — `type table [ColName = type text, ...]` + +### Mapping Construction + +Build the **common two-layer mapping** (`pbi_column -> dbx_column`) by default. Use a **three-layer mapping** only where Power Query M introduces an intermediate rename: + +``` +Two-layer (default): + pbi_column -> dbx_column + +Three-layer (only when M renames are present): + pbi_column -> m_query_column -> dbx_column +``` + +### How to Extract M Renames + +1. Parse the PBI model JSON and locate `partitions` on each table. +2. For each partition with `source.type == "m"`, read `source.expression`. +3. Search for `Table.RenameColumns(...)` calls — the argument is a list of `{old, new}` pairs. +4. 
Search for the `type table [...]` schema definition to find the final column names exposed to the PBI layer. +5. Map backward: PBI column name → M expression column → physical DB column. + +### Example M Expression + +```m +let + Source = Sql.Database("edw-server", "sales_db"), + dbo_Transactions = Source{[Schema="dbo",Item="Transactions"]}[Data], + Renamed = Table.RenameColumns(dbo_Transactions, { + {"TransactionID", "SaleID"}, + {"AmountUSD", "TotalAmount"}, + {"CreatedAt", "OrderDate"} + }), + Selected = Table.SelectColumns(Renamed, {"SaleID", "TotalAmount", "OrderDate", "CustomerID"}) +in + Selected +``` + +### Mapping JSON Format + +**Two-layer (default):** +```json +{ + "mappings": [ + { + "pbi_table": "SalesFact", + "dbx_table": "catalog.gold.sales_transactions", + "columns": [ + {"pbi_column": "SaleID", "dbx_column": "transaction_id"}, + {"pbi_column": "TotalAmount", "dbx_column": "amount_usd"}, + {"pbi_column": "OrderDate", "dbx_column": "created_at"}, + {"pbi_column": "CustomerID", "dbx_column": "customer_id"} + ] + } + ] +} +``` + +**Three-layer (when M renames are relevant):** +```json +{ + "mappings": [ + { + "pbi_table": "SalesFact", + "dbx_table": "catalog.gold.sales_transactions", + "columns": [ + {"pbi_column": "SaleID", "m_query_column": "TransactionID", "dbx_column": "transaction_id"}, + {"pbi_column": "TotalAmount", "m_query_column": "AmountUSD", "dbx_column": "amount_usd"}, + {"pbi_column": "OrderDate", "m_query_column": "CreatedAt", "dbx_column": "created_at"}, + {"pbi_column": "CustomerID", "m_query_column": "CustomerID", "dbx_column": "customer_id"} + ] + } + ] +} +``` + +When `m_query_column` is absent, the mapping is treated as two-layer. 
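The rename extraction described above can be sketched with a regex pass over the partition expressions. This is a simplified sketch — robust M parsing needs a real tokenizer, and it assumes the `Table.RenameColumns` argument is a literal list of `{"old", "new"}` pairs as in the example:

```python
import re

def extract_renames(m_expression: str) -> dict[str, str]:
    """Map PBI-facing (new) column names back to their original source
    columns from a Table.RenameColumns(...) call in Power Query M."""
    renames = {}
    block = re.search(r"Table\.RenameColumns\([^,]+,\s*\{(.*?)\}\s*\)",
                      m_expression, re.DOTALL)
    if block:
        for old, new in re.findall(r'\{"([^"]+)",\s*"([^"]+)"\}', block.group(1)):
            renames[new] = old  # PBI name -> source column name
    return renames
```

Applied to the example expression above, this yields `{"SaleID": "TransactionID", "TotalAmount": "AmountUSD", "OrderDate": "CreatedAt"}`, which supplies the `m_query_column` layer of the three-layer mapping.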
+ +Pass the intermediate mapping file via the `--mapping` flag: + +```bash +python scripts/compare_schemas.py \ + reference/pbi_model.json reference/dbx_schema.json \ + -o reference/schema_comparison.md --json \ + --mapping reference/intermediate_mapping.json +``` + +--- + +## Step 7: Column Gap Analysis + +After schema comparison, review `reference/column_gap_analysis.md` for DBX-only columns (columns present in Databricks but not referenced in the Power BI model). These may be important for: + +- Filters and partitions in reports built outside PBI +- Discriminator columns that determine row subsets +- Audit/metadata columns needed for data governance + +### Discriminator Heuristics + +A column is flagged as a potential discriminator if its name matches any of: + +- Contains `status`, `type`, `category`, `code`, `flag`, `class`, `kind`, `tier`, `level`, `group` +- Starts with `is_`, `has_`, `can_` +- Ends with `_type`, `_status`, `_code`, `_flag`, `_category`, `_class` + +### Column Gap Analysis Output + +`reference/column_gap_analysis.md` contains: +1. Every DBX-only column grouped by table +2. Discriminator flagging with naming heuristics +3. Suggested actions for each flagged column + +```markdown +## Column Gap Analysis + +### Table: catalog.schema.sales_fact +| Column | Data Type | Discriminator? | Suggested Action | +|--------|-----------|----------------|------------------| +| result_type | STRING | Yes | May filter report subsets — run data discovery | +| etl_load_date | TIMESTAMP | No | Audit column — likely not needed in reports | + +### Table: catalog.schema.customer_dim +| Column | Data Type | Discriminator? | Suggested Action | +|--------|-----------|----------------|------------------| +| customer_status | STRING | Yes | May be essential for active/inactive filtering | +``` + +Flag discriminator columns for data discovery queries (Step 10) — see [4-erd-kpi-discovery.md](4-erd-kpi-discovery.md). 
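The discriminator naming heuristics listed above translate directly into a small predicate (a sketch mirroring the listed patterns exactly):

```python
KEYWORDS = ("status", "type", "category", "code", "flag",
            "class", "kind", "tier", "level", "group")
PREFIXES = ("is_", "has_", "can_")
SUFFIXES = ("_type", "_status", "_code", "_flag", "_category", "_class")

def is_discriminator(column_name: str) -> bool:
    """Flag a column as a potential discriminator per the naming heuristics."""
    name = column_name.lower()
    return (any(k in name for k in KEYWORDS)
            or name.startswith(PREFIXES)
            or name.endswith(SUFFIXES))
```

Any column this predicate flags should get the data discovery queries in Step 10 before deciding whether reports filter on it.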
diff --git a/databricks-skills/databricks-powerbi-migration/4-erd-kpi-discovery.md b/databricks-skills/databricks-powerbi-migration/4-erd-kpi-discovery.md
new file mode 100644
index 0000000..7c9e400
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/4-erd-kpi-discovery.md
@@ -0,0 +1,166 @@
+# ERD, KPI Definitions & Data Discovery (Steps 8–10)
+
+Steps 8, 9, and 10 of the migration workflow: generate ERDs, build structured KPI definitions from DAX, and generate data discovery queries.
+
+---
+
+## Step 8: ERD and Domain Modeling
+
+```bash
+python scripts/generate_erd.py reference/pbi_model.json -o reference/
+```
+
+Produces `reference/erd.md` (Mermaid + text ERD) and `reference/domains.md` (domain groupings). **Review with user before proceeding.**
+
+This step can proceed with PBI-only data even if no Databricks schema is available.
+
+---
+
+## Step 9: KPI Definitions
+
+Build structured KPI definitions in `kpi/kpi_definitions.md`. Each KPI includes: business context, DAX formula, SQL equivalent, source table, format, data gaps, and domain.
+
+```bash
+bash scripts/init_project.sh --kpi
+```
+
+### KPI Definition Template
+
+```markdown
+### KPI: <name>
+- **Business Context**: <one-sentence business meaning>
+- **DAX Formula**: `<original DAX expression>`
+- **SQL Equivalent**: `<Databricks SQL expression>`
+- **Source Table**: <catalog.schema.table>
+- **Format**: <display format, e.g. Currency, 2 decimals>
+- **Data Gaps**: <gaps, or "None identified">
+- **Domain**: <domain>
+```
+
+### Organizing KPIs
+
+Group KPIs by domain. 
Within each domain, order by importance (primary KPIs first, derived KPIs after): + +```markdown +# KPI Definitions + +## Domain: Sales + +### KPI: Total Sales +- **Business Context**: Total revenue from all completed sales transactions +- **DAX Formula**: `SUM(SalesFact[TotalAmount])` +- **SQL Equivalent**: `SUM(total_amount)` +- **Source Table**: catalog.gold.sales_fact +- **Format**: Currency, 2 decimals +- **Data Gaps**: None identified +- **Domain**: Sales + +### KPI: Avg Order Value +- **Business Context**: Average revenue per sales transaction +- **DAX Formula**: `DIVIDE([Total Sales], COUNTROWS(SalesFact))` +- **SQL Equivalent**: `SUM(total_amount) / NULLIF(COUNT(1), 0)` +- **Source Table**: catalog.gold.sales_fact +- **Format**: Currency, 2 decimals +- **Data Gaps**: None identified +- **Domain**: Sales + +### KPI: Sales YoY Growth +- **Business Context**: Year-over-year percentage change in total sales +- **DAX Formula**: `DIVIDE([Total Sales] - CALCULATE([Total Sales], SAMEPERIODLASTYEAR(...)), ...)` +- **SQL Equivalent**: Window function with LAG over year partition (see metric view) +- **Source Table**: catalog.gold.sales_fact +- **Format**: Percentage, 1 decimal +- **Data Gaps**: Requires at least 2 years of data for meaningful comparison +- **Domain**: Sales +``` + +### DAX-to-SQL Translation Reference + +| DAX Function | SQL Equivalent | +|---|---| +| `SUM(Sales[Amount])` | `SUM(total_amount)` | +| `COUNTROWS(Orders)` | `COUNT(1)` | +| `DISTINCTCOUNT(Customer[ID])` | `COUNT(DISTINCT customer_id)` | +| `DIVIDE(SUM(...), SUM(...))` | `SUM(...) / NULLIF(SUM(...), 0)` | +| `CALCULATE(SUM(...), Filter)` | Use metric view `filter` or filtered measure expressions | +| `SAMEPERIODLASTYEAR(...)` | Window function with LAG over year partition | + +--- + +## Step 10: Data Discovery Queries + +Auto-generate SQL queries for unknown filter values, value distributions, date ranges, and null rates. Save to `reference/data_discovery_queries.sql`. 
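Generating these queries from the flagged columns can be sketched as follows (the table and column names in the usage are illustrative):

```python
def discovery_queries(table: str, discriminators: list[str],
                      date_cols: list[str]) -> str:
    """Emit distinct-value, distribution, and date-range queries for one
    fully qualified table, matching the templates below."""
    q = []
    for col in discriminators:
        q.append(f"SELECT DISTINCT {col} FROM {table} ORDER BY {col};")
        q.append(f"SELECT {col}, COUNT(*) AS cnt FROM {table} "
                 f"GROUP BY {col} ORDER BY cnt DESC LIMIT 50;")
    for col in date_cols:
        q.append(f"SELECT MIN({col}) AS min_date, MAX({col}) AS max_date "
                 f"FROM {table};")
    return "\n".join(q)
```

For example, `discovery_queries("catalog.gold.sales_fact", ["result_type"], ["order_date"])` emits the three queries shown in the example output file below.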
+
+**Execute directly** when Databricks access is available:
+- **Same catalog** (preferred): Use `execute_sql_multi` to batch all queries (see [9-subagent-patterns.md](9-subagent-patterns.md) Pattern C)
+- **Cross-catalog**: Launch parallel shell subagents per catalog (see [9-subagent-patterns.md](9-subagent-patterns.md) Pattern A)
+
+If no Databricks access, present queries to the user to run manually and ingest results back.
+
+### Query Templates
+
+**For low-cardinality / discriminator columns** (flagged in Step 7):
+
+```sql
+SELECT DISTINCT <column> FROM <catalog>.<schema>.<table> ORDER BY <column>;
+```
+
+**For value distribution:**
+
+```sql
+SELECT <column>, COUNT(*) AS cnt
+FROM <catalog>.<schema>.<table>
+GROUP BY <column>
+ORDER BY cnt DESC
+LIMIT 50;
+```
+
+**For date columns:**
+
+```sql
+SELECT MIN(<date_column>) AS min_date, MAX(<date_column>) AS max_date
+FROM <catalog>.<schema>.<table>;
+```
+
+**For null rate analysis:**
+
+```sql
+SELECT
+  COUNT(*) AS total_rows,
+  COUNT(*) - COUNT(<column>) AS null_count,
+  ROUND((COUNT(*) - COUNT(<column>)) * 100.0 / COUNT(*), 2) AS null_pct
+FROM <catalog>.<schema>.<table>;
+```
+
+### Example Output: reference/data_discovery_queries.sql
+
+```sql
+-- =============================================================
+-- Data Discovery Queries
+-- Generated from column gap analysis
+-- =============================================================
+
+-- 1. Discriminator: sales_fact.result_type
+SELECT DISTINCT result_type
+FROM catalog.gold.sales_fact
+ORDER BY result_type;
+
+SELECT result_type, COUNT(*) AS cnt
+FROM catalog.gold.sales_fact
+GROUP BY result_type
+ORDER BY cnt DESC
+LIMIT 50;
+
+-- 2. Date range: sales_fact.order_date
+SELECT
+  MIN(order_date) AS min_date,
+  MAX(order_date) AS max_date
+FROM catalog.gold.sales_fact;
+
+-- 3. Null rate analysis
+SELECT
+  COUNT(*) AS total_rows,
+  COUNT(*) - COUNT(result_type) AS result_type_nulls,
+  COUNT(*) - COUNT(order_status) AS order_status_nulls
+FROM catalog.gold.sales_fact;
+```
diff --git a/databricks-skills/databricks-powerbi-migration/5-query-optimization.md b/databricks-skills/databricks-powerbi-migration/5-query-optimization.md
new file mode 100644
index 0000000..8065538
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/5-query-optimization.md
@@ -0,0 +1,176 @@
+# Query Access Optimization (Step 11)
+
+Step 11 of the migration workflow: assess each KPI's query complexity and table characteristics to choose the optimal serving strategy.
+
+Produce `reference/query_optimization_plan.md` before building metric views.
+
+---
+
+## Assessment Dimensions
+
+For each KPI or domain, evaluate:
+
+| Factor | How to Assess | Threshold |
+|--------|---------------|-----------|
+| **Table size** | `DESCRIBE DETAIL <table>` — check `sizeInBytes` and `numFiles` | < 100 GB = small, 100 GB–1 TB = medium, > 1 TB = large |
+| **Row count** | `SELECT COUNT(*) FROM <table>` or estimate from DESCRIBE DETAIL | < 100M = small, 100M–1B = medium, > 1B = large |
+| **Join count** | Count the number of tables joined per KPI query | 0–2 = simple, 3–5 = moderate, 6+ = complex |
+| **Aggregation complexity** | Window functions, CASE expressions, nested aggregations | Simple SUM/COUNT = low, window/CASE = medium, nested = high |
+| **Grain mismatch** | Compare fact table grain to report grain | Same grain = no issue, different grain = pre-aggregation needed |
+| **Filter selectivity** | Typical filter narrows result to what % of table? | > 50% = low selectivity, < 10% = high selectivity |
+| **Refresh frequency** | How often does the source data change? | Real-time, hourly, daily, weekly |
+
+**Collect table sizing data** using `DESCRIBE DETAIL <table>` for each candidate table. When assessing multiple tables, batch them with `execute_sql_multi` or use parallel subagents — see [9-subagent-patterns.md](9-subagent-patterns.md) Pattern D.
+
+---
+
+## Query Complexity Scoring
+
+Assign a score to determine the serving strategy:
+
+```
+Score = table_size_score + join_score + aggregation_score + grain_score
+
+table_size_score:  small=0, medium=2, large=4
+join_score:        0-2 joins=0, 3-5=1, 6+=2
+aggregation_score: simple=0, medium=1, high=2
+grain_score:       same=0, different=2
+```
+
+| Total Score | Serving Strategy |
+|-------------|-----------------|
+| 0–2 | Standard metric view (direct on source tables) |
+| 3–5 | Materialized view with scheduled refresh |
+| 6+ | Gold-layer aggregate table + metric view on top |
+
+---
+
+## Strategy 1: Standard Metric View
+
+For simple KPIs on small/medium tables with few joins. The metric view queries source tables directly at runtime.
+
+```sql
+CREATE OR REPLACE VIEW <catalog>.<schema>.sales_metrics
+WITH METRICS LANGUAGE YAML AS $$
+  version: 1.1
+  source: <catalog>.<schema>.sales_fact
+  measures:
+    - name: Total Sales
+      expr: SUM(total_amount)
+$$;
+```
+
+---
+
+## Strategy 2: Materialized View with Incremental Refresh
+
+For complex KPIs or medium/large tables where pre-computing aggregations significantly reduces query time. Databricks automatically manages incremental refresh. 
+
+```sql
+CREATE OR REPLACE MATERIALIZED VIEW <catalog>.<schema>.monthly_sales_agg
+AS
+SELECT
+  date_trunc('month', order_date) AS order_month,
+  region,
+  product_category,
+  SUM(total_amount) AS total_sales,
+  COUNT(1) AS order_count,
+  COUNT(DISTINCT customer_id) AS unique_customers
+FROM <catalog>.<schema>.sales_fact
+GROUP BY ALL;
+
+ALTER MATERIALIZED VIEW <catalog>.<schema>.monthly_sales_agg
+  SCHEDULE CRON '0 2 * * *' AT TIME ZONE 'UTC';
+```
+
+Then point the metric view at the materialized view:
+
+```sql
+CREATE OR REPLACE VIEW <catalog>.<schema>.sales_metrics
+WITH METRICS LANGUAGE YAML AS $$
+  version: 1.1
+  source: <catalog>.<schema>.monthly_sales_agg
+  dimensions:
+    - name: Order Month
+      expr: order_month
+    - name: Region
+      expr: region
+  measures:
+    - name: Total Sales
+      expr: SUM(total_sales)
+    - name: Avg Order Value
+      expr: SUM(total_sales) / NULLIF(SUM(order_count), 0)
+$$;
+```
+
+---
+
+## Strategy 3: Gold-Layer Aggregate Table
+
+For very large tables (> 1 TB) or when the grain mismatch is severe (e.g., transaction-level fact table but report needs monthly aggregates). Build a dedicated gold-layer table with a pipeline for incremental loads.
+
+```sql
+CREATE TABLE <catalog>.<schema>.gold_monthly_sales (
+  order_month DATE,
+  region STRING,
+  product_category STRING,
+  total_sales DECIMAL(18,2),
+  order_count BIGINT,
+  unique_customers BIGINT,
+  _etl_updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
+)
+USING DELTA
+CLUSTER BY (order_month, region);
+```
+
+Use `spark-declarative-pipelines` skill to maintain incremental refresh, then build metric views on top of the gold table. 
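The complexity scoring rules defined earlier in this step can be sketched as a single function:

```python
def complexity_score(size: str, joins: int, aggregation: str,
                     grain_mismatch: bool) -> tuple[int, str]:
    """Score a KPI per the Query Complexity Scoring rules and
    return (score, serving strategy)."""
    score = {"small": 0, "medium": 2, "large": 4}[size]
    score += 0 if joins <= 2 else (1 if joins <= 5 else 2)
    score += {"simple": 0, "medium": 1, "high": 2}[aggregation]
    score += 2 if grain_mismatch else 0
    if score <= 2:
        return score, "standard metric view"
    if score <= 5:
        return score, "materialized view"
    return score, "gold-layer aggregate table"
```

For example, a medium-sized table with four joins and a window function but no grain mismatch scores 2 + 1 + 1 + 0 = 4, landing in the materialized-view band.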
+
+---
+
+## Grain Analysis
+
+When the fact table grain is finer than the report grain, pre-aggregation is essential:
+
+| Fact Table Grain | Report Grain | Action |
+|-----------------|--------------|--------|
+| Transaction-level | Daily | Materialized view with `date_trunc('day', ...)` |
+| Transaction-level | Monthly | Gold-layer aggregate or materialized view |
+| Daily | Monthly | Materialized view with `date_trunc('month', ...)` |
+| Same | Same | Standard metric view (no pre-aggregation) |
+
+---
+
+## Decision Matrix
+
+| Condition | Serving Strategy |
+|-----------|-----------------|
+| Simple aggregation, table < 100 GB, few joins | Standard metric view |
+| Complex joins or expensive aggregation, table 100 GB–1 TB | Materialized view with scheduled refresh |
+| Very large table (> 1 TB) or grain mismatch requiring pre-aggregation | Gold-layer aggregate table + metric view on top |
+| Mixed: some KPIs simple, some complex | Split across strategies per domain |
+
+---
+
+## Output: reference/query_optimization_plan.md
+
+```markdown
+## Query Optimization Plan
+
+### Domain: Sales
+| KPI | Score | Strategy | Rationale |
+|-----|-------|----------|-----------|
+| Total Sales | 2 | Standard metric view | Simple SUM, table < 50 GB |
+| Sales YoY Growth | 5 | Materialized view | Window function over 2 years, 200 GB table |
+| Customer Lifetime Value | 7 | Gold-layer aggregate | 5-table join, 1.2 TB fact table, transaction-to-monthly grain |
+
+### Materialized Views to Create
+1. `monthly_sales_agg` — monthly pre-aggregation for time-series KPIs
+   - Schedule: daily at 2:00 AM UTC
+   - Source: sales_fact (200 GB)
+   - Estimated refresh time: ~15 min
+
+### Gold-Layer Tables to Create
+1. `gold_customer_ltv` — customer lifetime value aggregate
+   - Pipeline: use spark-declarative-pipelines skill
+   - Refresh: daily incremental via SDP
+```
diff --git a/databricks-skills/databricks-powerbi-migration/6-metric-views.md b/databricks-skills/databricks-powerbi-migration/6-metric-views.md
new file mode 100644
index 0000000..d81e0cd
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/6-metric-views.md
@@ -0,0 +1,222 @@
+# Metric Views — Check & Build (Steps 12–13)
+
+Steps 12 and 13 of the migration workflow: detect existing metric views, classify KPIs, and create or update metric views.
+
+---
+
+## Step 12: Check Existing Metric Views
+
+**Before building metric views**, check whether any KPIs already exist in metric views in the target catalog/schema. This prevents duplication and enables incremental updates.
+
+**If Databricks access is available**, discover existing metric views:
+
+```sql
+SELECT table_name, view_definition
+FROM <catalog>.information_schema.views
+WHERE table_schema = '<schema>'
+  AND view_definition LIKE '%WITH METRICS%';
+```
+
+For each discovered metric view, use `manage_metric_views` MCP tool to inspect its measures:
+
+```
+CallMcpTool:
+  server: "user-databricks"
+  toolName: "manage_metric_views"
+  arguments:
+    action: "describe"
+    full_name: "<catalog>.<schema>.<view_name>"
+```
+
+The result includes `measures` (name + expression) and `dimensions` (name + expression). 
+ +### Comparison Logic + +For each KPI from `kpi/kpi_definitions.md`, compare against all measures in existing metric views: + +``` +For each kpi in kpi_definitions: + match = find measure where normalize(measure.name) == normalize(kpi.name) + if no match: + classification = "new" + elif normalize(measure.expr) == normalize(kpi.sql_equivalent): + classification = "exists" + else: + classification = "update" +``` + +**Normalization** for comparison: lowercase, strip whitespace, remove surrounding quotes, normalize whitespace in expressions, remove catalog/schema prefixes for column references. + +### KPI Classification + +| Classification | Condition | Action in Step 13 | +|---------------|-----------|-------------------| +| `new` | Measure name not found in any existing metric view | `CREATE OR REPLACE VIEW ... WITH METRICS` | +| `update` | Measure name exists but SQL expression differs | `ALTER VIEW` or `manage_metric_views` with `action: "alter"` | +| `exists` | Measure name and expression match | Skip — log as "already deployed" in deployment checklist | + +**If no Databricks access**, skip this step and treat all KPIs as `new`. Log in the deployment checklist: "Manual verification recommended — could not check for existing metric views." 
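The comparison pseudocode above can be sketched in Python. The normalization follows the rules listed; the exact regex used to strip catalog/schema prefixes is an illustrative assumption:

```python
import re

def normalize(text):
    """Lowercase, trim, strip surrounding quotes, collapse whitespace,
    and drop catalog/schema prefixes on column references."""
    text = re.sub(r"\s+", " ", text.strip().strip("'\"`").lower())
    # a.b.c or a.b -> c (identifiers only, so decimals are untouched)
    return re.sub(r"\b[a-z_]\w*\.(?:[a-z_]\w*\.)?([a-z_]\w*)\b", r"\1", text)

def classify(kpi, existing_measures):
    """kpi: {"name", "sql_equivalent"}; existing_measures: [{"name", "expr"}, ...]."""
    for measure in existing_measures:
        if normalize(measure["name"]) == normalize(kpi["name"]):
            if normalize(measure["expr"]) == normalize(kpi["sql_equivalent"]):
                return "exists"
            return "update"
    return "new"

existing = [{"name": "Total Sales", "expr": "SUM(sales_fact.total_amount)"}]
print(classify({"name": "total sales", "sql_equivalent": "SUM(total_amount)"}, existing))
# exists
```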
+
+### Output: reference/existing_metric_views.md
+
+```markdown
+## Existing Metric View Analysis
+
+### Discovery
+Found 2 metric views in `analytics.gold`:
+- `sales_metrics` (2 measures: Total Sales, Order Count)
+- `customer_metrics` (2 measures: Customer Count, Repeat Customer Rate)
+
+### KPI Classification
+| KPI Name | Domain | Classification | Existing View | Notes |
+|----------|--------|---------------|---------------|-------|
+| Total Sales | Sales | exists | sales_metrics | Expression matches |
+| Avg Order Value | Sales | new | — | Not in any existing view |
+| Sales YoY Growth | Sales | new | — | Not in any existing view |
+| Customer Count | Customer | exists | customer_metrics | Expression matches |
+| Gross Margin | Finance | new | — | No finance view exists |
+
+### Views to Modify
+- None (no `update` classifications in this run)
+
+### Metric Views to Create or Extend
+- **sales_metrics**: ALTER to add `Avg Order Value` and `Sales YoY Growth`
+- **finance_metrics**: CREATE new view with `Gross Margin`
+
+### Skipped (Already Deployed)
+- Total Sales (in sales_metrics)
+- Customer Count (in customer_metrics)
+```
+
+---
+
+## Step 13: Build or Update Metric Views
+
+Based on the classification from Step 12:
+
+- **New KPIs**: Create metric views with `CREATE OR REPLACE VIEW ... WITH METRICS`
+- **Updated KPIs**: Use `ALTER VIEW` or `manage_metric_views` with `action: "alter"`
+- **Existing KPIs**: Skip — log in deployment checklist as "already deployed"
+
+Create folders on demand:
+
+```bash
+bash scripts/init_project.sh --models
+```
+
+Use the `databricks-metric-views` skill for complete YAML syntax details. 
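The Step 12 classifications can be grouped into the per-view CREATE/ALTER/skip actions that Step 13 applies. A sketch, where the field names and the domain-to-view naming convention (`sales` maps to `sales_metrics`) are assumptions:

```python
from collections import defaultdict

def plan_actions(classified_kpis, existing_views):
    """Group classified KPIs into per-view deployment actions.

    classified_kpis: [{"name", "domain", "classification", "existing_view"}, ...]
    existing_views: set of metric view names already deployed
    """
    plan = defaultdict(lambda: {"action": None, "kpis": []})
    skipped = []
    for kpi in classified_kpis:
        if kpi["classification"] == "exists":
            skipped.append(kpi["name"])  # already deployed, nothing to do
            continue
        # new/update KPIs target their known view, else the domain view
        view = kpi.get("existing_view") or f"{kpi['domain'].lower()}_metrics"
        entry = plan[view]
        entry["action"] = "ALTER" if view in existing_views else "CREATE"
        entry["kpis"].append(kpi["name"])
    return dict(plan), skipped

kpis = [
    {"name": "Total Sales", "domain": "Sales", "classification": "exists", "existing_view": "sales_metrics"},
    {"name": "Avg Order Value", "domain": "Sales", "classification": "new", "existing_view": None},
    {"name": "Gross Margin", "domain": "Finance", "classification": "new", "existing_view": None},
]
plan, skipped = plan_actions(kpis, {"sales_metrics", "customer_metrics"})
# plan: sales_metrics -> ALTER ["Avg Order Value"]; finance_metrics -> CREATE ["Gross Margin"]
```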
+
+### Metric View SQL Template
+
+```sql
+CREATE OR REPLACE VIEW <catalog>.<schema>.domain_metrics
+WITH METRICS LANGUAGE YAML AS $$
+  version: 1.1
+  comment: "Sales KPIs translated from Power BI semantic model"
+  source: <catalog>.<schema>.fact_table
+  dimensions:
+    - name: Dimension Name
+      expr: column_expression
+      comment: "Description"
+  measures:
+    - name: Measure Name
+      expr: AGG_FUNC(column)
+      comment: "DAX: ORIGINAL_FORMULA"
+  joins:
+    - name: dim_table
+      source: <catalog>.<schema>.dim_table
+      on: fact_table.join_key = dim_table.join_key
+$$;
+```
+
+### Full Example: Sales Metrics
+
+```sql
+CREATE OR REPLACE VIEW analytics.gold.sales_metrics
+WITH METRICS LANGUAGE YAML AS $$
+  version: 1.1
+  comment: "Sales KPIs translated from Power BI semantic model"
+  source: analytics.gold.sales_fact
+
+  dimensions:
+    - name: Order Date
+      expr: date_key
+    - name: Customer Segment
+      expr: customer_segment
+    - name: Product Category
+      expr: product_category
+
+  measures:
+    - name: Total Sales
+      expr: SUM(total_amount)
+      comment: "DAX: SUM(SalesFact[TotalAmount])"
+    - name: Order Count
+      expr: COUNT(1)
+    - name: Avg Order Value
+      expr: SUM(total_amount) / NULLIF(COUNT(1), 0)
+      comment: "DAX: DIVIDE([Total Sales], COUNTROWS(SalesFact))"
+    - name: Distinct Customers
+      expr: COUNT(DISTINCT customer_id)
+      comment: "DAX: DISTINCTCOUNT(SalesFact[CustomerID])"
+
+  joins:
+    - name: customer_dim
+      source: analytics.gold.customer_dim
+      on: sales_fact.customer_id = customer_dim.customer_id
+    - name: date_dim
+      source: analytics.gold.date_dim
+      on: sales_fact.date_key = date_dim.date_key
+$$;
+```
+
+### Altering an Existing Metric View
+
+To add new measures to an existing metric view:
+
+```sql
+ALTER VIEW analytics.gold.sales_metrics
+WITH METRICS LANGUAGE YAML AS $$
+  version: 1.1
+  source: analytics.gold.sales_fact
+  dimensions:
+    - name: Order Month
+      expr: date_trunc('month', order_date)
+    - name: Region
+      expr: region
+  measures:
+    - name: Total Sales
+      expr: SUM(total_amount)
+      comment: "DAX: SUM(SalesFact[TotalAmount])"
+    
- name: Avg Order Value + expr: SUM(total_amount) / NULLIF(COUNT(1), 0) + comment: "DAX: DIVIDE([Total Sales], COUNTROWS(SalesFact))" + - name: Sales YoY Growth + expr: (SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY date_trunc('month', order_date))) / NULLIF(LAG(SUM(total_amount)) OVER (ORDER BY date_trunc('month', order_date)), 0) + comment: "DAX: DIVIDE([Total Sales] - CALCULATE([Total Sales], SAMEPERIODLASTYEAR(...)), ...)" +$$; +``` + +### Querying a Metric View + +```sql +SELECT + `Order Date`, + `Customer Segment`, + MEASURE(`Total Sales`) AS total_sales, + MEASURE(`Avg Order Value`) AS avg_order_value, + MEASURE(`Distinct Customers`) AS unique_customers +FROM analytics.gold.sales_metrics +GROUP BY ALL +ORDER BY ALL; +``` + +### Output Files + +Store metric view SQL in `models/metric_views/` — one file per business domain: + +``` +models/metric_views/ +├── sales_metrics.sql +├── finance_metrics.sql +└── customer_metrics.sql +``` diff --git a/databricks-skills/databricks-powerbi-migration/7-report-output.md b/databricks-skills/databricks-powerbi-migration/7-report-output.md new file mode 100644 index 0000000..4ad425d --- /dev/null +++ b/databricks-skills/databricks-powerbi-migration/7-report-output.md @@ -0,0 +1,173 @@ +# Report Analysis & Output Paths (Steps 14–15) + +Steps 14 and 15 of the migration workflow: analyze sample reports and choose between PBI reconnection or Databricks-native output. + +--- + +## Step 14: Sample Report Analysis + +If `input/` contains `.docx`, `.pdf`, `.png`, `.jpg`, `.xlsx`, or `.pptx` files, analyze them to extract KPI names, formatting, chart types, narrative templates, and disclaimers. Produce `reference/report_analysis.md`. 
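When reverse-engineering KPI formats, displayed values such as `$1.2M` or `4,521` can be mapped back to a format spec for the extraction tables below. A regex-based Python sketch; the heuristics and function name are illustrative assumptions, not an exhaustive parser:

```python
import re

def infer_format(display_value):
    """Guess (format, decimal_places) from a value as shown in a report visual."""
    value = display_value.strip()
    # Currency: symbol followed by digits, optional K/M/B scale suffix
    if re.fullmatch(r"[$€£][\d,.]+[KMB]?", value):
        frac = re.search(r"\.(\d+)", value)
        return ("currency", len(frac.group(1)) if frac else 0)
    # Percentage: trailing percent sign
    if value.endswith("%"):
        frac = re.search(r"\.(\d+)", value)
        return ("percent", len(frac.group(1)) if frac else 0)
    # Plain integer, possibly with thousands separators
    if re.fullmatch(r"-?[\d,]+", value):
        return ("integer", 0)
    return ("text", None)

print(infer_format("$1.2M"))
# ('currency', 1)
```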
+ +### What to Extract + +- **KPI names and values** visible in the report +- **Column formatting** (currency symbols, decimal places, date formats) +- **Chart types** (bar, line, pie, table, scorecard) +- **Narrative templates** (dynamic text patterns like "Sales increased by X% compared to...") +- **Disclaimers and footnotes** +- **Branding** (colors, logos, headers/footers) +- **Filter/slicer positions** and default values + +### Output: reference/report_analysis.md + +```markdown +## Report Analysis: + +### KPIs Identified +| KPI | Value (as shown) | Likely Measure | Format | +|-----|------------------|----------------|--------| +| Total Revenue | $1.2M | SUM(revenue) | Currency, 1 decimal | +| Order Count | 4,521 | COUNT(1) | Integer | + +### Visuals +| # | Type | Title | Dimensions | Measures | +|---|------|-------|------------|----------| +| 1 | Scorecard | Key Metrics | — | Total Revenue, Order Count | +| 2 | Line Chart | Monthly Trend | Month | Total Revenue | +| 3 | Table | Top Products | Product | Revenue, Units | + +### Narrative Templates +- "Revenue for {period} was {Total Revenue}, a {YoY Change}% change from the prior year." + +### Disclaimers +- "Data as of {last_refresh_date}. Excludes returns processed after close." +``` + +--- + +## Step 15: Output Path + +After report analysis, choose the output path: + +--- + +### Path A: Power BI Reconnection + +Update Power Query M formulas to use `Databricks.Catalogs()`. Use DirectQuery for facts, Dual for dimensions. Parameterize `ServerHostName`/`HTTPPath`. + +**Power Query M formula pattern:** + +```m +let + Source = Databricks.Catalogs(ServerHostName, HTTPPath, + [Catalog=null, Database=null, EnableAutomaticProxyDiscovery=null]), + catalog = Source{[Name=CatalogName, Kind="Database"]}[Data], + schema = catalog{[Name="gold", Kind="Schema"]}[Data], + metric_view = schema{[Name="sales_metrics", Kind="Table"]}[Data] +in + metric_view +``` + +**Connection parameterization steps:** +1. 
Create `ServerHostName`, `HTTPPath`, and `CatalogName` parameters in Power Query (Type must be **Text**)
+2. Replace hardcoded values in M queries with parameter references for dynamic environments (Dev vs. Prod)
+3. Update parameters via the Power BI Service UI or REST API for CI/CD pipelines
+
+**Performance settings:**
+- Set DirectQuery for fact tables, Dual for dimension tables
+- Enable "Assume Referential Integrity" on relationships where PK/FK constraints are declared with `RELY`
+- Configure query parallelization settings (MaxParallelismPerQuery, max connections per data source)
+- Use the same SQL Warehouse for datasets querying the same data to maximize caching
+
+**Simplify Power BI after reconnection:**
+- Remove redundant calculated columns and measures now handled by metric views
+- "Move left" transformations: prefer SQL views and metric views over Power Query transformations and DAX
+
+---
+
+### Path B: Databricks-Native Report
+
+Read the `databricks-aibi-dashboards` skill. Build a report specification and email template in `planreport/`. 
+
+```bash
+bash scripts/init_project.sh --report
+```
+
+**Report specification template:**
+
+```markdown
+## Report: <Report Name>
+
+### Page 1: Executive Summary
+- **Visual 1**: KPI scorecard (Total Sales, Avg Order Value, Customer Count)
+- **Visual 2**: Monthly trend line (Total Sales by Month)
+- **Visual 3**: Top 10 table (Products by Revenue)
+- **Filters**: Date range, Region, Product Category
+
+### Page 2: Detail View
+- **Visual 1**: Table with drill-through (Order details)
+- **Filters**: All from Page 1 + Customer Segment
+```
+
+**Email distribution template:**
+
+```markdown
+## Email Distribution
+
+- **Recipients**: [list of email addresses or groups]
+- **Schedule**: Weekly, Monday 8:00 AM UTC
+- **Subject**: "{Report Name} - Week of {date}"
+- **Body**: See narrative template from report_spec.md
+- **Attachments**: PDF export of dashboard
+- **Format**: HTML with inline charts
+```
+
+**Deployment configuration:**
+
+```yaml
+report_name: "Sales Weekly Report"
+warehouse_id: "<warehouse-id>"
+schedule:
+  quartz_cron: "0 0 8 ? * MON"
+  timezone: "UTC"
+notifications:
+  on_success:
+    - email: "team@company.com"
+  on_failure:
+    - email: "admin@company.com"
+```
+
+**planreport/ folder structure:**
+
+```
+planreport/
+├── report_spec.md         # Visual layout, chart specs, narrative blocks
+├── email_template.md      # Recipients, schedule, subject, body template
+└── deployment_config.yml  # Job schedule, warehouse, notification targets
+```
+
+Use the `databricks-jobs` skill to schedule delivery.
+
+---
+
+## Validation After Migration (both paths)
+
+### Functional Validation
+- Validate that Power BI visuals return identical results as before (Path A), or that Databricks dashboards match original report values (Path B)
+- Compare DAX vs. 
metric view outputs side-by-side for consistency
+- Test edge cases: null handling, division by zero, date boundary conditions
+
+### Performance Validation
+- Monitor query performance using Databricks **Query Profile** tools
+- Use Power BI **Performance Analyzer** to identify bottleneck visuals (Path A)
+- Adjust caching, aggregation strategies, or SQL Warehouse sizing as needed
+
+### Common Pitfalls
+
+| Issue | What to Verify |
+|-------|----------------|
+| Column name casing | Databricks is case-insensitive but Power BI may expect specific casing |
+| Data type mismatches | Integers from some sources become decimals in Power BI |
+| Relationship loss | PK/FK auto-detection may delete manually created relationships |
+| Large result sets | Visuals pulling thousands of rows indicate inefficient DAX or missing aggregations |
diff --git a/databricks-skills/databricks-powerbi-migration/8-deployment.md b/databricks-skills/databricks-powerbi-migration/8-deployment.md
new file mode 100644
index 0000000..fcb61c8
--- /dev/null
+++ b/databricks-skills/databricks-powerbi-migration/8-deployment.md
@@ -0,0 +1,128 @@
+# Deployment Checklist (Step 16)
+
+Step 16 of the migration workflow: generate the ordered deployment checklist from local artifacts to a running, scheduled report.
+
+Output: `reference/deployment_checklist.md`
+
+---
+
+## Checklist Template
+
+```markdown
+## Deployment Checklist: <Project Name>
+
+**Project**: <project name>
+**Date**: <date>
+**Path**: A (PBI Reconnection) | B (Databricks-Native Report)
+
+### Pre-Deployment
+- [ ] Validate catalog access:
+  ```sql
+  SELECT 1 FROM <catalog>.<schema>.<table> LIMIT 1;
+  ```
+- [ ] Verify all source tables exist and are accessible
+- [ ] Confirm SQL warehouse is running and sized appropriately
+- [ ] Verify user/group has SELECT on all required tables and metric views
+
+### Metric View Deployment
+- [ ] Deploy metric views: run each SQL file in models/metric_views/
+- [ ] Verify metric views:
+  ```sql
+  SELECT MEASURE(`<measure>`) FROM <catalog>.<schema>.<view> LIMIT 10;
+  ```
+- [ ] Grant SELECT to required users/groups:
+  ```sql
+  GRANT SELECT ON VIEW <catalog>.<schema>.<view> TO `<user or group>`;
+  ```
+
+### Path A: Power BI Reconnection
+- [ ] Create parameters: `ServerHostName`, `HTTPPath`, `CatalogName` (Type: Text)
+- [ ] Update Power Query M formulas to use `Databricks.Catalogs()` connector
+- [ ] Set DirectQuery for fact tables, Dual for dimension tables
+- [ ] Enable "Assume Referential Integrity" on all relationships with RELY constraints
+- [ ] Test: verify key KPI values match original report
+- [ ] Publish to Power BI Service
+- [ ] Update stored credentials in Power BI Service
+
+### Path B: Databricks-Native Report
+- [ ] Create AI/BI dashboard from planreport/report_spec.md
+      (read `databricks-aibi-dashboards` skill)
+- [ ] Configure job schedule from planreport/deployment_config.yml
+      (read `databricks-jobs` skill)
+- [ ] Set up email distribution from planreport/email_template.md
+- [ ] Test dashboard rendering and data accuracy
+
+### Post-Deployment
+- [ ] Run validation queries against metric views
+- [ ] Compare output values with original PBI report (5+ key KPIs)
+- [ ] Monitor query performance for 1 week
+- [ ] Document any known gaps or deferred items in reference/validation_notes.md
+- [ ] Share deployment summary with stakeholders
+```
+
+---
+
+## Example: Completed Checklist (Path A)
+
+```markdown
+## Deployment Checklist: Sales Analytics Migration
+
+**Project**: Sales PBI to Databricks
+**Date**: 2026-03-03
+**Path**: A (PBI Reconnection)
+
+### Pre-Deployment
+- [ ] Validate catalog access:
+  ```sql
+  SELECT 1 FROM analytics_catalog.gold.sales_fact 
LIMIT 1; + SELECT 1 FROM analytics_catalog.gold.customer_dim LIMIT 1; + SELECT 1 FROM analytics_catalog.gold.date_dim LIMIT 1; + ``` +- [ ] Verify SQL warehouse `analytics-wh` is running +- [ ] Confirm user has SELECT on `analytics_catalog.gold` + +### Metric View Deployment +- [ ] Run `models/metric_views/sales_metrics.sql` +- [ ] Run `models/metric_views/customer_metrics.sql` +- [ ] Verify: + ```sql + SELECT MEASURE(`Total Sales`) FROM analytics_catalog.gold.sales_metrics LIMIT 10; + SELECT MEASURE(`Customer Count`) FROM analytics_catalog.gold.customer_metrics LIMIT 10; + ``` +- [ ] Grant SELECT to `analysts` group: + ```sql + GRANT SELECT ON VIEW analytics_catalog.gold.sales_metrics TO `analysts`; + ``` + +### Power BI Reconnection +- [ ] Create parameters: `ServerHostName`, `HTTPPath`, `CatalogName` +- [ ] Update M queries to use `Databricks.Catalogs()` connector +- [ ] Set SalesFact to DirectQuery, CustomerDim/DateDim to Dual +- [ ] Enable "Assume Referential Integrity" on all relationships +- [ ] Test: verify Total Sales matches original report value +- [ ] Publish to Power BI Service +- [ ] Update stored credentials in Power BI Service + +### Post-Deployment +- [ ] Compare 5 key KPI values between old and new reports +- [ ] Monitor query performance for 1 week +- [ ] Document any discrepancies in reference/validation_notes.md +- [ ] Share deployment summary with stakeholders +``` + +--- + +## Governance and Automation + +### Governance +- Manage metric versioning and ownership through **Unity Catalog** and audit logs +- Use tags and comments on metric views for discoverability +- Implement row-level security through Unity Catalog grants (replacing Power BI RLS where appropriate) + +### Automation +- Automate metric view deployments with **CI/CD pipelines** using Databricks REST APIs, CLI, or **Databricks Asset Bundles** (read `databricks-asset-bundles` skill) +- Schedule Delta table maintenance (OPTIMIZE, VACUUM) via Databricks Jobs or predictive 
optimization + +### Data Contracts +- Establish a data contract process: new metrics or schema changes must go through metric view updates, not ad-hoc Power BI model edits +- Document metric definitions, owners, and SLAs in the Unity Catalog metadata diff --git a/databricks-skills/databricks-powerbi-migration/9-subagent-patterns.md b/databricks-skills/databricks-powerbi-migration/9-subagent-patterns.md new file mode 100644 index 0000000..65c1957 --- /dev/null +++ b/databricks-skills/databricks-powerbi-migration/9-subagent-patterns.md @@ -0,0 +1,178 @@ +# Subagent Parallelization Patterns + +Use `Task` subagents (`subagent_type="shell"`) to parallelize MCP tool calls when work spans multiple catalogs, schemas, or tables. Each subagent runs independently and returns its results. **Max 4 concurrent subagents.** + +--- + +## When to Use Subagents vs. execute_sql_multi + +| Scenario | Best Tool | Why | +|----------|-----------|-----| +| N queries, **same catalog** | `execute_sql_multi` | Single MCP call, built-in parallelism (up to 4 workers), simpler | +| N queries, **different catalogs** | Parallel shell subagents | Each needs a different catalog context | +| Probing N catalogs for table existence | Parallel shell subagents | Each subagent probes one catalog independently | +| Sizing N tables in same catalog | `execute_sql_multi` | Combine N `DESCRIBE DETAIL` into one call | +| Sizing tables across catalogs | Parallel shell subagents | Different catalog contexts | +| Mixed: some cross-catalog, some same-catalog | Combine both | Subagents for cross-catalog, `execute_sql_multi` within each | + +--- + +## Where Subagents Are Used in This Workflow + +| Step | Action | Approach | +|------|--------|----------| +| Step 3 | Test catalog accessibility across multiple catalogs | Parallel shell subagents | +| Step 4 | Probe multiple candidate catalogs in parallel | Parallel shell subagents | +| Step 5 | Extract schema from multiple catalogs/schemas | Parallel shell subagents | 
+| Step 10 | Run data discovery queries (same catalog) | `execute_sql_multi` (preferred) |
+| Step 11 | Run `DESCRIBE DETAIL` on multiple tables | `execute_sql_multi` (same catalog) or subagents (cross-catalog) |
+
+---
+
+## Pattern A: Parallel Catalog Probing
+
+When Step 3 or 4 identifies multiple candidate catalogs, launch one subagent per catalog to verify which contains the target tables.
+
+**Subagent prompt template** (one per catalog):
+
+```
+Probe catalog "<catalog>" for tables in schema "<schema>" using the Databricks MCP server.
+
+1. Call CallMcpTool with:
+   - server: "user-databricks"
+   - toolName: "execute_sql"
+   - arguments: {"sql_query": "SELECT table_name FROM <catalog>.information_schema.tables WHERE table_schema = '<schema>' ORDER BY table_name"}
+
+2. Return the list of table names found, or state that no tables were found.
+```
+
+Launch up to 4 concurrently:
+
+```
+Task(subagent_type="shell", prompt="