From ceea803c5588fef5909047d39412d85ae942dea1 Mon Sep 17 00:00:00 2001 From: "Jaiwant.Jonathan" Date: Sat, 28 Feb 2026 19:20:25 -0500 Subject: [PATCH] Add databricks-powerbi-migration skill New skill for migrating Power BI semantic models to Databricks metric views. Covers schema assessment, DAX-to-SQL translation, ERD/domain generation, intermediate mapping layers, query optimization, KPI definitions, data discovery, and deployment checklists with a 16-step guided workflow. --- databricks-skills/README.md | 1 + .../databricks-powerbi-migration/EXAMPLES.md | 545 ++++++++++ .../databricks-powerbi-migration/REFERENCE.md | 943 ++++++++++++++++++ .../databricks-powerbi-migration/SKILL.md | 416 ++++++++ .../databricks-powerbi-migration/approach.md | 313 ++++++ .../scripts/compare_schemas.py | 431 ++++++++ .../scripts/extract_dbx_schema.py | 163 +++ .../scripts/generate_erd.py | 355 +++++++ .../scripts/init_project.sh | 146 +++ .../scripts/parse_pbi_model.py | 459 +++++++++ .../scripts/scan_inputs.py | 399 ++++++++ databricks-skills/install_skills.sh | 4 +- 12 files changed, 4174 insertions(+), 1 deletion(-) create mode 100644 databricks-skills/databricks-powerbi-migration/EXAMPLES.md create mode 100644 databricks-skills/databricks-powerbi-migration/REFERENCE.md create mode 100644 databricks-skills/databricks-powerbi-migration/SKILL.md create mode 100644 databricks-skills/databricks-powerbi-migration/approach.md create mode 100644 databricks-skills/databricks-powerbi-migration/scripts/compare_schemas.py create mode 100644 databricks-skills/databricks-powerbi-migration/scripts/extract_dbx_schema.py create mode 100644 databricks-skills/databricks-powerbi-migration/scripts/generate_erd.py create mode 100755 databricks-skills/databricks-powerbi-migration/scripts/init_project.sh create mode 100644 databricks-skills/databricks-powerbi-migration/scripts/parse_pbi_model.py create mode 100644 databricks-skills/databricks-powerbi-migration/scripts/scan_inputs.py diff --git 
a/databricks-skills/README.md b/databricks-skills/README.md index ddc5b08..f65a5ed 100644 --- a/databricks-skills/README.md +++ b/databricks-skills/README.md @@ -51,6 +51,7 @@ cp -r ai-dev-kit/databricks-skills/databricks-agent-bricks .claude/skills/ ### 📊 Analytics & Dashboards - **databricks-aibi-dashboards** - Databricks AI/BI dashboards (with SQL validation workflow) +- **databricks-powerbi-migration** - Power BI to Databricks migration (metric views, DAX-to-SQL, ERD generation, schema mapping) - **databricks-unity-catalog** - System tables for lineage, audit, billing ### 🔧 Data Engineering diff --git a/databricks-skills/databricks-powerbi-migration/EXAMPLES.md b/databricks-skills/databricks-powerbi-migration/EXAMPLES.md new file mode 100644 index 0000000..e9f38a5 --- /dev/null +++ b/databricks-skills/databricks-powerbi-migration/EXAMPLES.md @@ -0,0 +1,545 @@ +# Examples: Input/Output Patterns + +Concrete examples for key skill workflows. Referenced from [SKILL.md](SKILL.md) and [REFERENCE.md](REFERENCE.md). 
+ +--- + +## EDW-to-CDM Intermediate Mapping (Gap 1) + +### Input: Power Query M Expression (from PBI model) + +```m +let + Source = Sql.Database("edw-server", "sales_db"), + dbo_Transactions = Source{[Schema="dbo",Item="Transactions"]}[Data], + Renamed = Table.RenameColumns(dbo_Transactions, { + {"TransactionID", "SaleID"}, + {"AmountUSD", "TotalAmount"}, + {"CreatedAt", "OrderDate"} + }), + Selected = Table.SelectColumns(Renamed, {"SaleID", "TotalAmount", "OrderDate", "CustomerID"}) +in + Selected +``` + +### Output: Two-Layer Mapping (default) + +```json +{ + "mappings": [ + { + "pbi_table": "SalesFact", + "dbx_table": "catalog.gold.sales_transactions", + "columns": [ + {"pbi_column": "SaleID", "dbx_column": "transaction_id"}, + {"pbi_column": "TotalAmount", "dbx_column": "amount_usd"}, + {"pbi_column": "OrderDate", "dbx_column": "created_at"}, + {"pbi_column": "CustomerID", "dbx_column": "customer_id"} + ] + } + ] +} +``` + +### Output: Three-Layer Mapping (when M renames are relevant) + +```json +{ + "mappings": [ + { + "pbi_table": "SalesFact", + "dbx_table": "catalog.gold.sales_transactions", + "columns": [ + {"pbi_column": "SaleID", "m_query_column": "TransactionID", "dbx_column": "transaction_id"}, + {"pbi_column": "TotalAmount", "m_query_column": "AmountUSD", "dbx_column": "amount_usd"}, + {"pbi_column": "OrderDate", "m_query_column": "CreatedAt", "dbx_column": "created_at"}, + {"pbi_column": "CustomerID", "m_query_column": "CustomerID", "dbx_column": "customer_id"} + ] + } + ] +} +``` + +--- + +## KPI Definition Template (Gap 7) + +### Input: Power BI DAX Measures + +```dax +Total Sales = SUM(SalesFact[TotalAmount]) +Avg Order Value = DIVIDE([Total Sales], COUNTROWS(SalesFact)) +Sales YoY Growth = DIVIDE( + [Total Sales] - CALCULATE([Total Sales], SAMEPERIODLASTYEAR(DateDim[Date])), + CALCULATE([Total Sales], SAMEPERIODLASTYEAR(DateDim[Date])) +) +Customer Count = DISTINCTCOUNT(SalesFact[CustomerID]) +``` + +### Output: kpi/kpi_definitions.md + 
+```markdown +# KPI Definitions + +## Domain: Sales + +### KPI: Total Sales +- **Business Context**: Total revenue from all completed sales transactions +- **DAX Formula**: `SUM(SalesFact[TotalAmount])` +- **SQL Equivalent**: `SUM(total_amount)` +- **Source Table**: catalog.gold.sales_fact +- **Format**: Currency, 2 decimals +- **Data Gaps**: None identified +- **Domain**: Sales + +### KPI: Avg Order Value +- **Business Context**: Average revenue per sales transaction +- **DAX Formula**: `DIVIDE([Total Sales], COUNTROWS(SalesFact))` +- **SQL Equivalent**: `SUM(total_amount) / NULLIF(COUNT(1), 0)` +- **Source Table**: catalog.gold.sales_fact +- **Format**: Currency, 2 decimals +- **Data Gaps**: None identified +- **Domain**: Sales + +### KPI: Sales YoY Growth +- **Business Context**: Year-over-year percentage change in total sales +- **DAX Formula**: `DIVIDE([Total Sales] - CALCULATE([Total Sales], SAMEPERIODLASTYEAR(...)), ...)` +- **SQL Equivalent**: Window function with LAG over year partition (see metric view) +- **Source Table**: catalog.gold.sales_fact +- **Format**: Percentage, 1 decimal +- **Data Gaps**: Requires at least 2 years of data for meaningful comparison +- **Domain**: Sales + +### KPI: Customer Count +- **Business Context**: Number of unique customers with at least one transaction +- **DAX Formula**: `DISTINCTCOUNT(SalesFact[CustomerID])` +- **SQL Equivalent**: `COUNT(DISTINCT customer_id)` +- **Source Table**: catalog.gold.sales_fact +- **Format**: Integer +- **Data Gaps**: None identified +- **Domain**: Sales +``` + +--- + +## Data Discovery Queries (Gap 4) + +### Input: Column Gap Analysis identifies discriminator columns + +``` +sales_fact.result_type (flagged as discriminator) +sales_fact.order_status (flagged as discriminator) +sales_fact.order_date (date column) +customer_dim.customer_status (flagged as discriminator) +``` + +### Output: reference/data_discovery_queries.sql + +```sql +-- 
============================================================= +-- Data Discovery Queries +-- Generated from column gap analysis +-- ============================================================= + +-- 1. Discriminator: sales_fact.result_type +SELECT DISTINCT result_type +FROM catalog.gold.sales_fact +ORDER BY result_type; + +SELECT result_type, COUNT(*) AS cnt +FROM catalog.gold.sales_fact +GROUP BY result_type +ORDER BY cnt DESC +LIMIT 50; + +-- 2. Discriminator: sales_fact.order_status +SELECT DISTINCT order_status +FROM catalog.gold.sales_fact +ORDER BY order_status; + +SELECT order_status, COUNT(*) AS cnt +FROM catalog.gold.sales_fact +GROUP BY order_status +ORDER BY cnt DESC +LIMIT 50; + +-- 3. Date range: sales_fact.order_date +SELECT + MIN(order_date) AS min_date, + MAX(order_date) AS max_date +FROM catalog.gold.sales_fact; + +-- 4. Discriminator: customer_dim.customer_status +SELECT DISTINCT customer_status +FROM catalog.gold.customer_dim +ORDER BY customer_status; + +SELECT customer_status, COUNT(*) AS cnt +FROM catalog.gold.customer_dim +GROUP BY customer_status +ORDER BY cnt DESC +LIMIT 50; + +-- 5. 
Null rate analysis for all gap columns +SELECT + COUNT(*) AS total_rows, + COUNT(*) - COUNT(result_type) AS result_type_nulls, + COUNT(*) - COUNT(order_status) AS order_status_nulls +FROM catalog.gold.sales_fact; +``` + +--- + +## Deployment Checklist (Gap 11) + +### Input: Completed project with metric views and Path A chosen + +### Output: reference/deployment_checklist.md + +```markdown +## Deployment Checklist: Sales Analytics Migration + +**Project**: Sales PBI to Databricks +**Date**: 2026-02-26 +**Path**: A (PBI Reconnection) + +### Pre-Deployment +- [ ] Validate catalog access: + ```sql + SELECT 1 FROM analytics_catalog.gold.sales_fact LIMIT 1; + SELECT 1 FROM analytics_catalog.gold.customer_dim LIMIT 1; + SELECT 1 FROM analytics_catalog.gold.date_dim LIMIT 1; + ``` +- [ ] Verify SQL warehouse `analytics-wh` is running +- [ ] Confirm user has SELECT on `analytics_catalog.gold` + +### Metric View Deployment +- [ ] Run `models/metric_views/sales_metrics.sql` +- [ ] Run `models/metric_views/customer_metrics.sql` +- [ ] Verify: + ```sql + SELECT MEASURE(`Total Sales`) FROM analytics_catalog.gold.sales_metrics LIMIT 10; + SELECT MEASURE(`Customer Count`) FROM analytics_catalog.gold.customer_metrics LIMIT 10; + ``` +- [ ] Grant SELECT to `analysts` group: + ```sql + GRANT SELECT ON VIEW analytics_catalog.gold.sales_metrics TO `analysts`; + ``` + +### Power BI Reconnection +- [ ] Create parameters: `ServerHostName`, `HTTPPath`, `CatalogName` +- [ ] Update M queries to use `Databricks.Catalogs()` connector +- [ ] Set SalesFact to DirectQuery, CustomerDim/DateDim to Dual +- [ ] Enable "Assume Referential Integrity" on all relationships +- [ ] Test: verify Total Sales matches original report value +- [ ] Publish to Power BI Service +- [ ] Update stored credentials in Power BI Service + +### Post-Deployment +- [ ] Compare 5 key KPI values between old and new reports +- [ ] Monitor query performance for 1 week +- [ ] Document any discrepancies in 
reference/validation_notes.md +- [ ] Share deployment summary with stakeholders +``` + +--- + +## CSV Schema Dump (Gap 2) + +### Input: CSV file with INFORMATION_SCHEMA-style headers + +```csv +table_name,column_name,data_type,is_nullable,comment +sales_fact,sale_id,BIGINT,NO,Primary key +sales_fact,customer_id,BIGINT,NO,FK to customer_dim +sales_fact,order_date,DATE,NO,Date of order +sales_fact,total_amount,DECIMAL(18,2),YES,Order total in USD +customer_dim,customer_id,BIGINT,NO,Primary key +customer_dim,customer_name,STRING,YES,Full name +customer_dim,customer_status,STRING,YES,Active/Inactive +``` + +### Scanner Output + +```json +{ + "path": "input/schema_export.csv", + "name": "schema_export.csv", + "type": "csv_schema_dump", + "details": "Schema dump: 2 tables, 7 columns" +} +``` + +The agent should parse this CSV and construct a schema representation equivalent to `extract_dbx_schema.py` output for use in schema comparison. + +--- + +## Catalog Resolution (Gap 5) + +### Input: User provides catalog name "analytics" + +### Agent probes + +```sql +SELECT catalog_name FROM system.information_schema.catalogs ORDER BY catalog_name; +-- Result: analytics, fc_analytics, hive_metastore, system +``` + +### Output: reference/catalog_resolution.md + +```markdown +## Catalog Resolution + +- **Primary catalog**: `analytics` +- **Fallback catalog**: `fc_analytics` +- **Target schema**: `gold` + +### Verification +| Table | Found In | Schema | +|-------|----------|--------| +| sales_fact | analytics | gold | +| customer_dim | analytics | gold | +| date_dim | analytics | gold | +| product_dim | fc_analytics | gold | + +**Note**: `product_dim` found only in `fc_analytics`. Verify if this is the correct source. +``` + +--- + +## Parallel Catalog Probing with Shell Subagents (Gap 13) + +### Input: Catalog list from `system.information_schema.catalogs` + +``` +analytics, fc_analytics, hive_metastore, system +``` + +Target schema: `gold`. 
PBI model references tables: `sales_fact`, `customer_dim`, `date_dim`, `product_dim`. + +### Agent launches 3 parallel shell subagents + +``` +Task(subagent_type="shell", description="Probe analytics catalog", + prompt='Probe catalog "analytics" for tables in schema "gold" using the Databricks MCP server. + Call CallMcpTool with server="user-databricks", toolName="execute_sql", + arguments={"sql_query": "SELECT table_name FROM analytics.information_schema.tables WHERE table_schema = \'gold\' ORDER BY table_name"}. + Return the list of table names found.') + +Task(subagent_type="shell", description="Probe fc_analytics catalog", + prompt='Probe catalog "fc_analytics" for tables in schema "gold" using the Databricks MCP server. + Call CallMcpTool with server="user-databricks", toolName="execute_sql", + arguments={"sql_query": "SELECT table_name FROM fc_analytics.information_schema.tables WHERE table_schema = \'gold\' ORDER BY table_name"}. + Return the list of table names found.') + +Task(subagent_type="shell", description="Probe hive_metastore catalog", + prompt='Probe catalog "hive_metastore" for tables in schema "gold" using the Databricks MCP server. + Call CallMcpTool with server="user-databricks", toolName="execute_sql", + arguments={"sql_query": "SELECT table_name FROM hive_metastore.information_schema.tables WHERE table_schema = \'gold\' ORDER BY table_name"}. 
+ Return the list of table names found.') +``` + +### Merged output: reference/catalog_resolution.md + +```markdown +## Catalog Resolution + +- **Primary catalog**: `analytics` +- **Fallback catalog**: `fc_analytics` +- **Target schema**: `gold` + +### Table Inventory +| Table | Catalog | Schema | +|-------|---------|--------| +| sales_fact | analytics | gold | +| customer_dim | analytics | gold | +| date_dim | analytics | gold | +| product_dim | fc_analytics | gold | +``` + +--- + +## Batch Data Discovery with execute_sql_multi (Gap 13) + +### Input: Data discovery queries from Step 9 + +All queries target the same catalog (`analytics.gold`). + +### Agent calls execute_sql_multi + +``` +CallMcpTool: + server: "user-databricks" + toolName: "execute_sql_multi" + arguments: + sql_content: | + -- Discriminator: sales_fact.result_type + SELECT DISTINCT result_type FROM analytics.gold.sales_fact ORDER BY result_type; + + -- Distribution: sales_fact.result_type + SELECT result_type, COUNT(*) AS cnt FROM analytics.gold.sales_fact GROUP BY result_type ORDER BY cnt DESC LIMIT 50; + + -- Date range: sales_fact.order_date + SELECT MIN(order_date) AS min_date, MAX(order_date) AS max_date FROM analytics.gold.sales_fact; + + -- Discriminator: customer_dim.customer_status + SELECT DISTINCT customer_status FROM analytics.gold.customer_dim ORDER BY customer_status; + + -- Null rate analysis + SELECT COUNT(*) AS total_rows, COUNT(*) - COUNT(result_type) AS result_type_nulls, COUNT(*) - COUNT(order_status) AS order_status_nulls FROM analytics.gold.sales_fact; + catalog: "analytics" + schema: "gold" + max_workers: 4 +``` + +### Output: Execution summary + +The tool returns results per statement, with an execution summary showing which queries ran in parallel and their individual timings. Results are ingested back into the analysis for column gap resolution and KPI data gap documentation. 
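Conceptually, a batch runner like `execute_sql_multi` splits the `sql_content` into individual statements and fans them out to a worker pool. A minimal sketch of that behavior, assuming a hypothetical `run_sql` stand-in for the MCP `execute_sql` call (not the actual tool implementation; the naive semicolon split would also break on semicolons inside string literals):

```python
from concurrent.futures import ThreadPoolExecutor

def run_sql(sql: str) -> str:
    """Hypothetical stand-in for the MCP execute_sql tool."""
    return f"ok: {sql.splitlines()[0]}"

def split_statements(sql_content: str) -> list[str]:
    # Split on semicolons; drop comment-only and empty fragments.
    statements = []
    for fragment in sql_content.split(";"):
        lines = [l for l in fragment.splitlines()
                 if l.strip() and not l.strip().startswith("--")]
        if lines:
            statements.append("\n".join(lines).strip())
    return statements

def execute_sql_multi(sql_content: str, max_workers: int = 4) -> list[str]:
    stmts = split_statements(sql_content)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_sql, stmts))

batch = """
-- Discriminator: sales_fact.result_type
SELECT DISTINCT result_type FROM analytics.gold.sales_fact;
-- Date range: sales_fact.order_date
SELECT MIN(order_date), MAX(order_date) FROM analytics.gold.sales_fact;
"""
print(len(execute_sql_multi(batch)))  # 2 statements executed
```

The per-statement results come back in submission order, which is what lets the tool report individual timings in its execution summary.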
+ +--- + +## Existing Metric View Detection (Gap 15) + +### Input: KPIs defined in Step 9 + +```markdown +## Domain: Sales +- Total Sales: SUM(total_amount) +- Avg Order Value: SUM(total_amount) / NULLIF(COUNT(1), 0) +- Sales YoY Growth: (window function with LAG) +- Customer Count: COUNT(DISTINCT customer_id) + +## Domain: Finance +- Gross Margin: (SUM(revenue) - SUM(cost)) / NULLIF(SUM(revenue), 0) +``` + +### Step 1: Discover existing metric views + +```sql +SELECT table_name, view_definition +FROM analytics.information_schema.views +WHERE table_schema = 'gold' + AND view_definition LIKE '%WITH METRICS%'; +``` + +Result: + +| table_name | view_definition | +|-----------|-----------------| +| sales_metrics | CREATE VIEW ... WITH METRICS LANGUAGE YAML AS $$ ... $$ | +| customer_metrics | CREATE VIEW ... WITH METRICS LANGUAGE YAML AS $$ ... $$ | + +### Step 2: Inspect each metric view + +``` +CallMcpTool: + server: "user-databricks" + toolName: "manage_metric_views" + arguments: + action: "describe" + full_name: "analytics.gold.sales_metrics" +``` + +Result shows measures: +- `Total Sales`: `SUM(total_amount)` +- `Order Count`: `COUNT(1)` + +``` +CallMcpTool: + server: "user-databricks" + toolName: "manage_metric_views" + arguments: + action: "describe" + full_name: "analytics.gold.customer_metrics" +``` + +Result shows measures: +- `Customer Count`: `COUNT(DISTINCT customer_id)` +- `Repeat Customer Rate`: `COUNT(DISTINCT CASE WHEN order_count > 1 THEN customer_id END) / NULLIF(COUNT(DISTINCT customer_id), 0)` + +### Step 3: Compare and classify + +| KPI Name | Domain | Classification | Existing View | Notes | +|----------|--------|---------------|---------------|-------| +| Total Sales | Sales | exists | sales_metrics | `SUM(total_amount)` matches | +| Avg Order Value | Sales | new | — | Not found in any view | +| Sales YoY Growth | Sales | new | — | Not found in any view | +| Customer Count | Customer | exists | customer_metrics | `COUNT(DISTINCT customer_id)` 
matches | +| Gross Margin | Finance | new | — | No finance_metrics view found | + +### Output: reference/existing_metric_views.md + +```markdown +## Existing Metric View Analysis + +### Discovery +Found 2 metric views in `analytics.gold`: +- `sales_metrics` (2 measures: Total Sales, Order Count) +- `customer_metrics` (2 measures: Customer Count, Repeat Customer Rate) + +### KPI Classification + +| KPI Name | Domain | Classification | Existing View | Notes | +|----------|--------|---------------|---------------|-------| +| Total Sales | Sales | exists | sales_metrics | Expression matches | +| Avg Order Value | Sales | new | — | Not in any existing view | +| Sales YoY Growth | Sales | new | — | Not in any existing view | +| Customer Count | Customer | exists | customer_metrics | Expression matches | +| Gross Margin | Finance | new | — | No finance view exists | + +### Views to Modify +- **sales_metrics**: ALTER to add `Avg Order Value` and `Sales YoY Growth` + (same source table as existing `Total Sales`, so extend the existing view) + +### New Metric Views to Create +- **finance_metrics**: CREATE new view with `Gross Margin` + +### Skipped (Already Deployed) +- Total Sales (in sales_metrics) +- Customer Count (in customer_metrics) +``` + +### Step 13 actions based on classification + +```sql +-- Extend existing sales_metrics with new measures +ALTER VIEW analytics.gold.sales_metrics +WITH METRICS LANGUAGE YAML AS $$ + version: 1.1 + source: analytics.gold.sales_fact + dimensions: + - name: Order Month + expr: date_trunc('month', order_date) + - name: Region + expr: region + measures: + - name: Total Sales + expr: SUM(total_amount) + comment: "DAX: SUM(SalesFact[TotalAmount])" + - name: Avg Order Value + expr: SUM(total_amount) / NULLIF(COUNT(1), 0) + comment: "DAX: DIVIDE([Total Sales], COUNTROWS(SalesFact))" + - name: Sales YoY Growth + expr: (SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY date_trunc('month',
order_date))) / NULLIF(LAG(SUM(total_amount)) OVER (ORDER BY date_trunc('month', order_date)), 0) + comment: "DAX: DIVIDE([Total Sales] - CALCULATE([Total Sales], SAMEPERIODLASTYEAR(...)), ...)" +$$; + +-- Create new finance_metrics view +CREATE OR REPLACE VIEW analytics.gold.finance_metrics +WITH METRICS LANGUAGE YAML AS $$ + version: 1.1 + source: analytics.gold.revenue_fact + dimensions: + - name: Period + expr: date_trunc('month', revenue_date) + - name: Business Unit + expr: business_unit + measures: + - name: Gross Margin + expr: (SUM(revenue) - SUM(cost)) / NULLIF(SUM(revenue), 0) + comment: "DAX: DIVIDE(SUM(Revenue) - SUM(Cost), SUM(Revenue))" +$$; + +-- Customer Count: SKIP (already deployed in customer_metrics with matching expression) +``` diff --git a/databricks-skills/databricks-powerbi-migration/REFERENCE.md b/databricks-skills/databricks-powerbi-migration/REFERENCE.md new file mode 100644 index 0000000..2a28ddd --- /dev/null +++ b/databricks-skills/databricks-powerbi-migration/REFERENCE.md @@ -0,0 +1,943 @@ +# Reference: Detailed Patterns + +This document provides detailed patterns for each gap identified during real-world testing of the PowerBI-to-Databricks skill. Each section is referenced from [SKILL.md](SKILL.md) workflow steps. + +--- + +## Gap 1: Intermediate Mapping Layer (Scenario D) + +When Power Query M expressions rename columns between the PBI semantic layer and the physical database, a direct name comparison fails. Scenario D handles this by extracting the renames and building a mapping. + +### Detection + +Look in the PBI model's `partitions[].source.expression` for M code containing: + +- `Table.RenameColumns` -- explicit column renames +- `Table.SelectColumns` -- column selection (implies name preservation) +- Schema parameter patterns -- `type table [ColName = type text, ...]` + +### Mapping Construction + +Build the **common two-layer mapping** (`pbi_column -> dbx_column`) by default. 
Use a **three-layer mapping** only where Power Query M introduces an intermediate rename: + +``` +Two-layer (default): + pbi_column -> dbx_column + +Three-layer (only when M renames are present): + pbi_column -> m_query_column -> dbx_column +``` + +### How to Extract M Renames + +1. Parse the PBI model JSON and locate `partitions` on each table. +2. For each partition with `source.type == "m"`, read `source.expression`. +3. Search for `Table.RenameColumns(...)` calls -- the argument is a list of `{old, new}` pairs. +4. Search for the `type table [...]` schema definition to find the final column names exposed to the PBI layer. +5. Map backward: PBI column name -> M expression column -> physical DB column. + +### Usage with compare_schemas.py + +Pass the intermediate mapping file via the `--mapping` flag: + +```bash +python scripts/compare_schemas.py \ + reference/pbi_model.json reference/dbx_schema.json \ + -o reference/schema_comparison.md --json \ + --mapping reference/intermediate_mapping.json +``` + +The mapping JSON format: + +```json +{ + "mappings": [ + { + "pbi_table": "SalesFact", + "dbx_table": "catalog.schema.sales_transactions", + "columns": [ + { + "pbi_column": "SaleID", + "m_query_column": "transaction_id", + "dbx_column": "transaction_id" + }, + { + "pbi_column": "TotalAmount", + "dbx_column": "amount_usd" + } + ] + } + ] +} +``` + +When `m_query_column` is absent, the mapping is treated as two-layer. + +--- + +## Gap 2: CSV Schema Dump Detection + +CSV files exported from `INFORMATION_SCHEMA` queries or database documentation tools often contain schema metadata. The scanner should detect these and treat them as equivalent to schema query output. 
+ +### Detection Criteria + +A CSV file is classified as `csv_schema_dump` when its header row contains columns matching these patterns (case-insensitive, allowing underscores, spaces, or camelCase): + +- `table_name` / `tableName` / `TABLE_NAME` +- `column_name` / `columnName` / `COLUMN_NAME` +- `data_type` / `dataType` / `DATA_TYPE` + +At least `table_name` and `column_name` must be present. `data_type` is strongly expected but not strictly required. + +### Agent Behavior + +When a `csv_schema_dump` is detected: + +1. Parse the CSV to extract table names, column names, and data types. +2. Build a schema representation equivalent to `extract_dbx_schema.py` output. +3. Use this schema for comparison in Step 6. + +--- + +## Gap 3: Databricks-Only Column Gap Detection + +After schema comparison, DBX-only columns (columns present in Databricks but not referenced in the Power BI model) may be important for: + +- Filters and partitions in reports built outside PBI +- Discriminator columns that determine row subsets +- Audit/metadata columns needed for data governance + +### Column Gap Analysis Output + +The `compare_schemas.py` script produces `reference/column_gap_analysis.md` with: + +1. Every DBX-only column grouped by table +2. Discriminator flagging for columns that appear to be low-cardinality (naming heuristics: `status`, `type`, `category`, `result_type`, `is_*`, `flag_*`, `*_code`) +3. Suggested actions for each flagged column + +### Discriminator Heuristics + +A column is flagged as a potential discriminator if its name matches any of: + +- Contains `status`, `type`, `category`, `code`, `flag`, `class`, `kind`, `tier`, `level`, `group` +- Starts with `is_`, `has_`, `can_` +- Ends with `_type`, `_status`, `_code`, `_flag`, `_category`, `_class` + +### Output Format + +```markdown +## Column Gap Analysis + +### Table: catalog.schema.sales_fact +| Column | Data Type | Discriminator? 
| Suggested Action | +|--------|-----------|----------------|------------------| +| result_type | STRING | Yes | May filter report subsets -- run data discovery | +| etl_load_date | TIMESTAMP | No | Audit column -- likely not needed in reports | + +### Table: catalog.schema.customer_dim +| Column | Data Type | Discriminator? | Suggested Action | +|--------|-----------|----------------|------------------| +| customer_status | STRING | Yes | May be essential for active/inactive filtering | +``` + +--- + +## Gap 4: Data Discovery Query Generation + +After schema comparison and column gap analysis, auto-generate SQL queries to understand data values, distributions, and ranges. + +### Query Templates + +For **low-cardinality / discriminator columns**: + +```sql +SELECT DISTINCT <column> FROM <catalog>.<schema>.<table> ORDER BY <column>; +``` + +For **value distribution**: + +```sql +SELECT <column>, COUNT(*) AS cnt +FROM <catalog>.<schema>.<table>
+GROUP BY <column> +ORDER BY cnt DESC +LIMIT 50; +``` + +For **date columns**: + +```sql +SELECT MIN(<date_column>) AS min_date, MAX(<date_column>) AS max_date +FROM <catalog>.<schema>.<table>
; +``` + +For **null rate analysis**: + +```sql +SELECT + COUNT(*) AS total_rows, + COUNT(*) - COUNT(<column>) AS null_count, + ROUND((COUNT(*) - COUNT(<column>)) * 100.0 / COUNT(*), 2) AS null_pct +FROM <catalog>.<schema>.<table>
; +``` + +### Output + +Save all generated queries to `reference/data_discovery_queries.sql`. The agent can: + +1. Run queries via MCP `execute_sql` if available +2. Present queries to the user to run manually +3. Ingest results back into the analysis + +--- + +## Gap 5: Catalog Resolution Strategy + +In multi-catalog environments, the agent must determine which catalog and schema contain the target tables. + +### Resolution Steps + +1. **Probe available catalogs**: + ```sql + SELECT catalog_name FROM system.information_schema.catalogs ORDER BY catalog_name; + ``` + +2. **Probe schemas within the target catalog**: + ```sql + SELECT schema_name FROM <catalog>.information_schema.schemata; + ``` + +3. **Verify table existence**: + ```sql + SELECT table_name + FROM <catalog>.information_schema.tables + WHERE table_schema = '<schema>'; + ``` + +### Handling fc_ Prefix + +Some environments prefix catalog names with `fc_`. The agent should: + +1. Try the catalog name as provided +2. If not found, try with `fc_` prefix +3. If not found, try without `fc_` prefix +4. Document both primary and fallback catalog in `reference/catalog_resolution.md` + +### Parallel Probing with Subagents + +When the catalog list contains multiple candidates, probe them concurrently using parallel `shell` subagents (see Gap 13 Pattern A). Each subagent calls `execute_sql` via the `user-databricks` MCP server to check table existence in its assigned catalog. This reduces catalog resolution from serial (N sequential queries) to parallel (all catalogs probed at once, max 4 concurrent). + +### Output + +Produce `reference/catalog_resolution.md`: + +```markdown +## Catalog Resolution + +- **Primary catalog**: `my_catalog` +- **Fallback catalog**: `fc_my_catalog` (if applicable) +- **Target schema**: `gold` +- **Tables found**: 15 (listed below) +- **Tables missing**: 2 (listed below) + +### Table Inventory +| Table | Catalog | Schema | Row Count (est.)
| +|-------|---------|--------|------------------| +| sales_fact | my_catalog | gold | ~10M | +``` + +--- + +## Gap 6: Report Replication Workflow (Path B) + +When the goal is to replace Power BI reports with Databricks-native reports rather than reconnecting PBI: + +### Workflow + +1. Read the `databricks-aibi-dashboards` skill for dashboard creation patterns +2. Build a report specification from the PBI model's visual layout (pages, visuals, filters) +3. Generate `planreport/report_spec.md` with: + - Summary tables (KPI scorecards) + - Trend charts (time series by dimension) + - Narrative text blocks (dynamic text with metric values) + - Disclaimers and footnotes +4. Generate `planreport/email_template.md` for distribution specs +5. Use the `databricks-jobs` skill to schedule delivery + +### Report Specification Template + +```markdown +## Report: <Report Name> + +### Page 1: Executive Summary +- **Visual 1**: KPI scorecard (Total Sales, Avg Order Value, Customer Count) +- **Visual 2**: Monthly trend line (Total Sales by Month) +- **Visual 3**: Top 10 table (Products by Revenue) +- **Filters**: Date range, Region, Product Category + +### Page 2: Detail View +- **Visual 1**: Table with drill-through (Order details) +- **Filters**: All from Page 1 + Customer Segment +``` + +--- + +## Gap 7: KPI Definitions as First-Class Artifact + +KPI definitions should be structured, not informal. The agent produces `kpi/kpi_definitions.md` with a standardized template per KPI. + +### Template + +```markdown +### KPI: <KPI Name> +- **Business Context**: <what this KPI measures and why it matters> +- **DAX Formula**: `<original DAX expression>` +- **SQL Equivalent**: `<translated SQL expression>` +- **Source Table**: <catalog.schema.table> +- **Format**: <display format, e.g. Currency, 2 decimals> +- **Data Gaps**: <known gaps, or "None identified"> +- **Domain**: <business domain> +``` + +### Organizing KPIs + +Group KPIs by domain. Within each domain, order by importance (primary KPIs first, derived KPIs after). + +```markdown +# KPI Definitions + +## Domain: Sales +### KPI: Total Sales +... +### KPI: Avg Order Value +... + +## Domain: Finance +### KPI: Gross Margin +...
+``` + +--- + +## Gap 8: Sample Report / Document Analysis + +When `input/` contains sample reports or documents (`.docx`, `.pdf`, `.png`, `.jpg`, `.xlsx`, `.pptx`), the agent should reverse-engineer the report's structure. + +### What to Extract + +- **KPI names and values** visible in the report +- **Column formatting** (currency symbols, decimal places, date formats) +- **Chart types** (bar, line, pie, table, scorecard) +- **Narrative templates** (dynamic text patterns like "Sales increased by X% compared to...") +- **Disclaimers and footnotes** +- **Branding** (colors, logos, headers/footers) +- **Filter/slicer positions** and default values + +### Output + +Produce `reference/report_analysis.md`: + +```markdown +## Report Analysis: <Report Name> + +### KPIs Identified +| KPI | Value (as shown) | Likely Measure | Format | +|-----|-------------------|----------------|--------| +| Total Revenue | $1.2M | SUM(revenue) | Currency, 1 decimal | + +### Visuals +| # | Type | Title | Dimensions | Measures | +|---|------|-------|------------|----------| +| 1 | Scorecard | Key Metrics | - | Total Revenue, Order Count | +| 2 | Line Chart | Monthly Trend | Month | Total Revenue | + +### Narrative Templates +- "Revenue for {period} was {Total Revenue}, a {YoY Change}% change from the prior year." + +### Disclaimers +- "Data as of {last_refresh_date}. Excludes returns processed after close." +``` + +--- + +## Gap 9: Cross-Schema and INFORMATION_SCHEMA Probing + +In multi-schema and multi-catalog environments, extend schema queries to cover all relevant schemas.
+ +### Cross-Schema Column Comparison + +```sql +SELECT table_schema, table_name, column_name, data_type +FROM <catalog>.information_schema.columns +WHERE table_schema IN ('<schema1>', '<schema2>') +ORDER BY table_schema, table_name, ordinal_position; +``` + +### Discover All Schemas in a Catalog + +```sql +SELECT schema_name FROM <catalog>.information_schema.schemata; +``` + +### Discover All Catalogs + +```sql +SELECT catalog_name FROM system.information_schema.catalogs; +``` + +### When to Use Cross-Schema Probing + +- PBI model references tables from multiple schemas +- Table names exist in multiple schemas (need to disambiguate) +- Migration involves consolidating schemas + +--- + +## Gap 10: Report Distribution Artifacts + +The `planreport/` folder contains all artifacts needed for Databricks-native report delivery. + +### Folder Structure + +``` +planreport/ +├── report_spec.md # Visual layout, chart specs, narrative blocks +├── email_template.md # Recipients, schedule, subject, body template +└── deployment_config.yml # Job schedule, warehouse, notification targets +``` + +### Email Template + +```markdown +## Email Distribution + +- **Recipients**: [list of email addresses or groups] +- **Schedule**: Weekly, Monday 8:00 AM UTC +- **Subject**: "{Report Name} - Week of {date}" +- **Body**: See narrative template from report_spec.md +- **Attachments**: PDF export of dashboard +- **Format**: HTML with inline charts +``` + +### Deployment Configuration + +```yaml +report_name: "Sales Weekly Report" +warehouse_id: "<warehouse_id>" +schedule: + quartz_cron: "0 0 8 ? * MON" + timezone: "UTC" +notifications: + on_success: + - email: "team@company.com" + on_failure: + - email: "admin@company.com" +``` + +--- + +## Gap 11: Deployment Checklist Generation + +The final artifact is an ordered checklist of steps to go from local artifacts to a running, scheduled report. + +### Checklist Template + +```markdown +## Deployment Checklist + +### Pre-Deployment +- [ ] Validate catalog access: `SELECT 1 FROM <catalog>.<schema>.<table>
LIMIT 1` +- [ ] Verify all source tables exist and are accessible +- [ ] Confirm SQL warehouse is running and sized appropriately + +### Metric View Deployment +- [ ] Deploy metric views: run each SQL file in models/metric_views/ +- [ ] Verify metric views: `SELECT MEASURE(...) FROM LIMIT 10` +- [ ] Grant SELECT to required users/groups + +### Report Deployment (choose path) +#### Path A: Power BI Reconnection +- [ ] Update Power Query M formulas to use Databricks connector +- [ ] Parameterize ServerHostName and HTTPPath +- [ ] Set DirectQuery for facts, Dual for dimensions +- [ ] Set "Assume Referential Integrity" on relationships +- [ ] Test all report pages for data accuracy +- [ ] Publish to Power BI Service +- [ ] Update stored credentials in Power BI Service + +#### Path B: Databricks-Native Report +- [ ] Create AI/BI dashboard from report_spec.md +- [ ] Configure job schedule from deployment_config.yml +- [ ] Set up email distribution from email_template.md +- [ ] Test dashboard rendering and data accuracy + +### Post-Deployment +- [ ] Run validation queries against metric views +- [ ] Compare output values with original PBI report +- [ ] Share deployment summary with stakeholders +- [ ] Document any known gaps or deferred items +``` + +--- + +## Gap 12: Query Access Optimization + +Before constructing metric views, assess query complexity, table size, grain, and access patterns to choose the optimal serving strategy. This prevents building metric views that are too slow for interactive use or unnecessarily expensive. + +### Assessment Criteria + +For each KPI or domain, evaluate: + +| Factor | How to Assess | Threshold | +|--------|---------------|-----------| +| **Table size** | `DESCRIBE DETAIL
` -- check `sizeInBytes` and `numFiles` | < 100 GB = small, 100 GB - 1 TB = medium, > 1 TB = large | +| **Row count** | `SELECT COUNT(*) FROM
` or estimate from DESCRIBE DETAIL | < 100M = small, 100M - 1B = medium, > 1B = large | +| **Join count** | Count the number of tables joined per KPI query | 0-2 = simple, 3-5 = moderate, 6+ = complex | +| **Aggregation complexity** | Window functions, CASE expressions, nested aggregations | Simple SUM/COUNT = low, window/CASE = medium, nested = high | +| **Grain mismatch** | Compare fact table grain to report grain | Same grain = no issue, different grain = pre-aggregation needed | +| **Filter selectivity** | Typical filter narrows result to what % of table? | > 50% = low selectivity, < 10% = high selectivity | +| **Refresh frequency** | How often does the source data change? | Real-time, hourly, daily, weekly | + +### Query Complexity Scoring + +Assign a score to determine the serving strategy: + +``` +Score = table_size_score + join_score + aggregation_score + grain_score + +table_size_score: small=0, medium=2, large=4 +join_score: 0-2 joins=0, 3-5=1, 6+=2 +aggregation_score: simple=0, medium=1, high=2 +grain_score: same=0, different=2 +``` + +| Total Score | Serving Strategy | +|-------------|-----------------| +| 0-2 | Standard metric view (direct on source tables) | +| 3-5 | Materialized view with scheduled refresh | +| 6+ | Gold-layer aggregate table + metric view on top | + +### Serving Strategies + +#### Strategy 1: Standard Metric View + +For simple KPIs on small/medium tables with few joins. The metric view queries source tables directly at runtime. + +```sql +CREATE OR REPLACE VIEW ..sales_metrics +WITH METRICS LANGUAGE YAML AS $$ + version: 1.1 + source: ..sales_fact + measures: + - name: Total Sales + expr: SUM(total_amount) +$$; +``` + +#### Strategy 2: Materialized View with Incremental Refresh + +For complex KPIs or medium/large tables where pre-computing aggregations significantly reduces query time. Databricks automatically manages incremental refresh. 

```sql
CREATE OR REPLACE MATERIALIZED VIEW <catalog>.<schema>.monthly_sales_agg
AS
SELECT
  date_trunc('month', order_date) AS order_month,
  region,
  product_category,
  SUM(total_amount) AS total_sales,
  COUNT(1) AS order_count,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM <catalog>.<schema>.sales_fact
GROUP BY ALL;

-- Quartz cron: refresh daily at 2:00 AM UTC
ALTER MATERIALIZED VIEW <catalog>.<schema>.monthly_sales_agg
  ADD SCHEDULE CRON '0 0 2 * * ?' AT TIME ZONE 'UTC';
```

Then point the metric view at the materialized view:

```sql
CREATE OR REPLACE VIEW <catalog>.<schema>.sales_metrics
WITH METRICS LANGUAGE YAML AS $$
version: 1.1
source: <catalog>.<schema>.monthly_sales_agg
dimensions:
  - name: Order Month
    expr: order_month
  - name: Region
    expr: region
measures:
  - name: Total Sales
    expr: SUM(total_sales)
  - name: Avg Order Value
    expr: SUM(total_sales) / NULLIF(SUM(order_count), 0)
$$;
```

#### Strategy 3: Gold-Layer Aggregate Table

For very large tables (> 1 TB) or when the grain mismatch is severe (e.g., transaction-level fact table but report needs monthly aggregates). Build a dedicated gold-layer table with a pipeline for incremental loads.

```sql
CREATE TABLE <catalog>.<schema>.gold_monthly_sales (
  order_month DATE,
  region STRING,
  product_category STRING,
  total_sales DECIMAL(18,2),
  order_count BIGINT,
  unique_customers BIGINT,
  _etl_updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
CLUSTER BY (order_month, region)
-- DEFAULT column values require this Delta table feature
TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');
```

Use `spark-declarative-pipelines` to maintain incremental refresh, then build metric views on top of the gold table.
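The Gap 12 scoring rubric and strategy cutoffs can be sketched as a small helper. This is a minimal Python sketch; the function name, strategy labels, and numeric thresholds are illustrative restatements of the rubric, not part of this skill's bundled scripts:

```python
def serving_strategy(size_gb: float, join_count: int,
                     aggregation: str, same_grain: bool) -> tuple[int, str]:
    """Score a KPI per the Gap 12 rubric and map it to a serving strategy.

    size_gb: approximate source table size (sizeInBytes from DESCRIBE DETAIL)
    aggregation: "simple", "medium", or "high"
    same_grain: True when fact table grain matches report grain
    """
    # Rubric component scores
    size_score = 0 if size_gb < 100 else (2 if size_gb <= 1000 else 4)
    join_score = 0 if join_count <= 2 else (1 if join_count <= 5 else 2)
    agg_score = {"simple": 0, "medium": 1, "high": 2}[aggregation]
    grain_score = 0 if same_grain else 2

    score = size_score + join_score + agg_score + grain_score

    # Map total score to a serving strategy per the cutoff table
    if score <= 2:
        strategy = "standard metric view"
    elif score <= 5:
        strategy = "materialized view with scheduled refresh"
    else:
        strategy = "gold-layer aggregate table + metric view"
    return score, strategy
```

For example, a 2 TB fact table behind a 6-way join with window functions and a grain mismatch scores 4 + 2 + 2 + 2 = 10 and lands on the gold-layer strategy.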

### Grain Analysis

When the fact table grain is finer than the report grain, pre-aggregation is essential:

| Fact Table Grain | Report Grain | Action |
|------------------|--------------|--------|
| Transaction-level | Daily | Materialized view with `date_trunc('day', ...)` |
| Transaction-level | Monthly | Gold-layer aggregate or materialized view |
| Daily | Monthly | Materialized view with `date_trunc('month', ...)` |
| Same | Same | Standard metric view (no pre-aggregation) |

### Output

Produce `reference/query_optimization_plan.md`:

```markdown
## Query Optimization Plan

### Domain: Sales
| KPI | Score | Strategy | Rationale |
|-----|-------|----------|-----------|
| Total Sales | 2 | Standard metric view | Simple SUM, table < 50 GB |
| Sales YoY Growth | 5 | Materialized view | Window function over 2 years, 200 GB table |
| Customer Lifetime Value | 7 | Gold-layer aggregate | 5-table join, 1.2 TB fact table, transaction-to-monthly grain |

### Materialized Views to Create
1. `monthly_sales_agg` -- monthly pre-aggregation for time-series KPIs
   - Schedule: daily at 2:00 AM UTC
   - Source: sales_fact (200 GB)
   - Estimated refresh time: ~15 min

### Gold-Layer Tables to Create
1. `gold_customer_ltv` -- customer lifetime value aggregate
   - Pipeline: notebooks/customer_ltv_pipeline.py
   - Refresh: daily incremental via SDP
```

---

## Gap 13: Subagent Parallelization Patterns

Use `Task` subagents (`subagent_type="shell"`) to parallelize MCP tool calls when work spans multiple catalogs, schemas, or tables. Each subagent runs independently and returns its results. Run at most 4 subagents concurrently.

### Decision Guide: Subagents vs. `execute_sql_multi`

| Scenario | Best Tool | Why |
|----------|-----------|-----|
| N queries, same catalog | `execute_sql_multi` | Single MCP call, built-in parallelism (up to 4 workers), simpler |
| N queries, different catalogs | Parallel `shell` subagents | Each needs a different catalog context; `execute_sql_multi` takes only one |
| 1 query per catalog for probing | Parallel `shell` subagents | Each subagent probes one catalog independently |
| Sizing N tables in same catalog | `execute_sql_multi` | Combine N `DESCRIBE DETAIL` into one call |
| Sizing tables across catalogs | Parallel `shell` subagents | Different catalog contexts |

### Pattern A: Parallel Catalog Probing

When Step 3 identifies multiple candidate catalogs, launch one subagent per catalog to verify which contains the target tables.

**Subagent prompt template** (one per catalog):

```
Probe catalog "<catalog>" for tables in schema "<schema>" using the Databricks MCP server.

1. Call CallMcpTool with:
   - server: "user-databricks"
   - toolName: "execute_sql"
   - arguments: {"sql_query": "SELECT table_name FROM <catalog>.information_schema.tables WHERE table_schema = '<schema>' ORDER BY table_name"}

2. Return the list of table names found, or state that no tables were found.
```

Launch up to 4 of these concurrently:

```
Task(subagent_type="shell", prompt="